Multimodal Data with Modern Tools - #SFTechWeek
Modern problems = Modern solutions!
Modern AI/ML workloads are no longer about simple tabular features or clickstream data. They demand data infrastructure that can handle the complexity of multimodal data: messy unstructured text, documents, images, and even video.
Meet the team behind the Daft project and a panel of experts at the forefront of these new technologies that make up the next generation of scalable AI/ML systems. We will be exploring new and exciting work from storage to data curation, large model training, and evaluation.
Thank you to our sponsor CRV for providing food & drinks for this meetup!
This event is a part of #SFTechWeek - a week of events hosted by VCs and startups to bring together the tech ecosystem.
Agenda
5:00p - 5:45p: Doors Open & Networking 💃
5:45p - 6:50p: Welcome Remarks & Presentations!
6:50p - 8:30p: More Networking 🕺
About Daft
Daft is an open source framework that powers ETL, analytics, and ML/AI at scale. Its familiar DataFrame API is built to outperform Spark in both speed and ease of use.
💬 Join Distributed Data Community Slack
📚 Check out Daft Engineering Blog
📲 Follow Daft on LinkedIn & Twitter
🖥️ Subscribe to Daft YouTube
💜 We’re hiring, join our team
Presentations
🌟 Distributed Data Tools Should Be Easy To Use: Entreaties From an End-User
The Twelve Labs team has been hard at work scaling multimodal AI application inference, and will discuss their recent work prototyping Ray Serve as a foundation for such workloads! Ray is a great tool, but building applications on Ray Serve exposed a number of challenges that could have been avoided with a more streamlined local development experience. The Twelve Labs team will share tips and tricks for developing distributed AI applications locally, including how to overcome scale-down limitations in "exascale first" platforms (Ray, Spark, etc.). Data product folks, please take note :)
Stu (Michael) Stewart is the Head of ML/AI Engineering at Twelve Labs, where he and his team are building foundation models for video search and video-guided language generation. Previously, he worked on autonomous vehicles at Cruise, and on home pricing algorithms at Opendoor. Twelve Labs is hiring! https://www.twelvelabs.io/careers
Paul George is a Senior Staff ML Engineer at Twelve Labs working on data and inference infrastructure. Previously, he worked on data infrastructure and pricing algorithms at Perpetua Labs and, before that, at Opendoor.
🌟 Why Multimodal Data Requires New Tools
Existing query engines like Spark and Trino have historically excelled at processing analytical data at scale. However, they are a poor choice for handling the complexities of unstructured or multimodal data.
In this talk, we will uncover some of the challenges these engines face when processing multimodal data and introduce how we designed Daft to solve many of these problems in a distributed fashion. Dive into the internals of Daft and its architecture, discover why we chose Rust to power our fast and distributed Python query engine, and unlock new workloads and possibilities for multimodal by leveraging Daft.
Sammy Sidhu is the co-founder and CEO of Eventual, the company behind Daft. Sammy's background is in High Performance Computing (HPC) and Deep Learning, and he has over a dozen patents/publications in the space. Prior to Eventual, Sammy worked in autonomous driving for 6 years and sold a startup to Tesla Autopilot in the process.
🌟 A New Open Source Foundation for AI Data
The current data stack is built on top of foundations laid down a decade ago for tabular data. But AI datasets are much more complex and workloads are much more diverse. Enterprises scaling AI in production often find data management prohibitively expensive and overly complicated. The Lance columnar format is an open-source project designed to provide the new data foundations for AI. It delivers much better performance and scalability for AI datasets and makes them natively searchable using vector or full-text queries. In this talk we’ll dive into the main challenges that AI data poses, how the Lance format works, and the value it delivers to AI teams training models or putting applications into production.
Chang She is the CEO and cofounder of LanceDB, the developer-friendly, open-source database for multi-modal AI. A serial entrepreneur, Chang has been building DS/ML tooling for nearly two decades and is one of the original contributors to the pandas library. Prior to founding LanceDB, Chang was VP of Engineering at TubiTV, where he focused on personalized recommendations and ML experimentation.
🌟 Unity Catalog: the Universal Catalog for Data + AI
Come and discover how Unity Catalog provides a unified solution to manage your unstructured data, like images and documents, and GenAI tools with a single, universal catalog for data + AI. Unity Catalog is multimodal and provides interoperability across lakehouse formats like Delta and Iceberg, and various compute engines. It comes with built-in governance and security, including strong authentication, secure credential vending, and asset-level access control, to protect your data and AI assets. In this talk, we’ll present an overview of Unity Catalog and showcase its multimodal capabilities.
Ramesh Chandra is a Principal Engineer at Databricks, building Unity Catalog and Governance. His background is in distributed systems, storage, and security. Previously, he was a tech lead for the Cloud AI platform and Cloud Identity teams at Google, and built distributed storage systems at Nutanix.