AI/ML Infra Meetup at Uber
In-person registration is closed. Please register for the live stream here: https://us06web.zoom.us/webinar/register/WN_eZUwTsgGSZuCZ9WGqMoczQ#/
Join leading AI/ML infrastructure experts for the AI/ML Infra Meetup hosted by Alluxio and Uber. This is a premier opportunity to engage and discuss the latest in ML pipelines, AI/ML infrastructure, LLMs, RAG, GNNs, GPUs, and more.
This meetup will be in person at Uber Sunnyvale and live-streamed. Experts from Uber Engineering, NVIDIA, Alluxio & UChicago will give talks and share insights and real-world examples about optimizing data pipelines, accelerating model training and serving, designing scalable architectures, and more.
Immerse yourself in learning, networking, and conversation, and enjoy the mix-and-mingle happy hour at the end. Food and drinks are on us!
Agenda (in PDT)
Doors open at 4:00 pm
4:00 pm: Check-in, say hi, and enjoy snacks & soft drinks ✍️👋
5:00 pm: ML Explainability in Michelangelo, Uber’s ML Platform (Eric Wang, Senior Staff Software Engineer @ Uber) 🚕📈
5:20 pm: Improve Speed and GPU Utilization for Model Training & Serving (Lu Qiu & Siyuan Sheng, Tech Leads @ Alluxio) 🤖☁️
5:40 pm: Reducing Prefill for LLM Serving in RAG (Junchen Jiang, Assistant Professor of Computer Science @ University of Chicago) 🎓🏫
6:00 pm: My Perspective on Deep Learning Framework (Triston Cao, Senior Deep Learning Software Engineering Manager @ NVIDIA) 💬🔥
6:30-7:30 pm: Networking Happy Hour, including food and drinks 👯🍻
Please mark yourself "not going" if you have changed your mind and will not attend.
Due to capacity constraints, we will only be accepting the first 100 attendees on-site.
Looking forward to seeing you!
Join the Community
🧑‍💻 Join Alluxio Community Slack
🐦 Follow Alluxio on LinkedIn & Twitter / X
📺 Subscribe to the Alluxio YouTube Channel
📖 Download the PyTorch Tuning Guide
Presentation Details
Talk 1: ML Explainability in Michelangelo, Uber’s ML Platform
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber uses to explain deep learning models and how his team integrated these methods into the Uber AI Michelangelo ecosystem to support offline explainability.
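For readers new to model explainability, here is a minimal sketch of one common attribution method, Integrated Gradients via the Captum library. This is an illustration of the general technique, not necessarily the method Uber uses, and the toy model and features are placeholders:

```python
# Illustrative only: Integrated Gradients with Captum on a toy model.
# The model and inputs are placeholders, not Uber's Michelangelo setup.
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1)
)
model.eval()

x = torch.randn(1, 8)           # one example with 8 features
baseline = torch.zeros_like(x)  # "no signal" reference input

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    x, baselines=baseline, return_convergence_delta=True
)
# Each attribution scores how much a feature pushed the prediction
# away from the baseline output.
print(attributions, delta)
```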
About the Speaker: Eric Wang has been a software engineer on Uber's Michelangelo team since 2020, focused on maintaining high ML quality across all models and pipelines. Prior to this, he worked on Uber's Marketplace Fares team from 2018 to 2020, developing fare systems for various services. Before joining Uber, he worked in Australia, building a strong foundation in software engineering at companies including eBay, Qantas, and Equifax.
Talk 2: Improve Speed and GPU Utilization for Model Training & Serving
Speed and efficiency are two core requirements of the infrastructure that underlies machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volumes grow and as large model files are increasingly loaded for serving. For instance, data loading can constitute nearly 80% of total model training time, leaving GPU utilization below 30%. Likewise, loading large model files for deployment to production can be slow due to network or storage read bottlenecks. These challenges are prevalent when popular frameworks like PyTorch, Ray, or HuggingFace are paired with cloud object storage such as S3 or GCS, or when downloading models from the HuggingFace model hub. (A short timing sketch after the outline below illustrates the data-loading stall.)
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
The data loading challenges hindering GPU utilization
The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
Real-world examples of boosting model performance and GPU utilization through optimized data access
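As a concrete illustration of the data-loading stall mentioned above, here is a minimal sketch, assuming a synthetic dataset and a toy model (placeholders, not taken from the talk), of how one might measure the share of each training step spent waiting on data in PyTorch:

```python
# Minimal sketch (not from the talk): measure time spent waiting on the
# DataLoader vs. time spent in compute during a training loop.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(5_000, 128), torch.randint(0, 10, (5_000,)))
loader = DataLoader(data, batch_size=64, num_workers=2)
model = torch.nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

data_time = compute_time = 0.0
t0 = time.perf_counter()
for x, y in loader:  # time spent blocked here is the data-loading stall
    t1 = time.perf_counter()
    data_time += t1 - t0
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    t0 = time.perf_counter()
    compute_time += t0 - t1

total = data_time + compute_time
print(f"data loading: {100 * data_time / total:.1f}% of wall time")
```

In a real pipeline reading from remote object storage, a high data-loading share like this is what leaves GPUs idle; the talk covers architectures for closing that gap.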
Talk 3: Reducing Prefill for LLM Serving in RAG
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill's impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort toward drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV caches of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process for reused KV caches, we can still significantly reduce prefill delay while maintaining the same generation quality.
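To make KV-cache reuse concrete, here is a minimal sketch using the HuggingFace transformers API with "gpt2" as a stand-in model; it illustrates the general idea, not the speaker's system. The shared chunk is prefilled once and its cache kept (or persisted), so later requests prefill only their own new tokens:

```python
# Minimal sketch of KV-cache reuse (illustrative; not the speaker's system).
# "gpt2" is a stand-in; a real RAG stack would use a larger model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Prefill the shared, retrieved chunk once. In the approach described above,
# this cache could be persisted to cheaper, slower storage and reloaded.
chunk = "Retrieved document text shared across many queries."
chunk_ids = tok(chunk, return_tensors="pt").input_ids
with torch.no_grad():
    kv = model(chunk_ids, use_cache=True).past_key_values

# A new request prefills only its own tokens, reusing the chunk's cache.
q_ids = tok(" Question: what does the document say?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(q_ids, past_key_values=kv, use_cache=True)
next_token_id = out.logits[:, -1].argmax(dim=-1)
print(tok.decode(next_token_id))
```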
About the Speaker: Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. His research interests are networked systems and their intersections with machine learning. He has received a Google Faculty Research Award, an NSF CAREER Award, and a CMU Computer Science Doctoral Dissertation Award. https://people.cs.uchicago.edu/~junchenj/
Talk 4: My Perspective on Deep Learning Framework
From Caffe to MXNet to PyTorch and beyond, Xiande (Triston) Cao, Senior Deep Learning Software Engineering Manager at NVIDIA, will share his perspective on the evolution of deep learning frameworks.
About the Speaker: Dr. Xiande (Triston) Cao is a Senior Deep Learning Software Engineering Manager at NVIDIA. He collaborates with the open-source community on deep learning and graph neural networks, leveraging the NVIDIA software stack, GPUs, and AI systems to enhance the capabilities of AI. He received his PhD in Electrical Engineering from the University of Kentucky.
If you have any questions, please contact Hope Wang at hope.wang@alluxio.com.