Bay Area Ray LLM Community Talks
Please join us for an evening of technical talks on Large Language Model (LLM) inference with continuous batching and on Aviary LLM Explorer, an open-source hosted service for exploring LLMs, including the latest Llama 2, Falcon, Vicuna, and MosaicML MPT model series.
We'll also share the latest feature updates to Aviary LLM Explorer and showcase Anyscale's open-source LLM endpoint offerings. Plus, a special performance by a magician, making it a magical evening all around!
Agenda
(Times are approximate and may vary slightly.)
6:00 - 6:30 pm: Networking, Snacks & Drinks
6:30 pm: Talk 1 (30-35 mins): 5 useful design patterns for production LLM Applications - Waleed Kadous, Anyscale
Q & A (10 mins)
Special Magician Performance!
7:15 pm: Talk 2 (30-35 mins): How continuous batching enables 23x throughput in LLM inference – Cade Daniel, Anyscale
Q & A (10 mins)
Talk 1: 5 useful design patterns for production LLM Applications
Abstract:
Over the last year at Anyscale, we've built a number of our own LLM applications and helped our customers build theirs. Through that process we've arrived at 5 design patterns, or principles, that we feel significantly increase the chances of success in building LLM applications. This talk discusses those 5 patterns and how to build them into your application. It will address:
1. What exactly an LLM application is.
2. How to design for easy evaluation and testability.
3. Making LLM components reusable across applications.
Bio:
Dr. Waleed Kadous is the Chief Data Scientist at Anyscale, the company behind the popular open source project Ray. Prior to Anyscale, Waleed worked at Uber, where he led overall system architecture, evangelized machine learning, and led the Location and Maps teams. He previously worked at Google, where he founded the Android Location and Sensing team, responsible for the “blue dot” as well as ML algorithms underlying products like Google Fit. He also holds more than 40 patents.
Talk 2: How continuous batching enables 23x throughput in LLM inference
Abstract:
Due to the large GPU memory footprint and compute cost of LLMs, serving dominates the compute cost for most real-world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs generate their output iteratively, and because LLM inference is often memory-bound rather than compute-bound, there are surprising system-level batching optimizations that make 10x or more differences in real-world workloads.
One such recently proposed optimization is continuous batching. In this talk we'll discuss what it is, how it works, and how it enables a 23x improvement in throughput over naive HuggingFace transformers on a production workload (3x over the previous SOTA).
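As a rough illustration of the idea (a toy sketch, not code from the talk or from any real serving system): with static batching, the whole batch waits for its slowest sequence before new requests can start, while with continuous batching a finished sequence's slot is refilled immediately, so the GPU's batch slots stay full.

```python
# Toy comparison of static vs. continuous batching for iterative LLM decoding.
# Illustrative only: real systems (e.g. vLLM) schedule at the level of model
# forward passes and manage KV-cache memory; here a "step" is just one decode
# iteration and each request needs a random number of steps to finish.
import random

random.seed(0)
requests = [random.randint(5, 50) for _ in range(64)]  # decode steps per request
BATCH_SIZE = 8

def static_batching(lengths):
    """Each batch of 8 runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), BATCH_SIZE):
        steps += max(lengths[i:i + BATCH_SIZE])
    return steps

def continuous_batching(lengths):
    """A freed slot is immediately refilled with the next waiting request."""
    pending = list(lengths)
    active = [pending.pop() for _ in range(min(BATCH_SIZE, len(pending)))]
    steps = 0
    while active:
        steps += 1
        active = [t - 1 for t in active if t > 1]  # advance one decode step; drop finished
        while pending and len(active) < BATCH_SIZE:
            active.append(pending.pop())           # refill the freed slots
    return steps

print("static batching decode steps:    ", static_batching(requests))
print("continuous batching decode steps:", continuous_batching(requests))
```

In this toy model, the static scheduler pays for the padding of short sequences in every batch, while the continuous scheduler completes the same requests in roughly the total work divided by the number of slots, which is where the throughput gain comes from.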
Bio:
Cade Daniel is a software engineer at Anyscale working on Ray and LLMs. Previously, he helped build the communication engine for training large language models using AWS SageMaker's model parallelism library. Outside of work, he enjoys sipping a good latte while liking hot takes on ML/AI Twitter.