Efficient LLM Inference and Serving with vLLM
LLMs are large. That size makes them hard to load and fine-tune, and therefore hard to serve and run inference on. We’ve talked recently about quantization and LoRA/QLoRA, techniques that help us load and fine-tune models, and here we’ll break down vLLM, an up-and-coming open-source LLM inference and serving engine!
In this event, we’ll kick off by breaking down the basic intuition of both inference and serving before we dive into the details of vLLM, the library that promises easy, fast, and cheap LLM serving. Then, we’ll dig into the “secret sauce” of vLLM, the PagedAttention algorithm, including its inspiration, its intuition, and the details of how it efficiently manages the key and value tensors (the “KV cache”) during scaled dot-product self-attention calculations as input sequences flow through transformer attention blocks. After we go high-level, we’ll cover the system components of vLLM that allow it to process requests and generate outputs, and of course, we'll dive into a live demo with code! Finally, we’ll break down what we expect from vLLM next and where we think it should fit into your AI Engineering workflow in 2024!
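To build some intuition ahead of time, here is a toy Python sketch of the core PagedAttention idea. It is not vLLM’s actual implementation, and every name in it (BLOCK_SIZE, BlockAllocator, SequenceCache) is hypothetical. The idea: instead of reserving one long contiguous KV-cache buffer per request, the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks allocated on demand:

```python
# Toy illustration of PagedAttention-style KV-cache paging.
# Real vLLM manages GPU tensors; this sketch only tracks block indices.

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


class SequenceCache:
    """Per-sequence block table: logical token position -> physical block."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one fills up,
        # so waste is bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Finished requests return their blocks to the shared pool.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()


# Two sequences share one physical pool; their blocks need not be contiguous.
pool = BlockAllocator(num_blocks=64)
seq_a, seq_b = SequenceCache(pool), SequenceCache(pool)
for _ in range(40):
    seq_a.append_token()  # 40 tokens -> ceil(40 / 16) = 3 blocks
seq_b.append_token()      # 1 token   -> 1 block
print(seq_a.block_table, seq_b.block_table)
seq_a.release()           # freed blocks are immediately reusable
```

Because blocks are allocated on demand and need not be contiguous, per-sequence memory waste is bounded by a single partially filled block, which is what lets vLLM pack many more concurrent requests into the same GPU memory.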
Join us live to learn:
The basics of serving and inference of Large Language Models (a quick-start code sketch follows this list)
How vLLM relieves memory bottlenecks that limit serving performance at scale
The PagedAttention algorithm, and how vLLM’s throughput compares to popular Hugging Face libraries like Transformers and Text Generation Inference
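If you want to get your hands dirty before the live demo, here is a minimal sketch of vLLM’s offline batched-inference API. The prompts and the model name are just examples; any Hugging Face-compatible checkpoint should work:

```python
# Minimal vLLM offline inference sketch; model name is an example only.
from vllm import LLM, SamplingParams

prompts = [
    "Explain the KV cache in one sentence.",
    "Why is LLM serving often memory-bound?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Loads the model and allocates the paged KV cache on your GPU.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

# Generates completions for the whole batch of prompts at once.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

Under the hood, vLLM batches requests like these and pages their KV caches with PagedAttention, which is where the serving throughput gains come from.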
Speakers
Dr. Greg Loughnane is the Co-Founder & CEO of AI Makerspace, where he is an instructor for their LLM Ops: LLMs in Production course. Since 2021, he has built and led industry-leading Machine Learning & AI boot camp programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
Chris Alexiuk is the Co-Founder & CTO at AI Makerspace, where he is an instructor for their LLM Ops: LLMs in Production course. Previously, he worked as both a Founding MLE and a Data Scientist. As an experienced online instructor, curriculum developer, and YouTube creator, he’s always learning, building, shipping, and sharing his work! He loves Dungeons & Dragons and is based in Toronto, Canada.
Follow AI Makerspace on LinkedIn & YouTube to stay updated with workshops, new courses, and opportunities for corporate training.