vLLM: Virtual LLM
vLLM’s goal is to build the fastest and easiest-to-use open-source LLM inference & serving engine.
Recently, we’ve learned about inference optimization techniques like GPTQ, AWQ, and VPTQ.
Whereas quantization techniques focus on reducing the precision of model weights and activations to lower memory usage and computation costs, vLLM focuses on memory management and throughput optimization without altering the underlying model weights or computations.
We’ve also covered how to speed up attention calculations with FA2.
Whereas FA and FA2 speed up the attention computation itself, which scales quadratically with sequence length in standard transformers, by minimizing reads and writes to GPU memory, vLLM optimizes how memory is allocated and managed during these computations.
After learning these constituent pieces, we can start to see a clearer picture of the complete package that vLLM is building.
It’s important to think about vLLM as a complete package for efficient inference AND serving.
Let’s use an analogy to try to understand vLLM:
Imagine you're organizing a library with books that represent the data and operations of a large language model.
The attention mechanism is like a librarian who helps you find the most relevant books and sections for your research.
Quantization is like replacing heavy hardcover books with condensed paperback versions.
vLLM is like a smart library system that reorganizes bookshelves, reading areas, and check-out counters to maximize the space and efficiency of the library.
You don't waste space by keeping empty shelves (memory fragmentation).
Readers can access multiple books at once (high throughput).
It dynamically adjusts the layout based on the number of visitors (varying sequence lengths or batch sizes). A toy sketch of this bookkeeping idea follows below.
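To make the library analogy concrete, here is a toy Python sketch of the bookkeeping idea behind PagedAttention: the KV cache is split into fixed-size blocks, and each request keeps a small block table mapping its logical blocks to physical blocks in a shared pool. This is an illustrative sketch only; the class name, block size, and data structures are invented for the example and are not vLLM’s actual implementation.

```python
# Toy sketch of the PagedAttention bookkeeping idea -- NOT vLLM's real code.
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class ToyKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))  # shared pool of "shelves"
        self.block_tables: dict[str, list[int]] = {}         # per-request logical -> physical map
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        n = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n % BLOCK_SIZE == 0:                    # last block is full (or request is new)
            table.append(self.free_blocks.pop())   # grab exactly one block, on demand
        self.num_tokens[request_id] = n + 1

    def finish(self, request_id: str) -> None:
        # A finished request returns its blocks to the pool immediately,
        # so other requests can reuse them (no empty shelves sitting idle).
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

manager = ToyKVCacheManager(num_physical_blocks=64)
for _ in range(40):                    # a long generation...
    manager.append_token("request-A")
for _ in range(5):                     # ...and a short one share the same pool
    manager.append_token("request-B")
print(manager.block_tables)            # {'request-A': [63, 62, 61], 'request-B': [60]}
```

Because blocks are allocated on demand and returned as soon as a request finishes, memory is never pre-reserved for the longest possible sequence, which is what lets vLLM pack many more concurrent requests into the same GPU memory.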
In this session, we’ll dig into the details of vLLM to paint the whole picture, including a review of the original paper and the vLLM blog post.
📚 You’ll learn:
The intuition behind PagedAttention and how vLLM integrates with FlashAttention
How vLLM leverages quantization techniques like GPTQ, AWQ, and more to optimize inference
How to implement vLLM and when to consider picking it up for your next project (see the minimal usage sketch after this list)
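As a preview of that last point, here is a minimal offline-inference sketch using vLLM’s Python API. The model name, prompts, and sampling values are placeholders chosen for the example; check the vLLM docs for the options that fit your hardware, including the `quantization` argument for AWQ or GPTQ checkpoints.

```python
# Minimal vLLM offline-inference sketch (model name and sampling values are examples).
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are handled under the hood; for a
# quantized checkpoint you would also pass something like quantization="awq".
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]

# Prompts are batched together automatically; results come back per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

For serving, vLLM also ships an OpenAI-compatible HTTP server (for example, `python -m vllm.entrypoints.openai.api_server --model <model>`, or `vllm serve <model>` in recent versions), so the same engine can back production endpoints as well as offline batch jobs.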
🤓 Who should attend the event:
Aspiring AI Engineers who want to get the big picture of the entire vLLM library and its capabilities
AI Engineering leaders who want to build production LLM applications, whether in the cloud or on-prem
Speakers
“Dr. Greg” Loughnane is the Co-Founder & CEO of AI Makerspace, where he teaches thousands of students monthly on YouTube and instructs The AI Engineering Bootcamp and LLM Engineering: The Foundations. Since 2021 he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
Chris “The Wiz” Alexiuk is the Co-Founder & CTO at AI Makerspace, where he teaches thousands of students monthly on YouTube and instructs The AI Engineering Bootcamp and LLM Engineering: The Foundations. During the day, he is also a Developer Advocate at NVIDIA. Previously, he was a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.
Follow AI Makerspace on LinkedIn and YouTube to stay updated about workshops, new courses, and corporate training opportunities.