Inference & GPU Optimization: AWQ
Large Language Models (LLMs) keep getting bigger, and their performance across many tasks and problems keeps improving. We expect this to continue.
At the same time, smaller models keep shrinking while their performance improves on specific tasks and problems. Expect more of this in the future, too.
When weighing performance against cost for LLMs and Small Language Models (SLMs), we must always consider compute requirements for two reasons:
Fine-Tuning & Alignment: Any fine-tuning and alignment work that must be done to commercial off-the-shelf or open-source LLMs
Inference: Right-sizing your model for the performance, task, and scale of the application
During inference, we want the best possible output. Optimizing inference means producing that output with the minimum possible compute.
One way to accomplish this is to leverage techniques like quantization, which compresses our LLM while maintaining high-quality output. Specialized quantization techniques have been developed, such as Activation-aware Weight Quantization, or AWQ. According to the authors of the original paper, AWQ is:
a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important
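That insight can be sketched in plain Python: with round-to-nearest (RTN) quantization, the error on a weight channel that multiplies a large activation is amplified in the output. AWQ's fix is to scale that salient weight channel up before quantizing and scale its activation down to compensate, which is mathematically equivalent but uses the quantization grid more finely where it matters. The toy numbers below are illustrative and not from the paper:

```python
# Toy illustration of the AWQ insight: protecting the few "salient"
# weight channels (those paired with large activations) matters far
# more than protecting the rest. All values are made up.

def quantize(ws, bits=3):
    """Symmetric round-to-nearest (RTN) quantization over one weight group."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in ws) / qmax
    return [round(w / step) * step for w in ws]

def output(ws, xs):
    """One output of a linear layer: y = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(ws, xs))

w = [0.5, 0.9, -0.7, 0.3]    # one group of weights
x = [10.0, 0.1, 0.1, 0.1]    # channel 0 has a large (salient) activation

y_ref = output(w, x)         # full-precision reference output

# Plain RTN: channel 0's quantization error is amplified by its activation.
y_rtn = output(quantize(w), x)

# Activation-aware: scale the salient weight up before quantizing, and
# scale its activation down by the same factor so y is mathematically
# unchanged. The salient weight now lands much closer to a grid point.
s = 2.0
w_scaled = [w[0] * s] + w[1:]
x_scaled = [x[0] / s] + x[1:]
y_awq = output(quantize(w_scaled), x_scaled)

err_rtn = abs(y_ref - y_rtn)
err_awq = abs(y_ref - y_awq)
print(f"RTN error: {err_rtn:.4f}, activation-aware error: {err_awq:.4f}")
```

Note that the real AWQ method searches for per-channel scales using activation statistics rather than picking a single factor by hand; this sketch only shows why such scaling helps.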
In this event, we’ll introduce inference optimization and investigate this insight, including the results and conclusions from the paper. We’ll also demonstrate the code behind using AWQ via Transformers.
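As a preview of that demo, here is a minimal sketch of running an AWQ-quantized checkpoint through the Hugging Face Transformers API. The model ID below is one of many community AWQ checkpoints on the Hub and is only an example (swap in your own); the snippet assumes a CUDA GPU and that the `autoawq` package is installed alongside `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example community AWQ checkpoint on the Hugging Face Hub (assumption:
# replace with whichever AWQ-quantized model you want to serve).
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Transformers detects the AWQ config in the checkpoint and loads the
# 4-bit quantized weights; device_map="auto" places them on the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is activation-aware quantization?",
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the quantized weights are roughly a quarter the size of their FP16 counterparts, the same model fits on much smaller GPUs and moves less data per token, which is where the inference speedup comes from.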
📚 You’ll learn:
How quantization and AWQ help to speed up inference and reduce latency!
What it means for quantization to be “Activation-aware!”
🤓 Who should attend the event:
Aspiring AI Engineers who want to optimize inference for LLM applications
AI Engineering leaders who want to serve LLM applications at scale in production
Speakers:
“Dr. Greg” Loughnane is the Co-Founder & CEO of AI Makerspace, where he is an instructor for The AI Engineering Bootcamp. Since 2021 he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
Chris “The Wiz” Alexiuk is the Co-Founder & CTO at AI Makerspace, where he is an instructor for The AI Engineering Bootcamp. During the day, he is also a Developer Advocate at NVIDIA. Previously, he was a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.
Follow AI Makerspace on LinkedIn and YouTube to stay updated about workshops, new courses, and corporate training opportunities.