
Inference & GPU Optimization: VPTQ

About Event

Large Language Models (LLMs) keep getting bigger, and their performance across many tasks and problems keeps improving. We expect this to continue.

At the same time, models are getting smaller, and their performance on specific tasks and problems keeps improving. Expect more of this in the future, too.

The performance vs. cost tradeoff when it comes to LLMs and Small Language Models (SLMs) must always consider computing requirements for two reasons:

  1. Fine-Tuning & Alignment: Any fine-tuning and alignment work that must be done to commercial off-the-shelf or open-source LLMs

  2. Inference: Right-sizing your model for the performance, task, and scale of the application

During inference, we must ensure the best possible output.

Optimizing inference means achieving that output with the minimum possible compute.

One way to accomplish this is to leverage techniques like quantization, which compresses our LLM while maintaining a high-quality output.

Specialized versions of quantization have been developed, and we cover these techniques in this series!

In Part I we covered Activation-aware Weight Quantization (AWQ). In Part II we covered Generative Pretrained Transformer Quantization (GPTQ). Now, in Part III, we look at Vector Post-Training Quantization (VPTQ).

In this event, we’ll follow the path from 8-bit Blockwise quantization (bitsandbytes), to 4-bit quantization (GPTQ), to Activation-aware Weight Quantization (AWQ), to now extremely low-bit quantization (<2-bit) with VPTQ.
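As a refresher on where that path begins, here is a minimal sketch of 8-bit blockwise quantization using Hugging Face transformers with bitsandbytes. The model ID is an illustrative placeholder; any causal LM on the Hub works the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit blockwise quantization via bitsandbytes: the starting point of the series.
model_id = "facebook/opt-350m"  # illustrative choice, not specific to this event
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # bitsandbytes kernels require a CUDA GPU
)
```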

VPTQ is capable of quantizing Meta Llama 3.1 405B to under 2 bits, and the 70B version to 2 bits, while maintaining high accuracy across benchmarks.
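To see why sub-2-bit is even arithmetically possible, consider a rough bit-budget sketch. The vector length and codebook size below are illustrative assumptions, not VPTQ's exact configuration:

```python
import math

# Bit budget for vector quantization: each group of weights is replaced
# by a single index into a shared codebook.
vector_dim = 8        # weights grouped into length-8 vectors (assumed)
codebook_size = 4096  # 2**12 centroids in the lookup table (assumed)

index_bits = math.log2(codebook_size)      # 12 bits to store one index
bits_per_weight = index_bits / vector_dim  # 12 / 8 = 1.5 bits per weight
print(bits_per_weight)  # 1.5 -- below 2 bits, ignoring codebook overhead
```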

How?

As the name suggests, VPTQ leverages vectors rather than quantizing individual scalar weights. It turns out that extremely low-bit quantization can be achieved by grouping weights into short vectors and compressing each vector into an index into a lookup table (codebook).
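Here is a minimal, self-contained sketch of that idea, with a toy random codebook standing in for a learned one. VPTQ's real codebooks are carefully optimized, but the index-plus-lookup mechanics are the same:

```python
import numpy as np

# Toy vector quantization with a lookup table (codebook). Illustrative only;
# VPTQ learns its codebooks with second-order information rather than
# sampling them at random.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 8)).astype(np.float32)  # length-8 vectors

# 256 centroids -> each index fits in 8 bits, so 8 bits / 8 weights
# = 1 bit per weight (plus the shared codebook's overhead).
codebook = rng.standard_normal((256, 8)).astype(np.float32)

# Quantize: replace each weight vector with the index of its nearest centroid.
dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1).astype(np.uint8)

# Dequantize at inference time: a simple table lookup.
reconstructed = codebook[indices]
print(indices.shape, reconstructed.shape)  # (1024,) (1024, 8)
```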

The team from Microsoft also used “Second-Order Optimization” to formulate the Vector Quantization (VQ) problem and guide the quantization. The weights were further refined using “Channel-Independent Second-Order Optimization.” This decomposition of the optimization problem into distinct steps also led the authors to propose a “codebook initialization algorithm.”
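As a loose illustration of how second-order information can guide codebook construction, here is a toy importance-weighted k-means initialization. This is our own sketch under simplifying assumptions, not the authors' algorithm:

```python
import numpy as np

def init_codebook(weight_vectors, hessian_diag, k, iters=10, seed=0):
    """Toy curvature-weighted k-means codebook initialization.

    weight_vectors: (n, v) weights grouped into length-v vectors
    hessian_diag:   (n,) importance per vector, e.g. from a diagonal
                    Hessian approximation (assumed given)
    """
    rng = np.random.default_rng(seed)
    centroids = weight_vectors[rng.choice(len(weight_vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = ((weight_vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Update each centroid as the importance-weighted mean of its cluster,
        # so high-curvature (sensitive) weights pull centroids toward them.
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = hessian_diag[mask][:, None]
                centroids[j] = (w * weight_vectors[mask]).sum(0) / w.sum()
    return centroids
```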

The paper compares VPTQ to leading methods like AWQ and GPTQ, and claims to “balance all dimensions and achieve SOTA.”

We’ll break down what all of these words mean, and give VPTQ a test drive against the other leading methods we’ve looked at.

Join us live to cover concepts and code in Part III!

📚 You’ll learn:

  • How Vector Quantization (VQ) and the VPTQ method use extremely low-bit representations

  • How VPTQ compares to AWQ and GPTQ

  • The key dimensions needed to compare LLM Quantization Algorithms

🤓 Who should attend the event:

  • Aspiring AI Engineers who want to optimize inference for LLM applications

  • AI Engineering leaders who want to serve LLM applications at scale in production

Speakers:

  • “Dr. Greg” Loughnane is the Co-Founder & CEO of AI Makerspace, where he is an instructor for The AI Engineering Bootcamp and LLM Engineering: The Foundations. Since 2021 he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.

  • Chris “The Wiz” Alexiuk is the Co-Founder & CTO at AI Makerspace, where he is an instructor for The AI Engineering Bootcamp and LLM Engineering: The Foundations. During the day, he is also a Developer Advocate at NVIDIA. Previously, he was a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.

Follow AI Makerspace on LinkedIn and YouTube to stay updated about workshops, new courses, and corporate training opportunities.
