Cover Image for GPU Optimization Workshop
Avatar for Chip Huyen's events
Here are some of the events I'll be speaking at/hosting in the future. Come say hi!
2,836 Going

GPU Optimization Workshop

This event is sold out and no longer taking registrations.
About Event

We’re hosting a workshop on GPU optimization with stellar speakers from OpenAI, NVIDIA, Meta, and Voltron Data.

The event will be livestreamed on YouTube, and discussions will take place on Discord. Please see the workshop README for reading materials and more information.

Shared workshop note.

Our Zoom allows up to 100 people on the call. We made this option cost $1 to ensure that those who join are serious about learning.

[12:00] Crash course on GPU optimization (Mark Saroufim @ Meta)

Mark is a PyTorch core developer and cofounder of CUDA Mode. He also ran the really fun NeurIPS LLM Efficiency challenge last year. Previously, he was at Graphcore and Microsoft.

Mark will give an overview of GPUs and how their evolution motivated the creation of torch.compile. This talk covers the basics we need for the rest of the workshop, focusing on the optimizations that matter: fusions, tensor cores, overhead reduction, quantization, and custom kernels.
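To see why fusion is one of the "optimizations that matter," consider the memory traffic of elementwise ops. The pure-Python sketch below is an illustration of the idea only, not torch.compile itself (all function names are made up for the sketch): unfused ops each make a full read-and-write pass over the data, while a fused kernel does the same math in one pass.

```python
# Illustration: count array-sized memory passes to show why fusing
# elementwise ops (as a compiler like torch.compile can) cuts traffic.

def unfused(x):
    """y = relu(x * 2 + 1) as three separate 'kernels'."""
    passes = 0
    t1 = [v * 2 for v in x];       passes += 2  # read x, write t1
    t2 = [v + 1 for v in t1];      passes += 2  # read t1, write t2
    y = [max(v, 0.0) for v in t2]; passes += 2  # read t2, write y
    return y, passes

def fused(x):
    """Same math in one fused 'kernel': one read, one write."""
    y = [max(v * 2 + 1, 0.0) for v in x]
    return y, 2

y1, p1 = unfused([-1.0, 0.5, 3.0])
y2, p2 = fused([-1.0, 0.5, 3.0])
assert y1 == y2   # identical results, 6 vs 2 memory passes
```

On a GPU, where elementwise kernels are memory-bound, that 3x reduction in traffic translates directly into speedup.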

[12:45] High-performance LLM serving on GPUs (Sharan Chetlur @ NVIDIA)

Sharan is a principal engineer working on TensorRT-LLM at NVIDIA. He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU to a full data center scale. Previously, he was the Director of Engineering at Cerebras.

Sharan will discuss how to build performant, flexible solutions for optimizing LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different batching strategies, and caching.
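One batching strategy worth knowing going in is continuous (in-flight) batching: finished sequences leave the batch and waiting requests join mid-stream, instead of the whole batch draining before new work starts. The toy scheduler below (hypothetical names, not the TensorRT-LLM API) counts decode steps, where each step generates one token per running request:

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Each batch runs until its LONGEST request finishes; short ones idle."""
    steps, queue = 0, deque(lengths)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)
    return steps

def continuous_batching(lengths, batch_size):
    """Finished requests are replaced by waiting ones at every step."""
    steps, queue, running = 0, deque(lengths), []
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

# One long request (4 tokens) and three short ones, batch size 2:
# static batching wastes slots idling behind the long request.
print(static_batching([4, 1, 1, 1], 2), continuous_batching([4, 1, 1, 1], 2))
```

The gap grows with more skewed output lengths, which is why production serving stacks favor continuous batching.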

[13:20] Block-based GPU Programming with Triton (Philippe Tillet @ OpenAI)

Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers including NVIDIA, AMD, Intel, and Nervana.

Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and Torch-Inductor.
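The block-based model can be sketched in plain Python (this is not Triton code, just an illustration of the programming model): each "program instance" handles one BLOCK_SIZE tile of the data, with a mask guarding the ragged tail, much as a Triton kernel computes block offsets and passes `mask=` to its loads and stores.

```python
BLOCK_SIZE = 4  # tile width; real Triton kernels pick powers of two

def add_kernel(x, y, out, pid):
    """One 'program instance': handles the pid-th block of elements."""
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    for off in offsets:
        if off < len(x):            # mask: skip out-of-bounds lanes
            out[off] = x[off] + y[off]

def vector_add(x, y):
    out = [0.0] * len(x)
    n_blocks = -(-len(x) // BLOCK_SIZE)   # ceil division: the launch grid
    for pid in range(n_blocks):           # a GPU runs these in parallel
        add_kernel(x, y, out, pid)
    return out
```

In Triton, you write the per-block body and the compiler handles what SIMT CUDA makes you manage by hand: per-thread indexing, memory coalescing, and shared-memory staging.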

[14:00] Scaling data processing from CPU to distributed GPUs (William Malpica @ Voltron Data)

William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!

Most people today use GPUs for training and inference. Data processing is a category of workloads that GPUs excel at but are underutilized for. William will discuss why large-scale data processing should be done on GPUs instead of CPUs, and how tools like cuDF (part of RAPIDS) and Theseus leverage GPUs for data processing.
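Part of cuDF's appeal is that it mirrors the pandas API, so existing CPU dataframe code ports with little change. The sketch below runs with pandas; on a machine with RAPIDS installed, swapping the import for `import cudf as pd` is often the only change needed (API coverage varies by cuDF version, so treat this as a sketch, and the sales data is made up for illustration):

```python
import pandas as pd  # with RAPIDS on a GPU: `import cudf as pd`

# Hypothetical sales data; groupby/aggregate is exactly the kind of
# query that engines like cuDF and Theseus accelerate on GPUs.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales":  [100.0, 80.0, 120.0, 60.0],
})
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'east': 220.0, 'west': 140.0}
```

At the 100TB scale William mentions, the same columnar operations are distributed across many GPUs rather than run on one.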
