

NVIDIA: Optimizing Inference on Large Language Models With NVIDIA
Please register here: https://2ly.link/25pU4
Event Information
Date: Thursday, April 17, 2025
Time: 2:00 p.m. IST | 4:30 p.m. SGT | 9:30 a.m. CET
Duration: 1 hour 30 mins
Build Production-Ready Generative AI Solutions to Transform Your Organization
AI inference, the process of running trained models to generate predictions, has become a major focus in AI development, especially with the rise of large language models (LLMs) like GPT, Llama, and Gemini. Unlike training, which happens in high-performance data centers, AI inference needs to be efficient, scalable, and cost-effective for real-world applications. To deploy AI inference successfully, organizations need a full-stack approach that supports the end-to-end AI life cycle, along with tools that empower teams to meet their goals.
In this webinar, we’ll take a detailed look at LLM inference and optimization, including the theory, model architecture, and mathematical foundations. You’ll learn about prompt processing, token generation, and inference optimization with TensorRT-LLM. We’ll also discuss measuring inference performance, including latency and throughput.
The NVIDIA AI Enterprise software platform consists of NVIDIA NIM™ microservices, NVIDIA Triton™ Inference Server, NVIDIA® TensorRT™, and other tools to simplify building, sharing, and deploying AI applications. With enterprise-grade support, stability, manageability, and security, enterprises can accelerate time to value while eliminating unplanned downtime. Register for this webinar and explore the benefits of NVIDIA AI for accelerated inference. Time will be available at the end of the webinar for Q&A.
Who Should Attend:
Developers with Python proficiency and a solid understanding of LLM theory who are working in generative AI and seeking to enhance their knowledge of LLM inference and optimization.
Prerequisites
Python programming proficiency
Familiarity with LLMs and their theoretical foundations
Basic GPU computing and memory management knowledge
Learnings
In this webinar, you'll gain a deep understanding of LLM inference theory and applications, learning about:
Efficient prompt processing and token-generation techniques
Inference performance optimizations using TensorRT-LLM and NVIDIA Triton
LLM performance measurement with GenAI-Perf against models hosted on TensorRT-LLM and Triton, illustrated with a text summarization use case
Real-world case studies: Get answers on how to benchmark different LLMs, understand their latency and throughput, and learn how to design systems around LLMs with latency and throughput in mind.
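To give a flavor of the latency and throughput metrics covered above, here is a minimal Python sketch of how these two numbers are typically derived from a streaming token generator. The `fake_generate` function is a hypothetical stand-in for a real model or inference-server call, not part of any NVIDIA tool; only the timing logic is the point.

```python
import time

def fake_generate(prompt, n_tokens=50):
    # Hypothetical stand-in for a streaming LLM call; yields one token at a time.
    for i in range(n_tokens):
        time.sleep(0.001)  # simulate per-token generation latency
        yield f"tok{i}"

def benchmark(prompt, n_tokens=50):
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in fake_generate(prompt, n_tokens):
        if ttft is None:
            # Time to first token: a common latency metric for interactive use.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,                 # latency
        "tokens_per_s": count / total,  # throughput
    }

stats = benchmark("Summarize this article:", n_tokens=20)
print(stats)
```

In practice, tools like GenAI-Perf automate this measurement against a running inference server and aggregate results across concurrent requests, but the underlying metrics are the same.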
Don't miss this opportunity to enhance your expertise or master a new technology. Plus, receive a free NVIDIA self-paced training course (valued at up to USD 90) when you attend this webinar.
By filling out the form, you agree to share your data with our partner, NVIDIA. Your information will be handled in accordance with NVIDIA's Privacy Policy.
Please register here: https://2ly.link/25pU4