
Optimizing Training Workloads on GPU Clusters

Hosted by Together AI
Past Event
About Event

Register to receive an email link to the recording of this episode of the Together Learning Series!

Join Lucien Avramov (Principal Product Manager) and Ryan Lucchese (Senior Engineer) for a session on optimizing training workloads on GPU clusters.

The talk will cover best practices, technical guidance, and a live demonstration on a 2-node instant Kubernetes cluster. We’ll walk through key considerations from initial setup through training execution and system monitoring.

What we’ll cover:

  • Pre-Cluster Planning: Choosing between Kubernetes and Slurm, sizing GPU resources, and understanding model and data requirements

  • Pre-Flight Validation: Verifying hardware (GPUs, CPUs, memory), software stack (e.g., Docker), and network configuration for RDMA or Ethernet-based setups

  • CPU and GPU Optimization: Understanding workload characteristics, NUMA node configuration, and avoiding common bottlenecks (e.g., CPU-heavy preprocessing)

  • Storage and Data Handling: Comparing parallel file systems vs. local NVMe, managing data ingestion/output, and minimizing transfer overhead

  • Failure Recovery and Observability: Addressing issues like GPU errors, node lockups, and network flaps, and implementing robust observability with tools like nvidia-smi and GPU utilization monitors

  • Live Demo: Running a real training job with basic observability in place, and demonstrating progress checks and troubleshooting workflows
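The pre-flight validation step above can be sketched in a few lines of Python. This is a minimal illustration, not Together AI’s tooling: the tool list is an assumption, and the GPU inventory simply shells out to `nvidia-smi` when a driver is present.

```python
import shutil
import subprocess

def preflight_checks(tools=("nvidia-smi", "docker")):
    """Check that expected binaries are on PATH.
    The tool list here is an illustrative assumption, not an official checklist."""
    return {tool: shutil.which(tool) is not None for tool in tools}

def gpu_inventory():
    """List GPU name and total memory via nvidia-smi; [] when no driver is usable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]
```

Running both checks before scheduling a job catches missing drivers or container runtimes in seconds rather than after the first failed launch.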
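For the NUMA point, the CPUs local to a GPU can be read from Linux sysfs and used to pin data-loading processes near the device. A hedged, Linux-only sketch: the PCI address shown is a made-up example, and real deployments often reach for `numactl` or the data loader’s own affinity options instead.

```python
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string like '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def gpu_local_cpus(pci_addr):
    """CPUs on the same NUMA node as a PCI device (Linux sysfs).
    pci_addr such as '0000:17:00.0' is a hypothetical example address."""
    with open(f"/sys/bus/pci/devices/{pci_addr}/local_cpulist") as f:
        return parse_cpulist(f.read())

def pin_current_process(cpus):
    """Restrict this process to the given CPU set (Linux only)."""
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)
```

Pinning preprocessing workers to the GPU-local CPUs avoids cross-socket memory traffic, one of the common bottlenecks the bullet above refers to.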
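On the storage side, a quick sequential-read probe helps compare a parallel file system mount against local NVMe before committing to a data path. A rough sketch only; for real measurements use a dedicated tool such as fio.

```python
import os
import time

def read_throughput_mb_s(path, block_size=4 << 20):
    """Sequentially read a file and return MB/s. A crude probe: it does not
    drop the page cache, so re-reads of cached files look unrealistically fast."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return (size / 1e6) / elapsed
```

Comparing the same shard read from, say, a parallel file system mount and from a local NVMe copy (paths are deployment-specific) gives a first estimate of the transfer overhead worth minimizing.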
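Finally, for observability, per-GPU utilization can be sampled from `nvidia-smi` and scanned for devices that sit idle mid-job, often the first symptom of an input bottleneck or a locked-up node. The 5% threshold below is an illustrative assumption, not a recommended value.

```python
import shutil
import subprocess

def sample_utilization():
    """Per-GPU utilization (%) via nvidia-smi; [] when no driver is usable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return []
    return [int(line) for line in proc.stdout.split() if line.isdigit()]

def flag_stalled(utilizations, threshold=5):
    """Indices of GPUs below an (assumed) utilization threshold during training."""
    return [i for i, u in enumerate(utilizations) if u < threshold]
```

Polling this in a loop and alerting on persistently flagged GPUs is a minimal version of the progress checks shown in the live demo.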

The session is designed to help you reduce time-to-first-training, improve workload reliability, and make informed decisions about cluster configuration and performance tuning.
