
Optimizing Training Workloads on GPU Clusters

Hosted by Together AI
Past Event
About Event

Register to receive an email link to the recording of this episode of the Together Learning Series!

Join Lucien Avramov (Principal Product Manager) and Ryan Lucchese (Senior Engineer) for a session on optimizing training workloads on GPU clusters.

The talk will cover best practices, technical guidance, and a live demonstration on a 2-node instant Kubernetes cluster. We’ll walk through key considerations from initial setup through training execution and system monitoring.

What we’ll cover:

  • Pre-Cluster Planning: Choosing between Kubernetes and Slurm, sizing GPU resources, and understanding model and data requirements

  • Pre-Flight Validation: Verifying hardware (GPUs, CPUs, memory), software stack (e.g., Docker), and network configuration for RDMA or Ethernet-based setups

  • CPU and GPU Optimization: Understanding workload characteristics, NUMA node configuration, and avoiding common bottlenecks (e.g., CPU-heavy preprocessing)

  • Storage and Data Handling: Comparing parallel file systems vs. local NVMe, managing data ingestion/output, and minimizing transfer overhead

  • Failure Recovery and Observability: Addressing issues like GPU errors, node lockups, and network flaps, and implementing robust observability with tools like nvidia-smi and GPU utilization monitors

  • Live Demo: Running a real training job with basic observability in place, and demonstrating progress checks and troubleshooting workflows
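The pre-flight validation step above can be sketched in a few lines of Python. This is a minimal illustration, not Together AI’s tooling: the tool list is an assumption, and the GPU inventory simply shells out to `nvidia-smi` when a driver is present.

```python
import shutil
import subprocess

def preflight_checks(tools=("nvidia-smi", "docker")):
    """Check that expected binaries are on PATH.
    The tool list here is an illustrative assumption, not an official checklist."""
    return {tool: shutil.which(tool) is not None for tool in tools}

def gpu_inventory():
    """List GPU name and total memory via nvidia-smi; [] when no driver is usable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]
```

Running both checks before scheduling a job catches missing drivers or container runtimes in seconds rather than after the first failed launch.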
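For the NUMA point, the CPUs local to a GPU can be read from Linux sysfs and used to pin data-loading processes near the device. A hedged, Linux-only sketch: the PCI address shown is a made-up example, and real deployments often reach for `numactl` or the data loader’s own affinity options instead.

```python
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string like '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def gpu_local_cpus(pci_addr):
    """CPUs on the same NUMA node as a PCI device (Linux sysfs).
    pci_addr such as '0000:17:00.0' is a hypothetical example address."""
    with open(f"/sys/bus/pci/devices/{pci_addr}/local_cpulist") as f:
        return parse_cpulist(f.read())

def pin_current_process(cpus):
    """Restrict this process to the given CPU set (Linux only)."""
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)
```

Pinning preprocessing workers to the GPU-local CPUs avoids cross-socket memory traffic, one of the common bottlenecks the bullet above refers to.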
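On the storage side, a quick sequential-read probe helps compare a parallel file system mount against local NVMe before committing to a data path. A rough sketch only; for real measurements use a dedicated tool such as fio.

```python
import os
import time

def read_throughput_mb_s(path, block_size=4 << 20):
    """Sequentially read a file and return MB/s. A crude probe: it does not
    drop the page cache, so re-reads of cached files look unrealistically fast."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return (size / 1e6) / elapsed
```

Comparing the same shard read from, say, a parallel file system mount and from a local NVMe copy (paths are deployment-specific) gives a first estimate of the transfer overhead worth minimizing.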
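Finally, for observability, per-GPU utilization can be sampled from `nvidia-smi` and scanned for devices that sit idle mid-job, often the first symptom of an input bottleneck or a locked-up node. The 5% threshold below is an illustrative assumption, not a recommended value.

```python
import shutil
import subprocess

def sample_utilization():
    """Per-GPU utilization (%) via nvidia-smi; [] when no driver is usable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return []
    return [int(line) for line in proc.stdout.split() if line.isdigit()]

def flag_stalled(utilizations, threshold=5):
    """Indices of GPUs below an (assumed) utilization threshold during training."""
    return [i for i, u in enumerate(utilizations) if u < threshold]
```

Polling this in a loop and alerting on persistently flagged GPUs is a minimal version of the progress checks shown in the live demo.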

The session is designed to help you reduce time-to-first-training, improve workload reliability, and make informed decisions about cluster configuration and performance tuning.
