From the Frontlines: Expert Perspectives on LLM & Multi-Modal Evaluation
Dive deep into the cutting edge of AI evaluation at GitHub HQ in San Francisco! This meetup brings together AI developers and practitioners for two in-depth talks from industry experts, covering the latest techniques for evaluating AI applications, including multi-modal systems. Whether you’re classifying audio sentiment, evaluating agentic systems, or still figuring out which evaluations are right for your use case, this event offers actionable insights to elevate your work. Enjoy light food, drinks, and the chance to connect with fellow AI professionals.
Common Mistakes When Running LLM Evals
Speaker: Hamel Husain - Founder @ Parlance Labs
Evaluating LLMs is fraught with pitfalls, from poorly defined metrics to flawed test sets and over-reliance on benchmarks. In this talk, we’ll explore the most common mistakes practitioners make when running LLM evaluations and share practical strategies to avoid them. You’ll walk away with best practices and actionable insights to design evaluations that truly reflect your model’s real-world performance. Whether you’re a beginner or an expert, this talk will help you level up your LLM evals and get reliable, meaningful results.
Best Practices for Tracing and Evaluating Multi-Modal Apps
Speaker: Jason Lopatecki - Co-Founder & CEO @ Arize AI
Evaluating multi-modal AI applications—spanning text, image, and audio—can be complex, but it’s essential for building reliable systems. This talk will cover best practices for tracing and evaluating models across modalities, with a special focus on audio evaluation, a newly introduced capability in Arize. Learn how to assess transcription accuracy, analyze embeddings, and gain deeper insights into model performance. Whether you’re new to multi-modal AI or experienced in the field, this session will provide practical techniques to improve your evaluation workflows and build more trustworthy applications.
[Fireside Chat] The State of LLM Evaluation: What’s Working, What’s Not, and What’s Next
Join Jason Lopatecki, Hamel Husain, and SallyAnn DeLucia (Senior Product Manager, Arize AI) as they discuss the biggest shifts in LLM evaluation, common pitfalls teams face, and the future of assessing multi-modal apps.