From the Frontlines: Expert Perspectives on LLM & Multi-Modal Evaluation
Dive deep into the cutting edge of AI evaluation at GitHub HQ in San Francisco! This meetup brings together AI developers and practitioners for two in-depth talks from industry experts, covering the latest techniques for evaluating AI applications, including multi-modal systems. Whether you’re classifying audio sentiment, evaluating agentic systems, or still figuring out which evaluations are right for your use case, this event offers actionable insights to elevate your work. Enjoy light food, drinks, and the chance to connect with fellow AI professionals.
Common Mistakes When Running LLM Evals
Speaker: Hamel Husain - Founder @ Parlance Labs
Evaluating LLMs is fraught with pitfalls, from poorly defined metrics to flawed test sets and over-reliance on benchmarks. In this talk, we’ll explore the most common mistakes practitioners make when running LLM evaluations and share practical strategies to avoid them. You’ll walk away with best practices and actionable insights to design evaluations that truly reflect your model’s real-world performance. Whether you’re a beginner or an expert, this talk will help you level up your LLM evals and get reliable, meaningful results.
Best Practices for Tracing and Evaluating Multi-Modal Apps
Speaker: Jason Lopatecki - Co-Founder & CEO @ Arize AI
Evaluating multi-modal AI applications—spanning text, image, and audio—can be complex, but it’s essential for building reliable systems. This talk will cover best practices for tracing and evaluating models across modalities, with a special focus on audio evaluation, a newly introduced capability in Arize. Learn how to assess transcription accuracy, analyze embeddings, and gain deeper insights into model performance. Whether you’re new to multi-modal AI or experienced in the field, this session will provide practical techniques to improve your evaluation workflows and build more trustworthy applications.
[Fireside Chat] The State of LLM Evaluation: What’s Working, What’s Not, and What’s Next
Join Jason Lopatecki, Hamel Husain, and SallyAnn DeLucia (Senior Product Manager, Arize AI) as they discuss the biggest shifts in LLM evaluation, common pitfalls teams face, and the future of assessing multi-modal apps.