

Evaluating AI Agents: From Demos to Dependability
In this free live workshop with Ravin Kumar (DeepMind, Google, ex-Tesla) and Hugo Bowne-Anderson (Vanishing Gradients), learn how to trace, test, and debug AI agents—so they actually work in the real world.
Most agent demos look impressive—until they break in practice 😭
In this hands-on session, you’ll build the tools to make AI agents reliable:
🧵 Trace tool use and model reasoning (a sketch follows this list)
🎭 Simulate real interactions and edge cases
📏 Define what success actually means
🚨 Catch silent failures and iterate effectively
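To give a flavour of the tracing step, here’s a minimal sketch, assuming a simple agent loop, that records every tool call so a run can be inspected afterwards. The Trace, ToolCall, and record names are placeholders for illustration, not the workshop’s actual code.

```python
# Minimal tracing sketch (illustrative, not the workshop's exact code):
# record every tool call an agent makes so the run can be inspected later.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    tool: str      # which tool the model chose
    args: dict     # the arguments it passed
    result: Any    # what the tool returned

@dataclass
class Trace:
    question: str                                         # the user's original request
    calls: list[ToolCall] = field(default_factory=list)   # every tool invocation, in order
    final_answer: str = ""                                 # the agent's reply to the user

def record(trace: Trace, tool: str, args: dict, result: Any) -> None:
    """Append one tool invocation to the trace for later replay or evaluation."""
    trace.calls.append(ToolCall(tool=tool, args=args, result=result))
```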
We’ll work through a concrete use case: a lightweight data science agent (sketched after this list) that can:
🗃️ Query a SQL database
📊 Run Python-based data analysis
📈 Generate basic visualizations
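The tools themselves can be plain Python functions. Here’s a minimal sketch of what they might look like; the SQLite file name, the unsandboxed exec shortcut, and the chart helper are illustrative assumptions, not the workshop’s actual implementation.

```python
# Illustrative tool functions for a lightweight data science agent.
# The database path and the lack of sandboxing are simplifications for this sketch.
import contextlib
import io
import sqlite3

def query_sql(sql: str, db_path: str = "example.db") -> list[tuple]:
    """Run a SQL query against a local SQLite database and return the rows."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

def run_python(code: str) -> str:
    """Execute a short Python snippet and capture whatever it prints.
    A real agent would sandbox this; it is kept bare here for brevity."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

def plot_bar(labels: list[str], values: list[float], path: str = "chart.png") -> str:
    """Save a simple bar chart to disk and return the file path."""
    import matplotlib.pyplot as plt
    plt.figure()
    plt.bar(labels, values)
    plt.savefig(path)
    plt.close()
    return path
```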
You’ll see how to evaluate whether it (a sample check is sketched below):
🧠 Chose the right tool
⚙️ Executed the right logic
🗣️ Explained the result correctly
And how to build this kind of iterative evaluation process into your AI agent development workflow—so reliability isn’t an afterthought.
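To make that concrete, here is one way a single check could look, reusing the Trace sketch above; the expected tool name, the GROUP BY heuristic, and the keyword in the answer are invented purely for illustration.

```python
# Illustrative scoring of one agent run against the three questions above.
# Assumes the Trace/ToolCall sketch from earlier; every criterion here is made up.
def evaluate_run(trace: Trace) -> dict[str, bool]:
    """Return a pass/fail flag for tool choice, logic, and explanation."""
    chose_right_tool = bool(trace.calls) and trace.calls[0].tool == "query_sql"
    executed_right_logic = any(
        "GROUP BY" in str(call.args.get("sql", "")).upper()
        for call in trace.calls
        if call.tool == "query_sql"
    )
    explained_correctly = "revenue" in trace.final_answer.lower()  # placeholder keyword check
    return {
        "chose_right_tool": chose_right_tool,
        "executed_right_logic": executed_right_logic,
        "explained_correctly": explained_correctly,
    }
```

In practice you’d run many checks like this over simulated questions and watch how the pass rates move as you iterate.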
All running locally using Gemma 3 models and Ollama.
No cloud dependencies. No frameworks required.
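If you’d like to check your setup beforehand, a quick smoke test along these lines should work once Ollama is running and a Gemma 3 model has been pulled; the exact model tag (for example, gemma3:4b) depends on which variant you pull, and the ollama Python package is just one convenient client.

```python
# Local smoke test: assumes Ollama is running, `ollama pull gemma3` has completed,
# and the `ollama` Python package is installed. Adjust the model tag to what you pulled.
import ollama

response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": "In one sentence, what does GROUP BY do in SQL?"}],
)
print(response["message"]["content"])
```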
This is the third workshop in a series:
1️⃣ We started by building local LLM apps and adding evaluation harnesses to guide iteration.
2️⃣ Then we built agents that could call tools and adapt dynamically.
3️⃣ Now, we focus on making those agents reliable and testable.
Each session stands on its own—but together, they map the real-world development process of AI systems.
Bring your laptop. This is fully hands-on.