Human-seeded Evals: Scaling Judgement with LLMs (with Samuel Colvin)
Everyone agrees evals are essential—but almost no one implements them rigorously. Why? They're hard to write, time-consuming, and brittle. Samuel Colvin, creator of Pydantic and co-founder of Logfire, has been thinking deeply about how to fix that.
In this livestream, we’ll dive into “Human-seeded Evals,” a lightweight process for generating scoring rubrics and LLM-as-a-judge systems by bootstrapping from a few hand-labeled examples. It’s practical, fast, and being tested in dev workflows today.
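To make the idea concrete, here's a minimal sketch of what a human-seeded judge could look like, not the exact process Samuel will demo. It assumes the OpenAI Python SDK (>= 1.0) with an API key in the environment; the model name, prompt wording, and scoring scale are all placeholders.

```python
# Illustrative sketch of a human-seeded LLM-as-a-judge.
# Not Logfire's actual implementation; names and prompts are placeholders.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()


@dataclass
class SeedExample:
    """A single hand-labeled example used to seed the judge's rubric."""
    prompt: str
    response: str
    human_score: int   # e.g. 1 (bad) to 5 (great)
    rationale: str     # why the human gave that score


def build_judge_prompt(seeds: list[SeedExample]) -> str:
    """Turn a handful of hand-labeled examples into a judging rubric."""
    lines = [
        "You are grading LLM responses on a 1-5 scale.",
        "Calibrate your scores against these human-labeled examples:",
    ]
    for s in seeds:
        lines.append(
            f"\nPrompt: {s.prompt}\nResponse: {s.response}\n"
            f"Human score: {s.human_score}. Reason: {s.rationale}"
        )
    lines.append("\nReply with a single integer score and a one-line reason.")
    return "\n".join(lines)


def judge(seeds: list[SeedExample], prompt: str, response: str) -> str:
    """Score a new response with an LLM, calibrated by the seed examples."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": build_judge_prompt(seeds)},
            {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}"},
        ],
    )
    return completion.choices[0].message.content


seeds = [
    SeedExample(
        prompt="Summarise this stack trace",
        response="KeyError raised in parse_config because 'db_url' is missing.",
        human_score=5,
        rationale="Names the error, the function, and the missing key.",
    ),
    SeedExample(
        prompt="Summarise this stack trace",
        response="There was an error in your code.",
        human_score=1,
        rationale="Too vague to act on.",
    ),
]

print(judge(seeds, "Summarise this stack trace", "A TypeError in render()."))
```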
We’ll discuss:
🔁 How to seed a feedback loop with just a few examples
🧪 Using LLMs to scale qualitative judgement
📉 Where evals fail—and how to recover
📊 Lessons from experimenting with this approach in Logfire
🧠 What robust evaluation could look like for agents and apps
Whether you're debugging an LLM agent, trying to track regressions, or just tired of vibes-based dev cycles, this one’s for you.