

Animal Benchmark Building Session
Just before the AI for Animals conference, join us for a 4-hour coworking session to create a new, light benchmark!
Description: The Animal Harm Assessment (AHA) project has given us data: over 100,000 answers from 10 chatbots to 4,350 curated questions. The answers have been scored on whether they increase (or decrease) the risk of harm to animals. Some of the QA pairs could plausibly form the "gold standard" for a light QA benchmark (as opposed to an open-ended one). Key unresolved questions to address:
Which questions and which answers to choose?
How to set the benchmark up technically?
How to increase the use of this (and other) animal-related benchmarks?
Come if you are interested; it is especially great if you have some familiarity with:
benchmarks and benchmark development
statistical methods
Python, ideally also the Inspect Evals framework (see the sketch after this list)
people, labs, and institutions who could use and promote this and other animal benchmarks
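If you are new to Inspect, here is a minimal sketch of what an AHA-style light QA task could look like in the framework. The file name aha_gold.csv and the choice of model_graded_qa as scorer are illustrative assumptions, not the project's settled setup:

```python
# Minimal sketch of an Inspect task for a light QA benchmark.
# Assumes a CSV with "input" (question) and "target" (gold answer) columns,
# which is csv_dataset's default; the file name and scorer are illustrative.
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

@task
def aha_light():
    return Task(
        dataset=csv_dataset("aha_gold.csv"),  # hypothetical gold-standard QA pairs
        solver=[generate()],                  # simply ask the model each question
        scorer=model_graded_qa(),             # grade the answer against the target
    )
```

With Inspect installed (pip install inspect-ai), a task file like this can be run against any supported model, e.g. inspect eval aha_light.py --model openai/gpt-4o.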
RSVP to this event page & share it with others who might be interested!
For questions, please reach out to:
Arturs Kanepajs, AI for Animals Benchmarking Lead akanepajs@gmail.com
Constance Li, AI for Animals Founder constance@aiforanimals.org
There will be snacks and light refreshments.
Expected agenda:
12:00-12:30 - Introductions and overview of goals
12:30-14:00 - Working session on benchmark development
14:00-14:15 - Break for refreshments
14:15-15:45 - Continue working session
15:45-16:00 - Wrap-up and next steps
Some more materials to review before the session, if you can:
A very short presentation on the AHA Benchmark
CSVs with public-split results:
questions (~3k) with answers from 11 models
23 runs (~70k answers in total)
each answer assessed and scored by 3 LLM judges
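One starting point for the first key question above (which questions and answers to choose) is to keep only QA pairs where the three judges agree. The sketch below is a rough illustration only: the file name and column names (question_id, model, answer, score) are guesses and will need to be adjusted to the actual public-split CSVs:

```python
# Sketch: pick gold-standard candidates where all three LLM judges agree.
# Column names (question_id, model, answer, score) are assumptions;
# adjust them to match the actual public-split CSVs.
import pandas as pd

# One row per (question, model, run, judge) with that judge's harm score.
answers = pd.read_csv("answers_public.csv")

agreement = (
    answers
    .groupby(["question_id", "model", "answer"])["score"]
    .agg(["mean", "nunique", "count"])
    .reset_index()
)

# Unanimous judgments (all three judges gave the same score) are the
# safest candidates for a light benchmark's answer key.
gold_candidates = agreement[(agreement["nunique"] == 1) & (agreement["count"] == 3)]
print(f"{len(gold_candidates)} unanimous QA pairs out of {len(agreement)}")
```

From there, the remaining choices (how many pairs, how to balance topics and models) are exactly what the working session is for.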
To get updates about the outcomes and next steps, join the Hive Slack (www.joinhive.org), channel #s-llm-benchmarking.