Generative AI-focused workshops, hackathons, and more. Come build with us!

Arize AI

Join Arize Co-Founder Jason Lopatecki and ML Growth Lead Amber Roberts as they discuss “AgentBench: Evaluating LLMs as Agents”. The paper introduces AgentBench, the first benchmark designed to evaluate LLMs' ability to operate as autonomous agents in various scenarios. We'll talk through the paper's finding of a significant performance gap between leading commercial API-based LLMs and open-source alternatives, and the impact that disparity will have on the future of the industry.

Community Paper Reading: Evaluating LLMs as Agents

Brian Kitano

Roy Nallapeta

Atal Agarwal

Mat Allen

Vibhu Sapra

Sagar Saija

Raymond Lee

Jason Hu

James Murdza

Mei Chen
