ML Collective & Capitol AI round table and RFP: the challenge of evaluating LLMs
ML Collective and Capitol AI co-host a discussion of the importance and difficulty of evaluating large language models (LLMs) in real-world deployments with multi-stage pipelines and 100k+ users.
Model evaluations are hard for several reasons, and they are even harder in contexts where evaluation becomes a combinatorial problem, such as an application built on LLM orchestration. Capitol AI has an LLM orchestration layer with function calling, RAG, and chain-of-thought reasoning as key parts of the pipeline. This lets people generate and edit multimodal documents, but the complexity of the pipeline and its output artifacts makes evaluation difficult.
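To make the combinatorial framing concrete, here is a minimal sketch in Python. The stage names and the scoring function are hypothetical placeholders, not Capitol's actual pipeline; the point is only that the number of end-to-end configurations to evaluate grows multiplicatively with the choices at each stage.

```python
# Hypothetical sketch: each pipeline stage has alternatives, and end-to-end
# quality depends on how they interact, so configurations multiply.
from itertools import product
import random

retrievers = ["bm25", "dense"]               # RAG stage options (illustrative)
strategies = ["direct", "chain_of_thought"]  # prompting strategies (illustrative)
models = ["model_a", "model_b"]              # underlying LLMs (illustrative)

def evaluate_pipeline(retriever, strategy, model, eval_set):
    """Placeholder scorer: stands in for running the full pipeline on eval_set
    and computing an aggregate quality metric."""
    return random.random()

eval_set = ["example query 1", "example query 2"]
configs = list(product(retrievers, strategies, models))
print(f"{len(configs)} end-to-end configurations to evaluate")
for cfg in configs:
    print(cfg, round(evaluate_pipeline(*cfg, eval_set), 3))
```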
The discussion concludes with the launch of a Request for Plot (RFP) from Capitol for anyone interested in working on these problems. A Colab notebook is available to get familiar with eval basics, along with slides introducing the need for evaluation.
Agenda
(15 minutes) Icebreaker and personal introductions
(30 minutes) Presentation from Capitol and discussion
The importance and challenges of LLM evaluations
Current state of the art in LLM evaluation techniques:
Benchmarks and metrics (e.g., BLEU, SuperGLUE, MMLU), human evaluation methods, LLM-as-judge, and RAG evals (a minimal LLM-as-judge sketch follows the agenda).
Request for Plot: build an evaluation function in Colab
Brief overview of Capitol AI and API
Introduction to Capitol.ai and its LLM orchestration
Open discussion on LLM evaluations in complex applications. What would you try?
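As a warm-up for the evaluation techniques listed in the agenda, here is a minimal LLM-as-judge sketch. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model name and rubric are illustrative choices, not part of Capitol's pipeline.

```python
# Minimal LLM-as-judge sketch: ask a judge model to score an answer on a 1-5 scale.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the answer from 1 (unusable) to 5 (excellent) for factual accuracy "
    "and relevance to the question. Reply with the number only."
)

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Score one question/answer pair with a judge model."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is RAG?", "Retrieval-augmented generation grounds LLM output in retrieved documents."))
```

Averaging judge scores over an eval set gives a single number per pipeline configuration, which is one simple way to compare the combinatorial variants discussed above.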
Request for Plot (RFP)
A Request for Plot is a method for jump-starting a conversation or a decentralized collaboration: pitch a project, starting with the big-picture idea and ending with a concrete description of the first plot you would produce to get started. It is a call for collaborators, and anyone who wishes to work on the project only has to produce that plot to start the collaboration!
In this event, Capitol AI is launching an RFP for methods of evaluating LLMs. If you're interested in working on this problem, work through the notebook, produce a plot of your results at the end, and email it to Tom Hallaran (tom@capitol.ai). We've also created the #capitol-ai-llm-rfp channel on the MLC Discord for discussion of this problem and any results.
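For reference, the final plot in a submission could be as simple as the sketch below, which uses matplotlib with made-up placeholder scores; substitute the results from your own evaluation function.

```python
# Sketch of an RFP submission plot: one aggregate eval score per pipeline
# configuration. The scores below are placeholders, not real results.
import matplotlib.pyplot as plt

configs = ["baseline", "+RAG", "+CoT", "+RAG+CoT"]
scores = [0.52, 0.61, 0.58, 0.67]  # replace with your own evaluation results

plt.bar(configs, scores)
plt.ylabel("evaluation score")
plt.title("LLM pipeline eval (placeholder data)")
plt.tight_layout()
plt.savefig("rfp_plot.png")
```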
About the organizations
ML Collective (MLC) is an independent 501(c)(3) non-profit organization with a mission to make machine learning research, collaboration, mentorship, and practice accessible for all.
Capitol AI is a research assistant and AI-enabled multimodal document editor in one. Capitol AI was accepted into the Summer 2024 batch of Y Combinator.