Evaluating LLMs: Needle in a Haystack
LLM evaluation is a discipline where confusion reigns and foundation model builders are effectively grading their own homework.
Building on viral threads on X/Twitter, Greg Kamradt, Robert Nishihara, and Jason Lopatecki discuss highlights from Arize AI's ongoing research into how major foundation models – from OpenAI's GPT-4 to Mistral and Anthropic's Claude – stack up against each other on important tasks and emerging LLM use cases. They cover results from Needle in a Haystack tests and other evals, including hallucination detection on private data, question-and-answer, code functionality, and more, and explain why those results matter.
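For anyone new to the format: a Needle in a Haystack test inserts a short "needle" fact at varying depths inside a long "haystack" context and asks the model to retrieve it, mapping where retrieval breaks down as context length and needle position change. Below is a minimal sketch of that loop; the needle, filler text, model name, and pass/fail scoring are illustrative assumptions, not the exact harness used by Greg Kamradt or Arize.

```python
# Minimal Needle-in-a-Haystack sketch (illustrative only).
# Assumes an OpenAI-compatible client; the needle, filler, model name,
# and scoring below are simplified placeholders, not the original benchmark.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The secret passphrase for the meetup is 'blue-orchid-42'."
FILLER = "Large language models are evaluated on many downstream tasks. " * 2000
QUESTION = "What is the secret passphrase for the meetup?"


def build_haystack(depth_fraction: float) -> str:
    """Insert the needle at a given relative depth of the long context."""
    cut = int(len(FILLER) * depth_fraction)
    return FILLER[:cut] + "\n" + NEEDLE + "\n" + FILLER[cut:]


def run_trial(depth_fraction: float, model: str = "gpt-4-turbo") -> bool:
    """Ask the model to retrieve the needle; return True if it succeeds."""
    context = build_haystack(depth_fraction)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": f"{context}\n\nQuestion: {QUESTION}"},
        ],
    )
    answer = response.choices[0].message.content or ""
    # Naive scoring: did the model recover the passphrase verbatim?
    return "blue-orchid-42" in answer


if __name__ == "__main__":
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"depth={depth:.2f} retrieved={run_trial(depth)}")
```

Sweeping both needle depth and haystack length produces the familiar heatmaps from the original threads, showing where a given model starts losing facts buried deep in long contexts.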
Curious which foundation models your company should be using for a specific use case – and which to avoid? You won’t want to miss this meetup!
-------
Agenda:
5:30 PM - 6:00 PM: Arrival & Networking
6:00 PM - 6:30 PM: Fine-tuning for Context Length Extension + Q&A w/ Kourosh Hakhamaneshi
6:30 PM - 7:15 PM: Evaluating LLMs: Needle in a Haystack Fireside Chat + Q&A
7:15 PM - 8:00 PM: Networking & Drinks