Evaluating LLMs - recent practice and new approaches - AIFoundry.org Podcast
Providing AI developers with more choices through open models & platforms is a nice idea, but we need ways to evaluate which choices best map to our application requirements.
Yulia Yakovleva will join us with her review of the foundational paper “Evaluating Large Language Models: A Comprehensive Survey” by Zishan Guo and colleagues. This paper categorizes the evaluation of LLMs into knowledge and capability, alignment, and safety. It underscores the necessity for thorough evaluation to harness the benefits of LLMs while mitigating risks such as bias and misinformation. The most direct example of this approach is embodied in the Open LLM Leaderboard on Hugging Face.
More recently, a new paper, “LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code” by Naman Jain and colleagues, provides a new approach for augmenting the evaluation of LLMs. While this work is specific to code generation, we’ll discuss how this approach might extend to evaluating LLMs more broadly.
AIFoundry.org podcasts are held live in front of a virtual studio audience in the AIFoundry.org Discord community: https://discord.gg/WNKvkefkUs