

Ragas Paper Club #2
Paper Club #2 – The Leaderboard Illusion
Large-language-model leaderboards look definitive—until you peek behind the curtain. The Leaderboard Illusion (Singh, Nan, et al.) reveals how undisclosed testing, cherry-picked “best runs,” and skewed sampling quietly tilt Chatbot Arena’s rankings, warping the very notion of “state-of-the-art.”
On July 24 at 9:00 AM PT, we’ll dig into:
Shadow runs & score retractions – how vendors trial dozens of versions, publish the single top score, then erase the rest.
Data-access asymmetry – why giants with privileged API taps get to fine-tune on more than 60% of Arena battles, sidelining open-source peers.
Deprecations that break the graph – silent model removals that shatter Bradley-Terry comparisons and inflate win rates.
A roadmap to fairness – five concrete fixes (from banning retractions to open-sourcing the sampler) that could reboot LLM benchmarking integrity.
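For a feel of why "publish the single top score" matters, here is a minimal simulation sketch of best-of-N selection. It is not the paper's code or Chatbot Arena's scoring: the function names, the true skill of 1200, the noise level of 15 points, and the variant counts are all illustrative assumptions. Every variant has the same underlying skill, so any gap in the published number comes purely from picking the luckiest run.

```python
import random
import statistics


def observed_score(true_skill: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy leaderboard measurement of a model with the given true skill."""
    return rng.gauss(true_skill, noise_sd)


def best_of_n(true_skill: float, noise_sd: float, n_variants: int, rng: random.Random) -> float:
    """Score that gets published if n_variants are tested privately and only the best is kept."""
    return max(observed_score(true_skill, noise_sd, rng) for _ in range(n_variants))


def mean_published_score(
    n_variants: int,
    trials: int = 5000,
    true_skill: float = 1200.0,  # hypothetical rating, not a real Arena value
    noise_sd: float = 15.0,      # hypothetical measurement noise
    seed: int = 0,
) -> float:
    """Average published score over many simulated providers using best-of-n selection."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n(true_skill, noise_sd, n_variants, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    # Same true skill every time; only the number of private variants changes.
    for n in (1, 5, 10, 30):
        print(f"variants tested: {n:>2}  mean published score: {mean_published_score(n):.1f}")
```

With one variant, the published score averages out near the true skill. As the number of private variants grows, the published score drifts upward even though no variant is actually better.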
Speakers
• Shahul – Founder, Ragas
• Mike – Product Leader, Ex-NodeSource
20 min walkthrough → 15 min live Q&A.
Live and free, on Zoom.
Slides, notes, and code links shared afterward.
See you in the chat!
🔗 Paper: https://arxiv.org/abs/2504.20879