

Paper Club: A Moore's Law for AI Capabilities
Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.
Join us as we explore "Measuring AI Ability to Complete Long Tasks," a fascinating paper that introduces a new way to track AI progress using an intuitive, human-centered metric. Instead of relying on traditional benchmarks that often saturate quickly, the researchers propose measuring AI capabilities through the "task completion time horizon," which asks: how long are the tasks that AI can complete with 50% reliability? By combining three diverse task suites (HCAST, RE-Bench, and a new suite called SWAA), they create a comprehensive evaluation spanning everything from 2-second decisions to 8-hour software projects.
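To make the metric concrete, here is a minimal sketch of how a 50%-reliability time horizon could be estimated: fit a logistic curve of success probability against log task length, then solve for the length where the curve crosses 50%. The observation data and fitting details below are illustrative assumptions, not the paper's actual code or dataset.

```python
import math

# Hypothetical (task_length_minutes, succeeded) records for one model.
# Short tasks mostly succeed, long tasks mostly fail.
observations = [
    (1, 1), (2, 1), (4, 1), (8, 1), (16, 1), (32, 1),
    (32, 0), (64, 1), (64, 0), (128, 0), (256, 0), (512, 0),
]

def fit_time_horizon(obs, lr=0.1, steps=5000):
    """Fit success ~ sigmoid(a - b * log2(minutes)) by gradient ascent on
    the log-likelihood, then return the task length where p = 0.5."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, y in obs:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            grad_a += y - p          # d(log-likelihood)/da
            grad_b += -(y - p) * x   # d(log-likelihood)/db
        a += lr * grad_a / len(obs)
        b += lr * grad_b / len(obs)
    # p = 0.5 exactly when a - b * log2(m) = 0, i.e. m = 2 ** (a / b)
    return 2 ** (a / b)

horizon = fit_time_horizon(observations)  # ~40-50 minutes for this toy data
```

The key design choice is working in log task length: human task durations span seconds to days, and the paper's trend lines are exponential, so the logistic fit is far better behaved on a log axis.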
The results are striking: AI time horizons have been doubling approximately every seven months since 2019, with current frontier models like Claude 3.7 Sonnet capable of completing tasks that typically take humans around 50 minutes. Through qualitative analysis, the researchers identify key drivers of this progress: improved logical reasoning, better tool use, and crucially, greater ability to adapt to mistakes rather than repeating failed actions. However, they also find that models still struggle significantly with "messier" real-world tasks that lack clear feedback loops or require proactive information gathering.
If these trends continue and generalize to real-world tasks, the implications are profound: the paper's extrapolations suggest AI could automate month-long software development tasks within the next 5-7 years. Join us to discuss this compelling new framework for measuring AI progress, its external validity concerns, and what exponential growth in task completion abilities might mean for the future of work and AI capabilities.
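The extrapolation above can be reproduced as a back-of-envelope calculation. The figures below (a ~50-minute current horizon, a 7-month doubling time, and one work-month as roughly 167 hours of human labor) are illustrative assumptions drawn from the summary above; the paper's own extrapolation is more careful about uncertainty.

```python
import math

current_horizon_min = 50       # ~50-minute horizon for a current frontier model
doubling_time_months = 7       # observed doubling period since 2019
work_month_min = 167 * 60      # one human work-month ~ 167 hours (assumption)

# Number of doublings to go from a 50-minute horizon to a one-month horizon,
# then convert to calendar time at one doubling per 7 months.
doublings = math.log2(work_month_min / current_horizon_min)
years = doublings * doubling_time_months / 12
print(f"~{years:.1f} years until month-long tasks, if the trend holds")
```

This naive point estimate lands around four and a half years, consistent with the 5-7 year range once you allow for uncertainty in the baseline horizon and doubling rate.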
The paper can be found here: https://arxiv.org/abs/2503.14499