Despite their expertise, AI developers don’t always know what their most advanced systems are capable of—at least, not at first. To find out, systems are subjected to a range of tests—often called evaluations, or ‘evals’—designed to tease out their limits. But due to rapid progress in the field, today’s systems regularly achieve top scores on many popular tests, including SATs and the U.S. bar exam, making it harder to judge just how quickly they are improving.
A new set of much more challenging evals has emerged in response, created by companies, nonprofits, and governments. Yet even on the most advanced evals, AI systems are making astonishing progress. In November, the nonprofit research institute Epoch AI announced a set of exceptionally challenging math questions developed in collaboration with leading mathematicians, called FrontierMath, on which currently available models scored only 2%. Just one month later, OpenAI’s newly-announced o3 model achieved a score of 25.2%, which Epoch’s director, Jaime Sevilla, describes as “far better than our team expected so soon after release.”
Amid this rapid progress, these new evals could help the world understand just what advanced AI systems can do, and—with many experts worried that future systems may pose serious risks in domains like cybersecurity and bioterrorism—serve as early warning signs, should such threatening capabilities emerge in future.
Harder than it sounds
In the early days of AI, capabilities were measured by evaluating a system’s performance on specific tasks, like classifying images or playing games, with the time between a benchmark’s introduction and an AI matching or exceeding human performance typically measured in years. It took five years, for example, before AI systems surpassed humans on the ImageNet Large Scale Visual Recognition Challenge, established by Professor Fei-Fei Li and her team in 2010. And it was only in 2017 that an AI system (Google DeepMind’s

