The AI Benchmark That Stumps Every Model: 2,500 Expert Questions That Current Systems Cannot Answer
Standard AI benchmarks have been outpaced. The MMLU exam, once a meaningful test of language model capability, now produces near-perfect scores from leading systems. The problem this creates is not merely academic: benchmarks are how developers, policymakers, and the public understand what AI can and cannot do. When the tests become too easy, the gap between measured capability and actual capability grows harder to see.
A consortium of nearly 1,000 researchers from disciplines spanning mathematics, ancient languages, medicine, biology, and physics built a response: Humanity's Last Exam (HLE), a 2,500-question assessment designed specifically to resist AI success. The methodology was simple in principle and rigorous in execution - any question that any leading AI system could answer was removed. What remained was published in Nature, with documentation at lastexam.ai.
How the Questions Were Designed
Each question had to satisfy several criteria: it needed a single, unambiguous, verifiable answer; it could not be solved through simple internet retrieval; and it had to draw from genuine expert-level knowledge rather than the kind of synthesis tasks that language models handle well. Contributors included historians, physicists, linguists, medical researchers, and computer scientists from institutions worldwide.
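The criteria are easier to see as a concrete checklist. The sketch below is a hypothetical encoding for illustration only; the field names and the passes_review helper are assumptions, not the HLE project's actual review tooling.

```python
# Hypothetical sketch of the stated submission criteria as a review checklist.
# Field names and passes_review() are illustrative assumptions, not HLE code.
from dataclasses import dataclass

@dataclass
class CandidateQuestion:
    prompt: str
    answer: str                # the single, unambiguous, verifiable answer
    domain: str                # e.g. "mathematics", "ancient languages"
    retrieval_solvable: bool   # reviewer judgment: solvable by a simple web search?
    expert_authored: bool      # written and checked by a domain expert

def passes_review(q: CandidateQuestion) -> bool:
    """A candidate advances only if it satisfies all three stated criteria."""
    has_single_answer = bool(q.answer.strip())
    return has_single_answer and not q.retrieval_solvable and q.expert_authored
```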
Among the contributors was Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, who authored 73 of the 2,500 public questions - the second-highest contributor count, and the most in mathematics and computer science. Questions range from translating ancient Palmyrene inscriptions to identifying microanatomical structures in birds to analyzing the phonological features of Biblical Hebrew.
After drafting, each question was tested against leading AI models. If any system produced the correct answer, the question was eliminated. The result is a benchmark calibrated to sit precisely at the current edge of AI capability - not hypothetically hard, but demonstrably hard based on actual performance.
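In rough terms, that filtering step is an adversarial loop over candidate questions. The sketch below illustrates the idea under stated assumptions: query_model, the exact-match grading rule, and the model list are stand-ins, not the published HLE pipeline.

```python
# Illustrative sketch of the adversarial filtering step described above.
# query_model() and the grading rule are placeholders, not the real pipeline.
def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to a frontier model's API."""
    raise NotImplementedError

def survives_filter(question: dict, frontier_models: list[str]) -> bool:
    """A question is kept only if every tested model answers it incorrectly."""
    for model in frontier_models:
        prediction = query_model(model, question["prompt"])
        if prediction.strip().lower() == question["answer"].strip().lower():
            return False   # one correct answer from any system eliminates it
    return True

# Usage sketch: retain only the questions no tested system can answer.
# models = ["gpt-4o", "claude-3-5-sonnet", "o1"]
# benchmark = [q for q in candidate_questions if survives_filter(q, models)]
```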
Where the Leading Models Currently Stand
Early evaluation results placed the performance of major systems in a range that illustrated how wide the gap was at release. GPT-4o scored 2.7 percent. Claude 3.5 Sonnet reached 4.1 percent. OpenAI's o1 model achieved 8 percent. More recent models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached approximately 40 to 50 percent accuracy.
That last figure requires context. Fifty percent on a 2,500-question exam that was filtered to remove anything the leading models of the time could answer is not a score that would indicate competence in any of the fields the exam covers. It is a measure of progress against a deliberately elevated target, not a claim of near-human performance. The benchmark is also designed to stay challenging as models improve: new questions can be added as systems advance, replenishing the pool of problems they cannot yet solve.
Why Benchmark Saturation Matters
"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," Nguyen said. The concern is not abstract. AI products are being deployed in medical, legal, and educational settings on the basis of capability claims that derive partly from benchmark performance. If the benchmarks are saturated, those claims overstate actual capability in precisely the domain - expert-level specialized knowledge - where errors matter most.
The HLE framework addresses this by holding part of the exam back. The published questions support open community evaluation, while a separate set is kept private to guard against training data contamination. AI developers submitting systems for evaluation receive scored results without gaining access to the hidden questions, reducing the risk that future training runs will memorize the test.
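A rough sketch of that submission flow, assuming a generic exact-match grader, might look like the following; the function names and return format are illustrative assumptions, not the lastexam.ai interface.

```python
# Minimal sketch of held-out evaluation: the private questions stay server-side,
# and a submitter only ever sees an aggregate score. Names and the exact-match
# grading rule are assumptions for illustration.
def evaluate_on_hidden_set(generate_answer, hidden_questions: list[dict]) -> dict:
    """Score a submitted system against the private pool, returning totals only."""
    correct = 0
    for q in hidden_questions:                      # never returned to the submitter
        prediction = generate_answer(q["prompt"])   # the submitted system's answer
        if prediction.strip().lower() == q["answer"].strip().lower():
            correct += 1
    return {"n_questions": len(hidden_questions),
            "accuracy": correct / len(hidden_questions)}
```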
What the Benchmark Is Not
Nguyen pushed back against the name's apocalyptic framing. "This isn't a race against AI. It's a method for understanding where these systems are strong and where they struggle." The goal is measurement, not alarm. A system that scores 50 percent on an exam where its predecessors managed less than 3 percent has improved by roughly a factor of twenty. Knowing where it still fails - and specifically which kinds of expert reasoning remain beyond reach - is valuable information for developers and for anyone deciding how to deploy these tools.
The exam's breadth - spanning mathematics, natural sciences, humanities, and highly specialized subfields - is itself informative. Current AI systems fail not in one narrow domain but across domains, suggesting that the limitations are not subject-specific but reflect something more fundamental about how current architectures process and apply expert knowledge.