“Failing to Understand the Exponential, Again”: Why Some AI Observers Get Progress Wrong
AI fatigue is real. After the initial waves of excitement around large language models, some industry watchers are beginning to murmur that progress is slowing. Updates feel more incremental, and new releases often seem like fine-tuned reruns of past breakthroughs. But beneath the surface, an entirely different story may be unfolding—one that points not to stagnation, but to a hidden acceleration.
Julian Schrittwieser, a researcher at Anthropic who previously worked at Google DeepMind, has emerged as one of the most vocal critics of the “AI slowdown” narrative. Drawing on data from two rigorous evaluation sources, METR’s long-horizon benchmarks and OpenAI’s newly unveiled GDPval, he argues that AI capabilities are continuing to improve exponentially. In other words, just because the changes aren’t always visible in flashy demos doesn’t mean progress isn’t happening. It is, and it may be faster than most people think.
Measuring Autonomy: What METR Reveals About AI’s Hidden Progress
METR, short for Model Evaluation and Threat Research, is a nonprofit research organization dedicated to evaluating AI models on long-horizon tasks: challenges that require a model to maintain coherence and problem-solving performance over extended periods, up to several hours. One of METR’s core benchmarks measures the “time horizon” an AI model can handle autonomously, defined as the length of a task, in human working time, that the model completes with a 50 percent success rate. According to METR’s findings, this time horizon has been doubling roughly every seven months.
Schrittwieser highlights recent METR results showing that GPT‑5, OpenAI’s latest flagship model, can successfully complete software engineering tasks lasting more than two hours at that 50 percent success threshold. This is a significant leap over its predecessors and suggests that the model is becoming increasingly capable of tackling real-world, open-ended problems without constant human supervision.
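A rough back-of-the-envelope calculation makes the trend concrete. The sketch below is illustrative only: it assumes METR’s seven-month doubling continues unchanged and takes the roughly two-hour GPT‑5 horizon as the baseline, neither of which is guaranteed.

```python
# Naive projection of METR's 50%-success "time horizon" metric.
# Assumptions (not METR's own forecast): a constant 7-month doubling
# time and a current horizon of about 2 hours.

DOUBLING_MONTHS = 7
CURRENT_HORIZON_HOURS = 2.0

def projected_horizon(months_ahead: float) -> float:
    """Projected time horizon (hours) under pure exponential doubling."""
    return CURRENT_HORIZON_HOURS * 2 ** (months_ahead / DOUBLING_MONTHS)

for months in (0, 7, 14, 21, 28):
    print(f"+{months:2d} months: ~{projected_horizon(months):5.1f} hours")
# +0 -> 2h, +7 -> 4h, +14 -> 8h (a full workday), +28 -> ~32h
```

On this extrapolation, an eight-hour horizon, a full workday, sits only two doublings out, which is the arithmetic behind the projections discussed later in this piece.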
Importantly, these gains are not sudden jumps tied to splashy releases. Instead, they follow a clear exponential trend, with each model generation extending autonomous task length by a roughly constant multiple. This undermines the perception that progress has plateaued. Rather than diminishing returns, Schrittwieser sees evidence of compounding returns, especially in areas like planning, code synthesis, and long-form reasoning.
GDPval and the Rise of “Real Work” Benchmarks
To further reinforce his case, Schrittwieser draws on OpenAI’s GDPval benchmark, a novel attempt to measure how AI models perform on economically meaningful tasks across a broad range of professions. GDPval includes over 1,300 real-world tasks designed by experienced professionals in law, finance, engineering, consulting, healthcare, and other high-skill industries. These tasks are not mere academic puzzles; they reflect what experts actually do in their day-to-day work.
The performance of advanced models like GPT‑5 and Anthropic’s Claude Opus 4.1 on GDPval is striking. In many cases, these models approach or even match human expert performance. They demonstrate competence in tasks ranging from legal drafting and financial modeling to software debugging and clinical decision-making. While they are not flawless—and certainly not human replacements—they show a level of professional proficiency that was unthinkable just a few years ago.
What makes this particularly compelling is the alignment between METR and GDPval. One measures long-duration autonomy; the other assesses economic usefulness. When both point to rising capabilities, the case for a hidden acceleration becomes harder to ignore. Schrittwieser suggests that improvements in each domain reinforce the other, indicating systemic advances rather than isolated spikes.
The Illusion of Stagnation
Why, then, do so many commentators believe that AI progress is slowing? Schrittwieser argues that many are falling into a cognitive trap: underestimating exponential trends because they seem linear until suddenly they are not. Just as early internet observers in the 1990s dismissed the web for its slow load times and clunky design—failing to anticipate the compound impact of Moore’s Law—today’s AI skeptics may be misjudging the curve.
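The trap is easy to reproduce with numbers. In the toy comparison below (purely illustrative, with made-up units), a doubling process is indistinguishable from a linear one for the first couple of steps, then is suddenly two orders of magnitude ahead:

```python
# Toy illustration of why exponentials get mistaken for linear trends.
# At t=0 and t=1 the series are identical; by t=10 they differ ~100x.
for t in range(11):
    linear = 1 + t            # add 1 each step
    exponential = 2 ** t      # double each step
    print(f"t={t:2d}  linear={linear:3d}  exponential={exponential:5d}")
```

An observer sampling only the early steps sees two similar lines; the divergence arrives all at once.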
Another reason for the disconnect is that benchmarks like METR and GDPval don’t capture the public’s imagination the way a viral chatbot can. It’s easier to notice changes in personality or style than in abstract measures of planning depth or time horizon. But these are precisely the metrics that matter when considering how close AI is to performing complex work independently.
Furthermore, there is a growing divide between user experience and backend capability. ChatGPT, Claude, and other tools are increasingly fine-tuned for safety, alignment, and predictability, which can obscure the raw potential of the underlying models. This intentional dampening can make newer models appear less capable or “dumber,” even when their core abilities are dramatically stronger.
Caveats and Counterpoints
Schrittwieser’s argument is not without its critics. Some point out that a 50 percent success rate on a benchmark task does not equate to a production-ready AI system. Real-world applications often demand near-perfect reliability, especially in domains like healthcare or law. Others question whether one-shot evaluations, like those used in GDPval, truly reflect the iterative, messy nature of human work.
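One way to make the reliability objection concrete is to note that per-step success rates compound multiplicatively. The sketch below is a simplification: it assumes a task decomposes into independent steps with equal success probability, which real workflows rarely satisfy, but it shows how quickly end-to-end reliability decays.

```python
# End-to-end success of a k-step workflow in which every step must
# succeed, assuming independent steps with equal success rate p.
def end_to_end(p: float, k: int) -> float:
    return p ** k

for p in (0.5, 0.9, 0.99):
    row = "  ".join(f"k={k}: {end_to_end(p, k):.3f}" for k in (1, 5, 10))
    print(f"p={p:.2f}  {row}")
# p=0.50 collapses to ~0.001 over 10 steps; even p=0.90 falls to ~0.35.
```

By this logic, critics argue, the bar for unsupervised deployment in high-stakes domains is near-perfect per-step reliability, not a 50 percent benchmark pass rate.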
There’s also the issue of benchmark overfitting. As AI companies design models increasingly with these evaluations in mind, it becomes harder to tell whether we’re measuring genuine general intelligence or simply training for the test. Moreover, most of the available data still comes from controlled settings rather than open deployment. Until these systems prove their worth in the wild, some skepticism remains warranted.
Even so, Schrittwieser’s broader point stands: the trendlines suggest improvement, not stagnation. And if those trends continue, then today’s caveats may become tomorrow’s footnotes.
Rethinking the Narrative
The implications of this hidden exponential growth are profound. If AI capabilities continue to compound at current rates, we could see systems capable of autonomously completing full workdays within a year or two: on METR’s trend, a two-hour horizon is only two doublings, roughly fourteen months, away from eight hours. That doesn’t mean mass unemployment or overnight superintelligence, but it does suggest that the pace of change may soon accelerate beyond what most institutions are prepared for.
Policymakers, business leaders, and the public would do well to recalibrate their expectations. The real risk may not be overhyping AI—it may be underestimating how quickly it’s evolving in ways that matter most. The challenge now is to shift the conversation away from flashy demos and toward deeper questions about deployment, safety, and integration.
Schrittwieser’s warning is clear: don’t mistake surface stillness for a lack of underlying motion. The exponential, once again, is easy to miss, until suddenly it isn’t.