In the race to build machines that think, a growing body of research suggests we may be mistaking verbosity for intelligence.

While developers of Large Reasoning Models (LRMs), the next generation of language models engineered to mimic step-by-step reasoning, claim dramatic improvements in accuracy and logic, a new study reveals those claims are both overstated and fundamentally flawed.

Published by a team of researchers at Apple, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” offers one of the most sobering evaluations yet of AI’s much-hyped cognitive potential.

Authored by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar, the paper doesn’t simply challenge the prevailing narrative; it dismantles it. Their central thesis is direct: current Large Reasoning Models don’t actually “reason” in any consistent or reliable way.

Instead, they simulate the appearance of rational thought through pattern repetition and token inflation, a process that breaks down entirely under modest increases in problem complexity.

Using custom puzzle environments to bypass the contamination common in standard benchmarks like MATH500, the team reveals that frontier models such as Claude 3.7 Sonnet and DeepSeek-R1 quickly collapse into incoherence once the task requires more than superficial computation.
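To make that setup concrete, the sketch below is a minimal Python rendering, not the authors’ code, of what a controllable puzzle environment looks like: a Tower of Hanoi checker whose difficulty is set by a single knob, the number of disks, and whose rules are enforced move by move, so a model’s proposed solution can be verified exactly rather than matched against a potentially memorized answer key. The function name simulate_hanoi and its interface are illustrative choices, not the paper’s.

```python
# Minimal sketch of a controllable puzzle environment (illustrative, not the
# paper's code): difficulty scales with `num_disks`, and every proposed move
# is checked against the Tower of Hanoi rules.

def simulate_hanoi(num_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(num_disks, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                             # illegal: source peg is empty
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                             # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(num_disks, 0, -1))  # solved only if all disks end on peg 2


# An optimal solution for 2 disks takes 2**2 - 1 = 3 moves.
print(simulate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))   # True
```

Because the minimum solution length grows as 2^n - 1 moves, raising the disk count gives the researchers a clean dial for complexity without changing the rules of the game.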

This failure is not subtle. Even with access to solution algorithms in advance, the models often failed to follow explicit instructions. In tests involving the Tower of Hanoi puzzle, LRMs given the solution algorithm in the prompt performed no better than when left to their own trial-and-error methods. It was a damning indicator that these systems cannot reliably execute even basic symbolic logic.

The paper defines three distinct “regimes” of performance. At low complexity, standard non-reasoning models perform just as well as, if not better than, the more sophisticated reasoning versions. In the midrange, where a bit more logic is required, LRMs begin to show value, but only up to a point. Beyond that threshold, both reasoning and non-reasoning models collapse, regardless of how much computation time or token budget they are given.

Worse, the models’ internal behavior exhibits what the authors call an “overthinking phenomenon.” Instead of halting after a correct answer is found, LRMs continue generating additional, often incorrect, reasoning paths.

This leads to wasted computational resources and declining performance. In other cases, the models latch onto a flawed early conclusion and then spend the remainder of their tokens reinforcing that initial error, unable to self-correct.

The illusion of thinking becomes a kind of performance, an AI dragging the user through a long, plausible-sounding monologue that ultimately fails to arrive at a correct or consistent destination. As complexity increases, these performances become shorter and less coherent, suggesting the models are not only failing to think harder but actively thinking less.

This pattern held across multiple puzzle types, including River Crossing, Blocks World, and Checker Jumping. Each was designed to test planning, constraint satisfaction, and symbolic manipulation under varying levels of difficulty. Despite their differences, all puzzles produced the same result: a hard ceiling on AI reasoning performance, followed by collapse.

One particularly telling experiment involved giving models the precise algorithmic steps to follow for solving Tower of Hanoi puzzles. Even when spoon-fed the procedure, models still stumbled. Not because they misunderstood the rules, but because they couldn’t reliably execute sequential logic without deviating or failing to retain state across steps.
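For reference, the procedure in question is the textbook Tower of Hanoi recursion, sketched below in Python. This is an illustrative rendering of the kind of explicit algorithm the models were handed, not the paper’s exact prompt text; executed faithfully, it determines every move, which is precisely why deviating from it is so revealing.

```python
# Textbook Tower of Hanoi recursion -- an illustrative rendering of the kind of
# explicit procedure supplied to the models, not the paper's exact prompt text.

def hanoi(n: int, src: int, aux: int, dst: int, moves: list[tuple[int, int]]) -> None:
    """Append the moves that transfer n disks from peg `src` to peg `dst`."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on the spare peg
    moves.append((src, dst))            # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top of it

moves: list[tuple[int, int]] = []
hanoi(3, 0, 1, 2, moves)
print(len(moves), moves)                # 7 moves (2**3 - 1), every step fully determined
```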

This has profound implications for any attempt to use AI for complex, real-world problem-solving in fields like engineering, law, healthcare, or education. If today’s leading models struggle with controlled logic puzzles, what happens when they’re asked to assist in open-ended, high-stakes reasoning tasks?

The promise of LRMs was never just faster answers. It was better thinking. And yet the Apple study suggests that the current generation of models cannot meaningfully reason beyond surface-level mimicry. Their self-reflection routines, marketed as advanced metacognitive abilities, are at best inefficient and at worst misleading.

Even more troubling is the trend observed as complexity grows. Models begin to allocate fewer resources to their own reasoning process. Despite operating far below their generation limits, LRMs taper off their use of “thinking tokens” as tasks get harder. This inverse relationship between challenge and effort suggests that AI’s celebrated capacity to “scale with complexity” is not only limited, but fundamentally broken in its current form.

What emerges from this investigation is not a portrait of intelligent systems inching toward human-like reasoning but a field at risk of confusing verbosity with cognition. The failures highlighted in “The Illusion of Thinking” study are not marginal bugs. They are architectural limitations. Even when fed the correct algorithm and surrounded by structured rules, current LRMs reveal an inability to follow instructions or sustain coherent logic chains beyond a narrow threshold of complexity.

Nowhere is this more apparent than in the analysis of “reasoning traces,” the internal, step-by-step thoughts that these models generate to reach their conclusions. In simpler puzzles, models sometimes stumble onto the correct solution early, only to discard it and wander into faulty alternatives. In moderately complex tasks, solutions appear only after wading through a swamp of incorrect thinking. And when complexity increases further, correct solutions vanish entirely.

The supposed advantage of structured thought becomes a liability: a performance ritual that masks cognitive disintegration. This collapse is not a one-off anomaly. It occurred consistently across models and puzzle types, even when inference compute, the total token budget available to the model, remained generous.

Rather than doubling down and thinking harder as problems grew more intricate, the models appeared to give up. They used fewer tokens, tried fewer moves, and reduced their cognitive effort precisely when it was needed most.

The implications are alarming. Models celebrated for their emergent thinking behave less like problem-solvers and more like improvisational actors whose confidence fades as the scene gets complicated.

Perhaps the most unsettling takeaway is that the models’ creators may be inadvertently incentivizing these limitations. The study suggests that reasoning improvements in LRMs are tightly coupled to reinforcement learning regimes that reward surface-level plausibility, not actual logical consistency.

In other words, these systems are trained to sound smart, not to be smart. And since popular reasoning benchmarks often suffer from data contamination, it’s unclear whether performance gains are due to genuine reasoning or accidental memorization.

This leads to one of the paper’s most damning observations. Even when evaluation environments are designed to be fair, novel, and controllable, like the custom puzzle simulators used in this study, reasoning models still fail.

Their training data doesn’t prepare them for structured environments requiring rule-based solutions, and their architecture doesn’t support robust planning or abstract symbolic manipulation.

These are not the symptoms of a technology just a few upgrades away from general intelligence. These are deep design flaws.

The authors of the study are careful to note the value of what they call the “middle regime,” that narrow band of problem difficulty where LRMs do outperform their simpler, non-reasoning counterparts. But they also caution against reading too much into this success.

Gains in this region come at great computational cost, and the models quickly hit diminishing returns. Worse still, the boundary between middle and high complexity is treacherously thin. Just a modest increase in problem depth is enough to plunge LRMs into collapse.

For AI developers and policymakers, the paper offers a crucial warning. We may be designing products and regulations based on an inflated sense of what current models can do. The illusion of coherent thought, complete with neatly formatted reasoning steps, is seductive.

But beneath the surface lies a brittle process that buckles under pressure. The more we rely on LRMs to simulate thinking, the more we risk mistaking decoration for depth. If AI is to advance beyond this shallow mimicry, future models will need more than clever prompting or reinforcement fine-tuning. They will need architectures capable of internalizing logic, sustaining abstract thought, and correcting their own paths when wrong.

Until then, users and institutions must reckon with a hard truth. Today’s reasoning models are not thinking machines. They are theatrical ones, performing intelligence without possessing it.

Cora Yalbrin