The illusion of 'thoughts' and 'reasoning' in LLMs
Treating chain-of-thought tokens as genuine reasoning in LLMs is misleading. They're navigational aids, not evidence of true cognitive processing, and no guarantee of reliability.
This article is part of our coverage of the latest in AI research.
When a large language model (LLM) solves a complex problem, it often produces a step-by-step "chain of thought" (CoT) before giving the final answer. These “intermediate tokens” look intuitive and transparent, as if we were watching the model reason in real time.
However, a recent paper by researchers at Arizona State University argues that this is a dangerous misinterpretation. “While some forms of anthropomorphization can be treated rather indulgently as harmless and metaphorical,” they write, “viewing [intermediate tokens] as reasoning/thinking is more serious and may give a false sense of model capability and correctness.”
Their work presents evidence that these tokens are not a trace of the model's "thoughts" and that this common assumption misleads us about how these systems truly work.
Intermediate tokens
The paper’s critique is not aimed at the polished, post-hoc explanations that model providers display in products such as OpenAI o3 and Google Gemini 2.5 Pro. Instead, it focuses on the raw, unfiltered stream of tokens a model generates on its way to an answer—what the paper calls Intermediate Token Generation (ITG).
Intermediate tokens often include bits that look like human brainstorming, such as “hmm...”, “aha!”, and “but wait.” But this doesn't mean the model is using these tokens for the same purpose.
“The fact that intermediate token sequences often reasonably look like better-formatted and spelled human scratch work... doesn’t tell us much about whether they are used for anywhere near the same purposes that humans use them for, let alone about whether they can be used as an interpretable window into what the LLM is ‘thinking,’ or as a reliable justification of the final answer,” the researchers write.
Challenges of gathering reasoning traces
The very idea of training a model to "think like a human" is challenging. The initial push for reasoning data involved collecting step-by-step solutions from human annotators. But this approach has a fundamental flaw. As the paper notes, “Not only is it burdensome for people to produce granular step-by-step representations of their own thoughts, but they are unlikely to have direct and explicit access to those processes in the first place.”
To overcome this, researchers turned to more scalable, automated methods like reinforcement learning (RL) to train reasoning models.
In this RL post-training phase, a model generates numerous potential solutions, each preceded by a trace of intermediate tokens. These attempts are then checked by a verifier. Crucially, the model is typically rewarded only for producing a correct final answer, while the logical validity of the trace is ignored. The model is simply incentivized to find a path that leads to a reward, regardless of how it gets there.
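To make the incentive concrete, here is a minimal, self-contained Python sketch of an outcome-only reward. The marker string and helper names are illustrative, not taken from the paper:

```python
def split_trace_and_answer(completion: str):
    """Split a sampled completion into (trace, final answer) on a marker string."""
    trace, _, answer = completion.rpartition("Final answer:")
    return trace.strip(), answer.strip()

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Reward 1.0 if and only if the final answer matches; the trace is never inspected."""
    _trace, answer = split_trace_and_answer(completion)
    return 1.0 if answer == reference_answer else 0.0

# A rambling or logically broken trace earns full reward as long as the answer is right.
sampled = "The capital is Lyon... hmm, wait, that's wrong. Final answer: Paris"
print(outcome_reward(sampled, "Paris"))  # 1.0
```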
This training method leads to a significant disconnect between perceived reasoning and actual outcomes. The paper highlights several experiments that reveal this gap. In one line of research, models were fine-tuned on math and coding problems using datasets where the intermediate traces were intentionally filled with incorrect steps but were still paired with the correct final solution. The models' performance still improved significantly. This suggests the model is not learning logic from the trace, but rather a stylistic pattern that increases its chances of producing a correct answer.
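As a rough illustration of that experimental setup (the example itself is invented, not drawn from the datasets used in the cited work), a single fine-tuning record might pair a deliberately wrong trace with a correct answer:

```python
# Illustrative only: a corrupted-trace training example of the kind described above.
corrupted_example = {
    "problem": "What is 12 + 30?",
    "trace": "12 + 30 = 52, and then subtracting 10 gives 42.",  # logically invalid steps
    "final_answer": "42",  # still the correct answer
}
# Fine-tuning on many such records still improved accuracy in the cited experiments,
# which is hard to square with the idea that the model learns logic from the trace.
```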
Prompt augmentation
If these traces are not a form of reasoning, what are they? The paper proposes a more mechanical explanation that de-anthropomorphizes language models. Think of a transformer model as navigating a vast, high-dimensional conceptual space, often called a latent space. A prompt is converted into numbers (embeddings) that represent a starting point in this space. Each token the model generates adjusts this position. From this perspective, the intermediate tokens are not "thoughts" but a learned sequence of navigational adjustments.
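A minimal decoding loop makes this mechanical picture concrete. The sketch below uses Hugging Face transformers with GPT-2 as a stand-in model; each generated token is appended to the context and shifts the conditioning for the next step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The prompt fixes the starting point; every new token nudges the context
# (and hence the model's conditioning) toward some region of output space.
input_ids = tokenizer("Q: What is 17 * 23? Let's work it out step by step.",
                      return_tensors="pt").input_ids
for _ in range(40):
    with torch.no_grad():
        logits = model(input_ids).logits               # scores for the next token
    next_id = logits[0, -1].argmax().view(1, 1)        # greedy choice
    input_ids = torch.cat([input_ids, next_id], dim=1) # the token becomes more context

print(tokenizer.decode(input_ids[0]))
```

Nothing in this loop distinguishes "reasoning" tokens from any other tokens: whatever the model emits simply becomes part of the conditioning for what comes next.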
The model learns that generating a particular kind of "prompt augmentation" is an effective strategy for moving from the initial problem state to a region where a correct answer is highly probable. The "reasoning" is a functional tool for sequence generation, not a cognitive process. Its value lies in its ability to guide the model to the location in the latent space where the solution is more likely to be found. This can be expressed simply: the probability of finding the correct solution given the original prompt plus the extra tokens is greater than with the prompt alone.
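Put in simple probabilistic notation (our shorthand, not a formula from the paper):

P(correct answer | prompt, intermediate tokens) > P(correct answer | prompt)

The inequality says nothing about the intermediate tokens being valid or faithful; they only have to move the model to a region where the correct answer is more likely.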
Why it matters
For AI developers and practitioners, this perspective has direct, real-world implications. First, a plausible-looking reasoning trace should never be treated as sufficient justification for an answer's correctness. It is not an audit log. If an application requires high-stakes accuracy, the final output must be verified by an independent, reliable system, not by inspecting the model's "work." The trace can engender a false sense of confidence that is dangerous in critical applications.
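In practice, that means accepting an answer only after an independent check that never reads the trace. Here is a toy sketch for a simple arithmetic task; the checker function is hypothetical and domain-specific:

```python
def verify_answer(question: str, claimed: str) -> bool:
    """Recompute a toy arithmetic question independently, ignoring the model's trace."""
    expr = question.strip().rstrip("=").strip()
    try:
        # Toy only: never eval untrusted input in a real system.
        return int(claimed) == eval(expr, {"__builtins__": {}})
    except Exception:
        return False

model_output = {"trace": "hmm... 17*20=340, plus 17*3=51...", "answer": "391"}
accepted = verify_answer("17 * 23 =", model_output["answer"])
print("accepted" if accepted else "rejected")  # the trace plays no role in the decision
```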
Second, this understanding changes how we might approach prompt engineering and fine-tuning. The findings suggest that the structure and format of reasoning examples may be more important than their strict logical consistency. Since the model is learning a pattern that guides its generative process, providing well-structured examples could be a more effective strategy than laboring over logically pristine but convoluted ones. By shedding the "thinking machine" metaphor, we can approach these powerful tools more realistically, building safer and more effective applications based on what they actually do, not what they appear to be.