LLM in-context learning (ICL) is learning, but not how you think
LLMs show signs of learning with ICL, but it's a brittle, superficial process that relies more on patterns than true understanding.
In-context learning (ICL) is the capability that allows large language models (LLMs) to perform new tasks on the fly with just a few examples in the prompt, seemingly without any new training. It is one of the main capabilities developers rely on to adapt LLMs for custom applications.
A recent study by Microsoft and the University of York asks a fundamental question: is ICL “true” learning, or just a sophisticated form of pattern matching? The researchers’ main finding is a nuanced “yes, but...”.
They conclude that ICL does constitute learning in a formal sense, but it’s a brittle and superficial kind that relies heavily on statistical cues from the prompt rather than a deep understanding of the task.
This has significant practical implications for developers, suggesting that simply testing a prompt with a few examples isn’t enough to guarantee an application is robust, and that some of the most powerful prompting techniques might also be the most fragile.
What does it mean to learn?
In machine learning, learning is defined not just by getting the right answer, but by the ability to generalize. The paper uses a framework from learning theory called Probably Approximately Correct (PAC) learning. In this framework, true learning means that after seeing examples from one data distribution (let’s call it P), a learner can still perform well on new inputs from a different, unseen distribution (Q). For example, if you train a model to identify cats using photos taken in sunny parks (distribution P), real learning has occurred if it can still reliably spot a cat in a dimly lit living room (distribution Q).
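Written out informally (a sketch of the criterion, not the paper’s exact statement): if the learner sees m examples S drawn from P and outputs a predictor h_S, learning has succeeded when, with high probability, h_S still has low error against the target function f on the shifted distribution Q:

```latex
% Informal PAC-style success criterion under distribution shift (a sketch, not the paper's formalism):
% the learner trains on S ~ P^m, but its error is measured on the unseen distribution Q.
\Pr_{S \sim P^m}\left[\, \mathrm{err}_Q(h_S) \le \epsilon \,\right] \ge 1 - \delta,
\qquad
\mathrm{err}_Q(h_S) = \Pr_{x \sim Q}\left[\, h_S(x) \ne f(x) \,\right]
```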
The authors argue that ICL fits this formal definition. The LLM acts as the learner. The examples, or “shots,” provided in the prompt are its training data, drawn from distribution P. The new query you ask the model to solve is the test data from distribution Q. Because the LLM observes the examples and modifies its behavior at runtime to generate an answer for the new query, it is, by definition, a learning process. The crucial question the study explores is not if it learns, but how effective and robust this learning actually is.
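To make that mapping concrete, here is a minimal Python sketch of the setup; `call_llm` is a placeholder for whatever model API you actually use, not a real library function:

```python
# Minimal sketch of ICL framed as a learning problem.
# The in-context shots play the role of training data drawn from P;
# the queries are test inputs drawn from a different distribution Q.

def build_icl_prompt(shots: list[tuple[str, str]], query: str) -> str:
    """Format the shots (the 'training data') and the new query into one prompt."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in shots]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

def icl_accuracy(shots, test_set, call_llm) -> float:
    """Score the model on unseen queries: generalization from P to Q."""
    correct = 0
    for query, gold in test_set:
        prediction = call_llm(build_icl_prompt(shots, query)).strip()
        correct += int(prediction == gold)
    return correct / len(test_set)
```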
Putting in-context learning to the test
To dissect how ICL learns, the researchers conducted a massive empirical study involving both proprietary and open-weight LLMs: GPT-4 Turbo, GPT-4o, Mixtral 8x7B, and Phi-3.5 MoE (mixture of experts).
They generated over 1.8 million predictions on nine formal tasks chosen because, unlike the ambiguities of natural language, they have clear, objective rules, making it easier to measure true performance. The tasks ranged in complexity, from simple logic like PARITY (determining whether a binary string like 10100 has an even number of zeros) to problems requiring memory and state-tracking, like Reversal (checking whether race#ecar is a valid mirrored string) and simulating stack operations. Other tasks included solving mazes and verifying Hamiltonian paths in graphs (paths that visit every node exactly once).
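As an illustration, label functions for two of these tasks might look like this (a hypothetical sketch based on the descriptions above, not the paper’s actual task generators):

```python
def parity_label(bits: str) -> int:
    """PARITY: 1 if the binary string contains an even number of zeros, else 0."""
    return int(bits.count("0") % 2 == 0)

def reversal_label(s: str) -> int:
    """Reversal: 1 if the string is a word, a '#', and that same word mirrored."""
    left, sep, right = s.partition("#")
    return int(sep == "#" and right == left[::-1])

assert parity_label("1001") == 1         # two zeros -> even
assert parity_label("10100") == 0        # three zeros -> odd
assert reversal_label("race#ecar") == 1
assert reversal_label("race#care") == 0
```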
They tested several prompting strategies to understand what drives ICL’s successes and failures. Beyond simply providing examples (n-Shot Learning), they used more advanced techniques. One key method was chain-of-thought (CoT), where instead of just showing an input and its final answer, the prompt includes a step-by-step reasoning process. For a maze-solving task, a CoT example would walk the model through the logic: “We begin at line 5. This line contains ‘?’. The ‘?’ character is at position 3... We will now perform a search on the neighbours to find the path... Our final set of positions is down,down,right... So the answer is 1.” This encourages the model to mimic a structured thought process.
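In code terms, the difference between a plain shot and a CoT shot comes down to what each exemplar contains; here is a rough sketch in which the reasoning strings are illustrative rather than the paper’s exact wording:

```python
def plain_shot(task_input: str, answer: str) -> str:
    """n-shot exemplar: input and final answer only."""
    return f"Input: {task_input}\nAnswer: {answer}"

def cot_shot(task_input: str, steps: list[str], answer: str) -> str:
    """CoT exemplar: the same pair, with the intermediate reasoning spelled out."""
    reasoning = "\n".join(steps)
    return f"Input: {task_input}\nReasoning:\n{reasoning}\nAnswer: {answer}"

maze_example = cot_shot(
    task_input="<maze grid here>",
    steps=[
        "We begin at line 5. This line contains '?'.",
        "Search the neighbours of '?' to find the path.",
        "Our final set of positions is down,down,right.",
    ],
    answer="1",
)
```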
To test whether models were truly understanding instructions or just reacting to statistical patterns, the researchers used a clever control: “Word Salad” prompts. In these experiments, they replaced meaningful instructions with random nonsense words. For example, a task description like “Your job is to learn what is the likelihood of a string to be labeled 0 or 1” was replaced with “Every zumpus is a shumpus. Polly is a lorpus.” The input data and examples, however, remained unchanged. They even created a Salad-of-Thought (SoT) where the step-by-step reasoning was also filled with gibberish, isolating the structural format of the reasoning from its semantic content.
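A simple sketch of that control might look like the following, where the nonsense vocabulary and prompt layout are assumptions rather than the paper’s exact materials:

```python
import random

NONSENSE_WORDS = ["zumpus", "shumpus", "lorpus", "grimpus", "yumpus"]

def word_salad(num_sentences: int = 2) -> str:
    """Generate meaningless instruction text, e.g. 'Every zumpus is a shumpus.'"""
    sentences = []
    for _ in range(num_sentences):
        a, b = random.sample(NONSENSE_WORDS, 2)
        sentences.append(f"Every {a} is a {b}.")
    return " ".join(sentences)

def salad_prompt(shots_block: str, query: str) -> str:
    """The instruction is gibberish; the in-context examples and query are unchanged."""
    return f"{word_salad()}\n\n{shots_block}\n\nInput: {query}\nLabel:"
```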
The results from these experiments revealed a more complex picture of ICL than is commonly assumed. Contrary to the popular “few-shot” narrative, model performance consistently improved with more examples, often peaking between 50 and 100 shots. As the number of exemplars grew, the performance gap between the different LLMs and prompting styles began to shrink, suggesting that with enough data, the underlying learning mechanism becomes more important than the specific model or prompt.
While often the best-performing strategy, CoT also proved to be a double-edged sword. It was the most brittle and sensitive to “out-of-distribution” (OOD) data (inputs that were structurally different from the examples in the prompt). This suggests that CoT helps the model overfit to the specific statistical patterns in the exemplars, making it powerful but not robust. Perhaps most surprisingly, the “Word Salad” prompts eventually performed nearly as well as those with clear, human-readable instructions. This indicates that LLMs lean heavily on the statistical structure of the prompt, often more so than the actual meaning of the words.
The study concludes that ICL is a form of learning where performance scales with the number of examples provided. However, it is an “ad hoc” process that over-focuses on spurious statistical features within the prompt. This makes it an unreliable mechanism for robust generalization, especially for tasks that require a deeper, more abstract understanding.