How to train reasoning models with very few examples
You don't need thousands of SFT instances or heavy RL training. A few hundred well-curated instances will do.
A new study by researchers at Shanghai Jiao Tong University shows that you can train an LLM for complex reasoning tasks with a small batch of well-curated examples. The concept, which they describe as “less is more for reasoning” (LIMO), challenges the established notion that training reasoning models requires thousands of supervised fine-tuning (SFT) examples, as well as the more recent focus on compute-intensive reinforcement learning (RL).
In their experiments, the researchers show that a Qwen2.5-32B-Instruct model fine-tuned on a LIMO dataset of just 817 training examples outperforms SFT models trained on a hundred times more examples on the challenging MATH and AIME benchmarks. It also outperforms the open-source reasoning model QwQ-32B-Preview and OpenAI’s flagship o1-preview model, both of which were trained with far more data and compute.
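The training recipe itself is plain supervised fine-tuning; the novelty is in the data. Below is a minimal sketch of what that could look like with Hugging Face’s trl library. The dataset ID, field names, and hyperparameters are my assumptions for illustration, not the paper’s exact configuration.

```python
# Minimal SFT sketch in the spirit of LIMO, using Hugging Face trl.
# The dataset ID, field names, and hyperparameters below are assumptions
# for illustration, not the paper's exact configuration.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL = "Qwen/Qwen2.5-32B-Instruct"  # model used in the paper; a smaller one works for experimenting

tokenizer = AutoTokenizer.from_pretrained(MODEL)
dataset = load_dataset("GAIR/LIMO", split="train")  # assumed Hugging Face dataset ID

def to_text(example):
    # Assumed fields: a problem statement and a long-form reasoning solution.
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["solution"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

trainer = SFTTrainer(
    model=MODEL,
    train_dataset=dataset.map(to_text),
    args=SFTConfig(
        output_dir="limo-sft",
        dataset_text_field="text",
        num_train_epochs=3,              # illustrative; tune for your data
        per_device_train_batch_size=1,   # a 32B model needs multi-GPU sharding in practice
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```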
LIMO-trained models also generalize to out-of-distribution (OOD) examples. For example, their LIMO model outperformed QwQ on the OlympiadBench scientific benchmark and scored 66.7% on GPQA, close to OpenAI-o1-preview’s 73.3%.
According to the researchers, there are two key reasons that current LLMs can learn complex reasoning tasks through LIMO:
1. Their training data contains abundant math and coding examples, so rich reasoning knowledge is already encoded in their parameters. LIMO training elicits that inherent knowledge more efficiently.
2. New post-training techniques based on inference-time scaling laws let models generate extended reasoning chains, which improves their ability to apply that reasoning knowledge.
Per the researchers: “if models possess rich reasoning knowledge and are given adequate computational space, then activating their reasoning capabilities may require only a small number of high-quality training samples that encourage extended deliberation, rather than massive fine-tuning datasets.”
It is worth noting that LIMO requires a more thorough data curation process. But because the dataset is limited to a few hundred examples, the effort remains manageable even for resource-constrained organizations.
According to the researchers, creating useful LIMO datasets comes down to choosing the right problems and solutions. Prioritize challenging problems that require complex reasoning chains, diverse thought processes, and knowledge integration. Also make sure the problems deviate from the model’s training distribution to encourage new reasoning approaches and push it toward generalization. The paper’s experiments show that, all other things being equal, models trained on more challenging examples generalize better across different reasoning tasks.
Accordingly, craft solutions that are clear and well-organized, with reasoning steps adapted to the complexity of the problem. High-quality solutions should also provide strategic educational support, gradually building understanding through carefully structured explanations.
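The problem-selection criterion can be operationalized as a difficulty filter: sample a baseline model several times on each candidate problem and keep the ones it fails most often. The sketch below illustrates the idea; `solve_attempts` and `is_correct` are hypothetical placeholders for your own sampling and answer-checking harness, and the paper’s actual pipeline also screens for diversity and solution quality.

```python
# Hedged sketch of the problem-selection idea: rank candidate problems by a
# baseline model's pass rate and keep the hardest ones. solve_attempts() and
# is_correct() are hypothetical placeholders for a real sampling and
# answer-checking harness.
from dataclasses import dataclass

@dataclass
class Candidate:
    problem: str
    reference_answer: str

def solve_attempts(problem: str, n: int) -> list[str]:
    """Hypothetical: sample n candidate solutions from a baseline model."""
    raise NotImplementedError("plug in your model-sampling harness")

def is_correct(attempt: str, reference_answer: str) -> bool:
    """Hypothetical: check whether an attempt reaches the reference answer."""
    raise NotImplementedError("plug in your answer checker")

def pass_rate(candidate: Candidate, n_samples: int = 8) -> float:
    """Fraction of sampled solutions that reach the reference answer."""
    attempts = solve_attempts(candidate.problem, n=n_samples)
    return sum(is_correct(a, candidate.reference_answer) for a in attempts) / n_samples

def select_hard_problems(pool: list[Candidate], budget: int = 817) -> list[Candidate]:
    # Lower pass rate = harder problem, which per the paper yields a
    # stronger signal for eliciting extended reasoning.
    return sorted(pool, key=pass_rate)[:budget]
```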
The researchers have released the code and data used to train the LIMO models and plan to expand the concept to other domains and applications.
I think this can prove to be a game-changer for creating customized reasoning models. Current approaches (SFT and RL) are slow, expensive, or both. Crafting a few hundred high-quality examples, on the other hand, is an endeavor that many companies can tackle.