How reinforcement learning generalizes LLM behavior
Just let the model find its own solutions and stop holding its hand.
Language models can generalize better when they are left to find their own solutions rather than being trained on hand-crafted examples, according to a study by researchers at the University of Hong Kong and the University of California, Berkeley.
For a long time, supervised fine-tuning (SFT) has been the gold standard for training LLMs and VLMs. After pre-training models on raw text and image data, companies use hand-crafted examples to fine-tune them for specific behaviors or output formats. Only after SFT does the model undergo additional steps such as reinforcement learning from human feedback (RLHF) or RL from AI feedback.
Gathering SFT data can be slow and costly. Recently, AI labs have shown that pure reinforcement learning (RL) approaches can provide impressive results. This was highlighted by DeepSeek-R1, an open weights model that competes with OpenAI’s reasoning models.
The new study focuses on the generalization abilities of RL and SFT training in textual and visual reasoning tasks. The idea is that a model trained on a specific set of rules or visual environments should also be able to handle variants of those rules and environments.
The researchers chose two tasks for their experiments: GeneralPoints (a card game that requires arithmetic reasoning) and V-IRL (a real-world navigation task that requires spatial reasoning).
They chose Llama-3.2-Vision-11B as the backbone model for their tests. They warmed up training with a small SFT dataset for each task, then separately scaled training with RL (where the model explores solutions and generates its own training examples) and with SFT (where the model is trained only on hand-crafted examples). They then tested the resulting models on in-domain and out-of-domain examples in text-only and multimodal settings.
For example, in the GeneralPoints benchmark, the model might be trained on one set of arithmetic rules but tested on a different set. In the visual tasks, in addition to changing the rules of the problems, they also changed visual settings such as colors, shapes, and environment details.
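To make the setup concrete, below is a minimal sketch of what a GeneralPoints-style verifier might look like. The details are illustrative assumptions rather than the paper's actual implementation: four cards must be combined with arithmetic to hit a target of 24, and the rule variant changes how face cards are valued (all faces count as 10 versus J=11, Q=12, K=13).

```python
import ast
import re
from typing import Dict, List

# Hypothetical rule variants for how face cards map to numbers; the paper's
# exact rule sets may differ.
RULE_VARIANTS: Dict[str, Dict[str, int]] = {
    "all_faces_are_10": {"J": 10, "Q": 10, "K": 10},
    "faces_are_11_12_13": {"J": 11, "Q": 12, "K": 13},
}


def card_values(cards: List[str], rule: str) -> List[int]:
    """Convert card symbols to numbers under a given rule variant."""
    faces = RULE_VARIANTS[rule]
    values = []
    for card in cards:
        if card in faces:
            values.append(faces[card])
        elif card == "A":
            values.append(1)
        else:
            values.append(int(card))
    return values


def verify(expression: str, cards: List[str], rule: str, target: int = 24) -> bool:
    """Return True if the expression uses each card value exactly once and
    evaluates to the target under the chosen rule variant."""
    # Only allow digits, basic arithmetic operators, and parentheses.
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return False
    # Every card value must appear exactly once, no more and no fewer.
    used = sorted(int(tok) for tok in re.findall(r"\d+", expression))
    if used != sorted(card_values(cards, rule)):
        return False
    try:
        # Parsing with ast keeps evaluation restricted to a pure arithmetic literal.
        tree = ast.parse(expression, mode="eval")
        result = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except (SyntaxError, ZeroDivisionError, ValueError):
        return False
    return abs(result - target) < 1e-6


# In-domain check: same rule the model saw during training (K counts as 10).
print(verify("(10 - 4) * (5 - 1)", ["K", "4", "5", "A"], "all_faces_are_10"))    # True
# Out-of-domain check: K is re-valued to 13, so the memorized answer fails.
print(verify("(10 - 4) * (5 - 1)", ["K", "4", "5", "A"], "faces_are_11_12_13"))  # False
```

The last two calls show why a memorized solution can pass the in-domain check yet fail the out-of-domain one: the same expression no longer matches the re-valued cards.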
Their findings show that reinforcement learning consistently improves performance on out-of-domain (OOD) examples, while SFT seems to memorize the training rules and doesn't generalize. These observations apply to both text-only and multimodal settings.
Despite these findings, the researchers point out that SFT is crucial for stabilizing the model's behavior. Without the initial SFT warmup stage, RL training did not achieve desirable results.
This is a bit different from the results obtained by DeepSeek-R1-Zero, which was post-trained on pure RL. Nonetheless, it is clear that reinforcement learning has a lot of potential. As long as you can formulate your problem in a way that can be objectively verified, there is a great chance you can use RL to train the model on it.
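As a toy illustration of what "objectively verifiable" means in practice, the sketch below (an illustrative example, not the paper's training code) turns an arithmetic question into a binary reward: it extracts the final number from a model's free-form completion and compares it to the ground truth. A scalar signal like this is all an RL algorithm such as PPO or GRPO needs to optimize against.

```python
import re
from typing import Optional


def extract_final_number(completion: str) -> Optional[float]:
    """Pull the last number out of a free-form model completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(matches[-1]) if matches else None


def verifiable_reward(completion: str, ground_truth: float) -> float:
    """Binary reward: 1.0 if the completion's final answer matches the
    objectively checkable ground truth, 0.0 otherwise."""
    answer = extract_final_number(completion)
    if answer is None:
        return 0.0
    return 1.0 if abs(answer - ground_truth) < 1e-6 else 0.0


# Example rollouts for the prompt "What is 17 * 24?" (ground truth: 408).
print(verifiable_reward("17 * 24 = 408, so the answer is 408.", 408))  # 1.0
print(verifiable_reward("The answer is 398.", 408))                    # 0.0
```

In a real pipeline, a reward like this would be computed over sampled completions during RL training; the point is simply that no hand-crafted demonstrations are required, only a checker.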
Interesting concept. I was also thinking that as AI advances we should "stop holding its hand," as you put it.
"Without the initial SFT warmup stage stage, RL training did not achieve desirable results."
Indeed. RL can go beyond imitation, but left to its own devices it can go wild. So, adult supervision required, but not helicopter parenting. :)