DeepSeek's new reward model takes RL to open-domain tasks
Generative reward modeling uses principles and critiques to help LLMs learn to reason about tasks without explicit ground-truth signals
DeepSeek made a splash a few months ago by training its reasoning model with pure reinforcement learning through outcome-based rewards. Now it is taking the next step with Self-Principled Critique Tuning (SPCT), a technique for creating generalist reward models that can be applied not only to coding and math tasks but also to a wide range of other applications.
Reinforcement learning (RL) has become a key component of fine-tuning LLMs for instruction-following and reasoning capabilities. Reward models (RMs) are an essential part of RL, providing reward or penalty signals for the LLM's responses and steering the model in the right direction.
Unfortunately, RMs are usually limited to narrow domains with clear-cut rules or easily verifiable answers. For example, DeepSeek-R1 underwent an RL phase, in which it was trained on math and coding problems where the ground truth is clearly defined.
Designing reward models for complex, open-ended, or subjective tasks without explicit ground truths is very difficult and usually requires a lot of manual data labeling.
The DeepSeek researchers highlight four key challenges in creating generalist RMs:
Handling various input types and evaluating one or more responses at the same time.
Generating accurate reward signals across diverse domains without clear ground-truth signals.
Improving the quality of rewards when inference-time compute is scaled.
Learning behaviors that scale with inference-time compute.
Reward models come in different flavors, differing in how they generate rewards (e.g., scalar scores versus generated critiques) and how they score responses (e.g., pointwise versus pairwise).
The DeepSeek researchers propose “pointwise generative reward modeling” (GRM) as the base for a scalable and flexible generalist reward model. Pointwise scoring assigns an individual score to each response, while generative reward models produce textual critiques from which those scores are extracted.
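As a rough illustration, a pointwise GRM can be prompted to write its principles and critiques first and end with one score per response, which a small parser then extracts. This is a minimal sketch: the prompt template and the "Score i:" output format are assumptions for demonstration, not DeepSeek's actual format.

```python
import re

def build_grm_prompt(query: str, responses: list[str]) -> str:
    """Ask the GRM to state principles, critique each response, and end with one score per response."""
    numbered = "\n".join(f"[Response {i + 1}]\n{r}" for i, r in enumerate(responses))
    return (
        f"Query:\n{query}\n\n{numbered}\n\n"
        "First state the principles relevant to judging these responses, then "
        "critique each response against them, and finish with one line per "
        "response in the form 'Score i: <1-10>'."
    )

def extract_pointwise_scores(grm_output: str, num_responses: int) -> list[int]:
    """Pull the individual (pointwise) score for each response out of the generated critique text."""
    scores = [0] * num_responses
    for match in re.finditer(r"Score\s*(\d+)\s*:\s*(\d+)", grm_output):
        idx, value = int(match.group(1)) - 1, int(match.group(2))
        if 0 <= idx < num_responses:
            scores[idx] = value
    return scores
```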
In their preliminary experiments on GPT-4o and Gemma-2-27B, the researchers found that “certain principles could guide reward generation within proper criteria for GRMs” and that “inference-time scalability of RM might be achieved by scaling the generation of high-quality principles and accurate critiques.”
Based on these findings, they developed Self-Principled Critique Tuning (SPCT), a technique that trains the GRM to dynamically generate principles and critiques based on prompts and responses.
Instead of relying on hard-coded principles, SPCT trains the GRM to generate principles from the input query and responses as part of the reward generation process, allowing it to adapt dynamically to each task.
SPCT involves two main phases:
Rejective fine-tuning: The GRM is trained to generate principles, critiques, and rewards for given queries/responses in the required format. The GRM’s outputs are accepted only if the reward aligns with the ground truth.
Rule-based RL: The model is further fine-tuned through outcome-based reinforcement learning. This encourages the GRM to learn how to generate effective principles and accurate critiques dynamically and in a scalable way.
"By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains," the researchers write.
The GRM processes the same input (query and responses) multiple times, generating different sets of principles and critiques. The final reward is determined by aggregating the scores, drawing on a wider range of perspectives.
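A rough sketch of how that sampling and aggregation could look, again assuming the hypothetical `grm.generate` call and the score-extraction helper from above; simple summation stands in for whatever aggregation scheme is actually used.

```python
def sample_and_vote(grm, query: str, responses: list[str], k: int = 8) -> list[int]:
    """Sample k independent principle/critique sets and sum the pointwise
    scores per response, so the final reward reflects many perspectives."""
    prompt = build_grm_prompt(query, responses)
    totals = [0] * len(responses)
    for _ in range(k):
        output = grm.generate(prompt)  # fresh principles and critiques each time
        for i, score in enumerate(extract_pointwise_scores(output, len(responses))):
            totals[i] += score
    return totals  # the response with the highest total is preferred
```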
The researchers also introduce a “meta RM,” a separate lightweight reward model that rates the quality of the principles and critiques generated by the primary GRM and filters out low-quality ones.
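In code, that filtering step might look like the sketch below, where `meta_rm.score` is a placeholder for however the meta RM rates a sampled critique; only the highest-rated samples are kept before voting.

```python
def meta_guided_vote(grm, meta_rm, query: str, responses: list[str],
                     k: int = 8, keep: int = 4) -> list[int]:
    """Score each sampled critique with a lightweight meta RM, drop the
    low-quality ones, and aggregate only the survivors."""
    prompt = build_grm_prompt(query, responses)
    samples = [grm.generate(prompt) for _ in range(k)]
    survivors = sorted(samples, key=meta_rm.score, reverse=True)[:keep]
    totals = [0] * len(responses)
    for output in survivors:
        for i, score in enumerate(extract_pointwise_scores(output, len(responses))):
            totals[i] += score
    return totals
```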
The researchers applied SPCT to Gemma-2-27B to create DeepSeek-GRM-27B, and evaluated it against other reward techniques such as LLM-as-a-Judge, scalar RMs, and semi-scalar RMs.
DeepSeek-GRM-27B outperformed baseline methods trained on the same data. The model could also leverage inference-time scaling, outperforming larger models such as Nemotron-4-340B-Reward and GPT-4o when given a larger compute budget to sample more principles and critiques and filter them with the meta RM.
Interestingly, SPCT showed less bias across different domains compared to scalar RMs, which often performed well on verifiable tasks but poorly elsewhere.
The DeepSeek team suggests that future directions could include “integrating GRMs into online RL pipelines as versatile interfaces of reward systems, exploring inference-time co-scaling with policy models, or serving as robust offline evaluators for foundation models.”