What to know about the limits of RLHF
There is a lot of excitement about the benefits of reinforcement learning from human feedback (RLHF), but its limitations are much less discussed.
A new paper by researchers at various institutions discusses these limitations in depth and provides suggestions for improving RLHF and making large language models much more robust.
The paper examines the challenges of each RLHF component separately and also discusses problems that can't be solved through modifications to RLHF alone.
Key findings:
There are three key components to RLHF, each with its own challenges: collecting human feedback, training the reward model, and optimizing the policy
Limits of human feedback: Human annotators have their own biases, often disagree with one another, can have malicious intent, and fall into cognitive traps (e.g., they will give a high rating to a false answer that is phrased in a confident tone)
Limits of reward model: When you compress complex human preferences into a single scalar reward, you lose important signal; reward models are also prone to “reward hacking” (i.e., the policy learns to exploit flaws in the reward signal rather than the intended behavior)
Limits of policy: The RL loop can be influenced by the training data of the pre-trained model; RL agents are prone to adversarial attacks; RL-trained models can suffer from “mode collapse” (converging on a narrow range of responses)
Remedies: optimize the feedback process to generate more long-form responses; use reward models that learn multi-modal distributions (see the sketch after this list); put more effort into filtering the data used to train the foundation model.
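To make the reward-model point more concrete, here is a minimal sketch, not taken from the paper, contrasting the usual scalar reward head with a hypothetical head that predicts a small Gaussian mixture over rewards, so that disagreement between annotators can show up as separate modes instead of being averaged into one number. All class and variable names are illustrative assumptions.

```python
# Illustrative sketch only: contrasts a scalar reward head with a
# mixture-based ("multi-modal") reward head. Names are hypothetical.
import torch
import torch.nn as nn

class ScalarRewardHead(nn.Module):
    """Typical RLHF setup: collapse preferences into one reward value."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # One number per response; conflicting preferences get averaged away.
        return self.linear(hidden_state).squeeze(-1)

class MixtureRewardHead(nn.Module):
    """Hypothetical alternative: predict a small Gaussian mixture over
    rewards, so conflicting annotator preferences can appear as
    separate modes rather than a single averaged score."""
    def __init__(self, hidden_size: int, num_modes: int = 3):
        super().__init__()
        self.num_modes = num_modes
        # For each mode: a mixture weight, a mean reward, and a log-variance.
        self.proj = nn.Linear(hidden_size, num_modes * 3)

    def forward(self, hidden_state: torch.Tensor):
        params = self.proj(hidden_state).view(-1, self.num_modes, 3)
        weights = torch.softmax(params[..., 0], dim=-1)  # mixture weights
        means = params[..., 1]                           # per-mode reward
        stds = torch.exp(0.5 * params[..., 2])           # per-mode spread
        return weights, means, stds

# Usage: feed the language model's last hidden state for each response.
hidden = torch.randn(4, 768)                  # batch of 4 responses
scalar_reward = ScalarRewardHead(768)(hidden)
weights, means, stds = MixtureRewardHead(768)(hidden)
print(scalar_reward.shape, means.shape)       # torch.Size([4]) torch.Size([4, 3])
```

The design choice being illustrated: a distributional head keeps information about how spread out (or split) human preferences are, which a single scalar discards.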
Read the full article on TechTalks.