RLHF (Reinforcement Learning from Human Feedback)

RLHF is the post-training process where human raters score model outputs and the model is trained to produce outputs humans prefer.

Reinforcement Learning from Human Feedback is the second-stage training process that turned raw pretrained language models into the helpful, polite, useful assistants you talk to today. Pretraining on internet text produces a model that knows things but doesn't necessarily say them helpfully. RLHF shapes the behavior on top.

The standard pipeline:

  1. Pretrain: train a base model on a huge text corpus to predict next tokens.
  2. Supervised fine-tune (SFT): fine-tune on high-quality human-written examples of "good" responses.
  3. Reward model training: humans rank model outputs; a separate "reward model" is trained to predict those rankings (see the sketch after this list).
  4. RL fine-tune: use the reward model as a signal in reinforcement learning to push the LLM toward higher-ranked outputs.
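
Steps 3 and 4 are where the "human feedback" actually enters. Below is a minimal, hedged sketch of step 3 in PyTorch: it assumes each response has already been reduced to a fixed-size feature vector and uses a tiny MLP as the reward model, whereas a real pipeline scores full token sequences with an LLM backbone. The pairwise (Bradley-Terry style) loss pushes the score of the response the rater preferred above the score of the one they rejected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for the reward model: maps response features to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder data: features of the response the rater preferred ("chosen")
# and the one they ranked lower ("rejected"), for 8 preference pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

for _ in range(200):
    # Pairwise preference loss: maximize log sigmoid(r(chosen) - r(rejected)).
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In step 4, the reward model's score is typically combined with a penalty for drifting too far from the SFT model (a KL term), and an RL algorithm such as PPO nudges the LLM toward responses the reward model scores highly.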

What RLHF actually delivers:

  • Helpfulness: answers questions instead of repeating the question.
  • Harmlessness: refuses dangerous requests, avoids slurs, etc.
  • Style: concise, structured responses instead of meandering completions.
  • Format compliance: responds in JSON/Markdown when asked.

The successors to RLHF:

  • DPO (Direct Preference Optimization): simpler math, comparable results, now the default in many labs (see the sketch after this list).
  • Constitutional AI (Anthropic): uses a written constitution + model-generated feedback to scale beyond what human raters can produce.
  • RLAIF: RL from AI feedback, where another model plays the rater role.
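
DPO is easiest to see as a loss function. The sketch below is illustrative only: it assumes the log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed (in practice these come from full forward passes of the LLM), and the function name and placeholder numbers are hypothetical, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO replaces the reward model + RL loop with a direct classification-style
    # loss on preference pairs: widen the margin by which the policy prefers
    # "chosen" over "rejected", measured relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probs for a batch of 4 preference pairs.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-5.0, -6.0, -4.5, -7.0]),
    policy_rejected_logp=torch.tensor([-8.0, -6.5, -9.0, -7.5]),
    ref_chosen_logp=torch.tensor([-5.5, -6.2, -5.0, -7.1]),
    ref_rejected_logp=torch.tensor([-7.8, -6.4, -8.7, -7.4]),
)
```

Because the preference data is used directly, there is no separate reward model to train and no RL loop to stabilize, which is much of DPO's appeal.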

The whole class of techniques is what people mean by "alignment." It's also the reason the same base model can produce wildly different products depending on who post-trained it: Claude, GPT, Gemini, and Llama-Instruct all started from comparable pretraining and diverged through their alignment recipes.

FAQ

Is RLHF why Claude refuses some requests?

Partly. RLHF (plus Anthropic's Constitutional AI) trains the model to decline content that violates its policies. The specific behaviors are tuned by the lab; users sometimes find them too cautious, sometimes not cautious enough.

Want to actually build with this?

Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.

Build my stack →