RLHF (Reinforcement Learning from Human Feedback)

RLHF is the post-training process where human raters score model outputs and the model is trained to produce outputs humans prefer.

Reinforcement Learning from Human Feedback is the second-stage training process that turned raw pretrained language models into the helpful, polite, useful assistants you talk to today. Pretraining on internet text produces a model that knows things but doesn't necessarily say them helpfully. RLHF shapes the behavior on top.

The standard pipeline:

  1. Pretrain: train a base model on a huge text corpus to predict next tokens.
  2. Supervised fine-tune (SFT): fine-tune on high-quality human-written examples of "good" responses.
  3. Reward model training: humans rank model outputs; a separate "reward model" is trained to predict those rankings (see the sketch after this list).
  4. RL fine-tune: use the reward model as a signal in reinforcement learning to push the LLM toward higher-ranked outputs.
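
Steps 3 and 4 are where the "human feedback" actually enters. Below is a minimal, hedged sketch of step 3 in PyTorch: it assumes each response has already been reduced to a fixed-size feature vector and uses a tiny MLP as the reward model, whereas a real pipeline scores full token sequences with an LLM backbone. The pairwise (Bradley-Terry style) loss pushes the score of the response the rater preferred above the score of the one they rejected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for the reward model: maps response features to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder data: features of the response the rater preferred ("chosen")
# and the one they ranked lower ("rejected"), for 8 preference pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

for _ in range(200):
    # Pairwise preference loss: maximize log sigmoid(r(chosen) - r(rejected)).
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In step 4, the reward model's score is typically combined with a penalty for drifting too far from the SFT model (a KL term), and an RL algorithm such as PPO nudges the LLM toward responses the reward model scores highly.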

What RLHF actually delivers:

  • Helpfulness: answers questions instead of repeating the question.
  • Harmlessness: refuses dangerous requests, avoids slurs, etc.
  • Style: concise, structured responses instead of meandering completions.
  • Format compliance: responds in JSON/Markdown when asked.

The successors to RLHF:

  • DPO (Direct Preference Optimization): simpler math, comparable results, now the default in many labs (see the sketch after this list).
  • Constitutional AI (Anthropic): uses a written constitution + model-generated feedback to scale beyond what human raters can produce.
  • RLAIF: RL from AI feedback, where another model plays the rater role.
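
DPO is easiest to see as a loss function. The sketch below is illustrative only: it assumes the log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed (in practice these come from full forward passes of the LLM), and the function name and placeholder numbers are hypothetical, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO replaces the reward model + RL loop with a direct classification-style
    # loss on preference pairs: widen the margin by which the policy prefers
    # "chosen" over "rejected", measured relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probs for a batch of 4 preference pairs.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-5.0, -6.0, -4.5, -7.0]),
    policy_rejected_logp=torch.tensor([-8.0, -6.5, -9.0, -7.5]),
    ref_chosen_logp=torch.tensor([-5.5, -6.2, -5.0, -7.1]),
    ref_rejected_logp=torch.tensor([-7.8, -6.4, -8.7, -7.4]),
)
```

Because the preference data is used directly, there is no separate reward model to train and no RL loop to stabilize, which is much of DPO's appeal.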

The whole class of techniques is what people mean by "alignment." It's also the reason the same base model can produce wildly different products depending on who post-trained it: Claude, GPT, Gemini, and Llama-Instruct all started from comparable pretraining and diverged through their alignment recipes.

FAQ

Is RLHF why Claude refuses some requests?

Partly. RLHF (plus Anthropic's Constitutional AI) trains the model to decline content that violates its policies. The specific behaviors are tuned by the lab; users sometimes find them too cautious, sometimes not cautious enough.

Want to actually build with this?

Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.

Build my stack →