Training

RLHF

Reinforcement Learning from Human Feedback. The classic preference-tuning method.

RLHF trains a reward model on human preference pairs, then uses reinforcement learning (typically PPO) to update the LLM to maximize that reward while staying close to the original model. It is what made early ChatGPT and Claude useful. As of 2026, DPO has largely replaced RLHF because it is far simpler to set up, but RLHF still wins at the largest scale and for the subtlest preference targets.
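The two stages look roughly like the sketch below. This is a minimal PyTorch illustration, not a production recipe: `reward_model`, `policy`, and `ref_policy` are assumed to be modules that return a per-sequence scalar score or log-probability, and the per-token reward shaping, value function, and GAE used in full PPO implementations are omitted.

```python
# Minimal RLHF sketch (illustrative names and shapes, not a real training loop).
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Stage 1: Bradley-Terry loss on a human preference pair.
    The human-preferred response should score higher than the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def ppo_policy_loss(policy, ref_policy, reward_model, response_ids,
                    old_logprobs, clip_eps=0.2, kl_coef=0.1):
    """Stage 2: clipped PPO objective on sampled responses, with a KL penalty
    that keeps the tuned policy close to the frozen reference model."""
    logprobs = policy(response_ids)              # current per-sequence log-probs
    with torch.no_grad():
        ref_logprobs = ref_policy(response_ids)  # frozen reference log-probs
        reward = reward_model(response_ids)      # learned scalar reward

    # KL-shaped reward, computed from rollout-time (old) log-probs so it is
    # constant w.r.t. the current parameters; this discourages the policy from
    # drifting far from the reference and reward-hacking the learned reward.
    advantage = reward - kl_coef * (old_logprobs - ref_logprobs)

    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```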

