Training

DPO

Direct Preference Optimization. A simpler alternative to RLHF for tuning model behavior.

DPO (Direct Preference Optimization) is a fine-tuning technique that trains a model to prefer one output over another using paired preference data (a chosen response and a rejected response for the same prompt). Unlike full RLHF, it needs neither a separate reward model nor a reinforcement-learning loop: the policy is optimized directly on the preference pairs with a classification-style loss, regularized against a frozen reference model. In practice DPO is simpler and cheaper to run than RLHF, and it performs comparably on most preference-tuning tasks.
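The loss described above can be sketched for a single preference pair. This is a minimal illustration, not a training recipe: the function name and the scalar sequence log-probabilities passed in are assumptions for the example (in a real setup these would come from the policy and the frozen reference model), and `beta` is the usual DPO temperature hyperparameter.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    Pushes the policy to widen its (chosen - rejected) log-probability
    margin relative to a frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); log1p for stability
    return math.log1p(math.exp(-margin))
```

When the policy and reference agree on both responses, the margin is zero and the loss sits at log 2; as the policy learns to favor the chosen response more strongly than the reference does, the loss falls toward zero.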

