DPO Training

Direct Preference Optimization aligns models with human preferences without reward modeling.

What is DPO?

DPO (Direct Preference Optimization) is a simpler alternative to RLHF. Instead of training a separate reward model, DPO directly optimizes the model to prefer chosen responses over rejected ones.
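
Formally, DPO minimizes a logistic loss over preference triples (x, y_w, y_l), where y_w is the chosen and y_l the rejected response. The objective below is the standard DPO formulation from the original paper, shown for orientation rather than as a transcript of this trainer's internals:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is the frozen reference model, and \beta is the dpo_beta parameter described below.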

Quick Start

aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./preferences.jsonl \
  --project-name llama-dpo \
  --trainer dpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --dpo-beta 0.1 \
  --peft
DPO requires --prompt-text-column and --rejected-text-column. The --text-column defaults to "text", so only specify it if your chosen column has a different name.

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",
    data_path="./preferences.jsonl",
    project_name="llama-dpo",

    trainer="dpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,
    max_completion_length=None,  # Default: None

    epochs=1,
    batch_size=2,
    gradient_accumulation=4,
    lr=5e-6,

    peft=True,
    lora_r=16,
    lora_alpha=32,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Data Format

DPO requires preference pairs: a prompt with chosen and rejected responses.
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "France's capital is London."
}
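
A quick way to assemble this file is to write one JSON object per line. This is a minimal sketch (the pairs list and the preferences.jsonl name are placeholders matching the examples on this page):

import json

pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France's capital is London.",
    },
]

# One JSON object per line; keys must match the column names passed to the trainer
with open("preferences.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")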

Multiple Turns

For multi-turn preference data, provide the prompt as a messages list. The prompt marks where the shared context ends — everything after is the diverging chosen/rejected trajectory.
{
  "prompt": [
    {"role": "user", "content": "Book me a hotel"},
    {"role": "assistant", "content": "Sure, let me search."}
  ],
  "chosen": [
    {"role": "user", "content": "Book me a hotel"},
    {"role": "assistant", "content": "Sure, let me search."},
    {"role": "user", "content": "In Paris please"},
    {"role": "assistant", "content": "Done, booked Hotel Lumiere."}
  ],
  "rejected": [
    {"role": "user", "content": "Book me a hotel"},
    {"role": "assistant", "content": "Sure, let me search."},
    {"role": "user", "content": "In Paris please"},
    {"role": "assistant", "content": "I cannot do that."}
  ]
}
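
Because the chosen and rejected conversations must diverge only after the shared prompt, a simple sanity check is to verify that both start with the prompt turns. This helper is illustrative only, not part of the library:

def check_preference_example(example):
    """Verify that chosen and rejected both begin with the shared prompt turns."""
    prompt = example["prompt"]
    for key in ("chosen", "rejected"):
        if example[key][: len(prompt)] != prompt:
            raise ValueError(f"'{key}' does not start with the shared prompt")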

Parameters

Parameter              Description                 Default
trainer                Set to "dpo"                Required
dpo_beta               KL penalty coefficient      0.1
max_completion_length  Max tokens for response     None
model_ref              Reference model (optional)  None (uses base model)

Beta

The beta parameter controls how much the model can deviate from the reference:
  • 0.01-0.05: Aggressive optimization (may overfit)
  • 0.1: Standard (recommended)
  • 0.5-1.0: Conservative (stays close to reference)
# Conservative training
params = LLMTrainingParams(
    ...
    trainer="dpo",
    dpo_beta=0.5,  # Higher = more conservative
)

Reference Model

When model_ref is None (the default), DPO uses the initial model as the reference. You can specify a different one:
params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",  # Model to train
    model_ref="meta-llama/Llama-3.2-1B-base",  # Reference model
    ...
    trainer="dpo",
)

Training Tips

Use LoRA

DPO works well with LoRA:
params = LLMTrainingParams(
    ...
    trainer="dpo",
    peft=True,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)

Lower Learning Rate

DPO is sensitive to learning rate:
params = LLMTrainingParams(
    ...
    trainer="dpo",
    lr=5e-7,  # Much lower than SFT
)

Fewer Epochs

DPO typically needs fewer epochs:
params = LLMTrainingParams(
    ...
    trainer="dpo",
    epochs=1,  # Often 1-3 epochs is enough
)

Example: Helpful Assistant

Create a more helpful assistant:
params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",
    data_path="./helpfulness_prefs.jsonl",
    project_name="helpful-assistant",

    trainer="dpo",
    dpo_beta=0.1,
    max_completion_length=512,

    epochs=1,
    batch_size=2,
    gradient_accumulation=8,
    lr=1e-6,

    peft=True,
    lora_r=32,
    lora_alpha=64,

    log="wandb",
)
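
As in the Python API example above, launch the run with AutoTrainProject:

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()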

DPO vs ORPO

Aspect           DPO                     ORPO
Reference model  Required                Not required
Memory usage     Higher                  Lower
Training speed   Slower                  Faster
Use case         Fine-grained alignment  Combined SFT + alignment

Collecting Preference Data

Human Annotation

  1. Generate multiple responses per prompt
  2. Have annotators rank responses
  3. Create chosen/rejected pairs, as sketched below
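
A minimal sketch of step 3, assuming each prompt comes back with a best-to-worst ranking (this ranking format is an assumption, not something the trainer prescribes):

def pairs_from_ranking(prompt, ranked_responses):
    """Turn a best-to-worst ranking into chosen/rejected preference pairs."""
    pairs = []
    for i, chosen in enumerate(ranked_responses):
        # Every response ranked below the current one becomes a rejected partner
        for rejected in ranked_responses[i + 1:]:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs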

LLM-as-Judge

def create_preference_pairs(prompts, model_responses, judge):
    """Use an LLM judge (e.g. GPT-4) to pick the better of two responses."""
    pairs = []
    for p, (a, b) in zip(prompts, model_responses):
        # judge(p, a, b) returns "a" or "b"; supply your own GPT-4 API wrapper
        better, worse = (a, b) if judge(p, a, b) == "a" else (b, a)
        pairs.append({"prompt": p, "chosen": better, "rejected": worse})
    return pairs

Next Steps

  • ORPO Training: Combined SFT + alignment
  • Reward Modeling: Train reward models