PPO Training

Train language models using Proximal Policy Optimization (PPO) for reinforcement learning from human feedback (RLHF).

Overview

PPO training is a 2-step process:
  1. Train a Reward Model - Train a model to score responses (see Reward Modeling)
  2. Run PPO Training - Use the reward model to guide policy optimization

Quick Start

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",

    trainer="ppo",
    rl_reward_model_path="./reward-model",

    # PPO hyperparameters
    rl_gamma=0.99,
    rl_gae_lambda=0.95,
    rl_kl_coef=0.1,
    rl_clip_range=0.2,
    rl_num_ppo_epochs=4,

    epochs=1,
    batch_size=4,
    lr=1e-5,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Requirements

PPO training requires either --rl-reward-model-path (path to a trained reward model) or --model-ref (reference model for KL divergence). At least one must be specified.
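
For illustration, a configuration that supplies both a reward model and an explicit reference model might look like the sketch below. It assumes the Python counterpart of --model-ref is named model_ref (mirroring the CLI flag); treat that name as an assumption rather than confirmed API.

# Sketch only: PPO params with a reward model plus an explicit reference model.
# `model_ref` is an assumed parameter name mirroring the --model-ref CLI flag.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",
    trainer="ppo",
    rl_reward_model_path="./reward-model",  # scores the generated responses
    model_ref="google/gemma-3-270m",        # frozen reference model for the KL penalty (assumed name)
)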

Parameters

Core PPO Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_reward_model_path | --rl-reward-model-path | None | Path to reward model (required) |
| rl_gamma | --rl-gamma | 0.99 | Discount factor (0.9-0.99) |
| rl_gae_lambda | --rl-gae-lambda | 0.95 | GAE lambda for advantage estimation (0.9-0.99) |
| rl_kl_coef | --rl-kl-coef | 0.1 | KL divergence coefficient (0.01-0.5) |
| rl_value_loss_coef | --rl-value-loss-coef | 1.0 | Value loss coefficient (0.5-2.0) |
| rl_clip_range | --rl-clip-range | 0.2 | PPO clipping range (0.1-0.3) |
| rl_value_clip_range | --rl-value-clip-range | 0.2 | Value function clipping range |
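
To make the roles of these values concrete, the plain-Python sketch below shows how rl_gamma and rl_gae_lambda enter generalized advantage estimation and how rl_clip_range bounds the policy ratio in the clipped objective. It is illustrative only, not the trainer's internal implementation; rl_kl_coef additionally scales a KL penalty that keeps the policy close to the reference model.

# Illustrative only: how rl_gamma, rl_gae_lambda and rl_clip_range are used conceptually.

def gae_advantages(rewards, values, gamma=0.99, gae_lambda=0.95):
    """Generalized advantage estimation over one trajectory (values has len(rewards) + 1 entries)."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at step t
        gae = delta + gamma * gae_lambda * gae                   # discounted sum of TD errors
        advantages.append(gae)
    return advantages[::-1]

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one sample, where ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped = min(max(ratio, 1 - clip_range), 1 + clip_range)
    return min(ratio * advantage, clipped * advantage)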

Training Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_num_ppo_epochs | --rl-num-ppo-epochs | 4 | PPO epochs per batch |
| rl_chunk_size | --rl-chunk-size | 128 | Training chunk size |
| rl_mini_batch_size | --rl-mini-batch-size | 8 | Mini-batch size |
| rl_optimize_device_cache | --rl-optimize-device-cache | True | Memory optimization |
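
Roughly speaking, each chunk of generated rollouts is reused for several PPO epochs, and every epoch is split into mini-batches. The loop below is only a schematic of how these sizes typically interact, not the trainer's actual code.

# Schematic of how rl_chunk_size, rl_mini_batch_size and rl_num_ppo_epochs
# typically relate in PPO-style training (not the trainer's real loop).
chunk_size, mini_batch_size, num_ppo_epochs = 128, 8, 4

rollouts = list(range(chunk_size))       # stand-in for (prompt, response, reward) tuples
for epoch in range(num_ppo_epochs):      # each chunk is reused for several optimization epochs
    for start in range(0, chunk_size, mini_batch_size):
        mini_batch = rollouts[start:start + mini_batch_size]
        # compute the clipped PPO loss on mini_batch and take an optimizer step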

Generation Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_max_new_tokens | --rl-max-new-tokens | 128 | Max tokens to generate |
| rl_top_k | --rl-top-k | 50 | Top-k sampling |
| rl_top_p | --rl-top-p | 1.0 | Top-p (nucleus) sampling |
| rl_temperature | --rl-temperature | 1.0 | Generation temperature |
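
These correspond to standard sampling controls. Purely to illustrate what each knob does during rollout generation, the snippet below sets the equivalent arguments with the Hugging Face transformers generate API; the PPO trainer handles generation internally.

# Illustration of the generation parameters using the standard transformers sampling API.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

inputs = tokenizer("What is machine learning?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=128,  # rl_max_new_tokens
    top_k=50,            # rl_top_k
    top_p=1.0,           # rl_top_p
    temperature=1.0,     # rl_temperature
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))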

Advanced Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_reward_fn | --rl-reward-fn | None | Reward function: default, length_penalty, correctness, custom |
| rl_multi_objective | --rl-multi-objective | False | Enable multi-objective rewards |
| rl_reward_weights | --rl-reward-weights | None | JSON weights for multi-objective rewards |
| rl_env_type | --rl-env-type | None | RL environment type |
| rl_env_config | --rl-env-config | None | JSON environment config |

Data Format

PPO training uses prompts only (the model generates responses):
{"text": "What is machine learning?"}
{"text": "Explain quantum computing."}
{"text": "Write a haiku about coding."}

RL Environment Types

Three environment types are available:
| Environment | Description |
|---|---|
| text_generation | Standard text generation with reward scoring |
| multi_objective | Multiple reward components combined |
| preference_comparison | Compare generated responses |

Multi-Objective Rewards

Enable multiple reward signals:
params = LLMTrainingParams(
    ...
    trainer="ppo",
    rl_multi_objective=True,
    rl_env_type="multi_objective",
    rl_reward_weights='{"correctness": 1.0, "formatting": 0.1}',
)
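
Conceptually, the weights act as coefficients in a weighted sum over the individual reward components. The sketch below only illustrates that combination (the component scores are made up; the keys mirror the example above) and is not the library's internal code.

# Illustration of combining per-objective rewards as a weighted sum.
import json

weights = json.loads('{"correctness": 1.0, "formatting": 0.1}')
component_rewards = {"correctness": 0.8, "formatting": 0.5}  # hypothetical per-response scores

total_reward = sum(weights[name] * component_rewards[name] for name in weights)
print(total_reward)  # 0.8 * 1.0 + 0.5 * 0.1 = 0.85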

Example: Full RLHF Pipeline

Step 1: Train Reward Model

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./preferences.jsonl \
  --project-name reward-model \
  --trainer reward \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected
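
Given the column flags above, each line of preferences.jsonl pairs a prompt with a preferred (chosen) and a rejected response. The rows below are illustrative:
{"prompt": "What is machine learning?", "chosen": "Machine learning is the study of algorithms that improve from data.", "rejected": "It's when computers think."}
{"prompt": "Explain quantum computing.", "chosen": "Quantum computers use qubits, which can exist in superpositions of 0 and 1.", "rejected": "Just very fast computers."}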

Step 2: Run PPO Training

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model \
  --rl-kl-coef 0.1 \
  --rl-clip-range 0.2

Best Practices

  1. Start with a good base model - Fine-tune with SFT before PPO
  2. Use a well-trained reward model - Quality of rewards determines PPO success
  3. Monitor KL divergence - If the KL term grows too large, the policy is drifting too far from the original model (a minimal monitoring sketch follows this list)
  4. Start with default hyperparameters - Adjust based on training dynamics
  5. Use small learning rates - PPO is sensitive to learning rate (1e-5 to 5e-6)
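
As a minimal sketch of the KL-monitoring point above, assuming you have logits from the current policy and the reference model for the same tokens, a rough per-token KL estimate can be computed as below. The trainer reports its own KL metric, so this is only a sanity check.

# Rough per-token KL estimate between the current policy and the reference model.
import torch
import torch.nn.functional as F

def approx_kl(policy_logits, ref_logits):
    """Mean KL(policy || reference) over tokens, computed from raw logits."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(dim=-1)
    return kl.mean()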

Next Steps

Reward Modeling

Train reward models

DPO Training

Simpler alternative to PPO

GRPO Training

RL with custom environments

RL Module

Low-level RL building blocks