PPO Training
Train language models using Proximal Policy Optimization (PPO) for reinforcement learning from human feedback (RLHF).
Overview
PPO training is a two-step process:
- Train a Reward Model - Train a model to score responses (see Reward Modeling)
- Run PPO Training - Use the reward model to guide policy optimization
Quick Start
Python API
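This page does not reproduce the Python entry point, so the sketch below is purely hypothetical: only the rl_* parameter names come from the tables further down, while the RLTrainer class and its train() call are assumptions, not the library's confirmed API.

```python
# Hypothetical sketch -- trainer class and method names are assumptions.
# Only the rl_* parameter names are taken from the parameter tables below.

ppo_config = {
    "rl_reward_model_path": "./reward-model",  # required: output of the reward-modeling step
    "rl_gamma": 0.99,                          # discount factor
    "rl_gae_lambda": 0.95,                     # GAE lambda
    "rl_kl_coef": 0.1,                         # KL penalty against the reference model
    "rl_clip_range": 0.2,                      # PPO clipping range
    "rl_num_ppo_epochs": 4,                    # PPO epochs per batch
}

# trainer = RLTrainer(model="my-sft-model", method="ppo", **ppo_config)  # hypothetical API
# trainer.train(dataset="prompts.jsonl")                                 # hypothetical API
```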
Requirements
Parameters
Core PPO Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_reward_model_path | --rl-reward-model-path | None | Path to reward model (required) |
| rl_gamma | --rl-gamma | 0.99 | Discount factor (0.9-0.99) |
| rl_gae_lambda | --rl-gae-lambda | 0.95 | GAE lambda for advantage estimation (0.9-0.99) |
| rl_kl_coef | --rl-kl-coef | 0.1 | KL divergence coefficient (0.01-0.5) |
| rl_value_loss_coef | --rl-value-loss-coef | 1.0 | Value loss coefficient (0.5-2.0) |
| rl_clip_range | --rl-clip-range | 0.2 | PPO clipping range (0.1-0.3) |
| rl_value_clip_range | --rl-value-clip-range | 0.2 | Value function clipping range |
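To make the roles of these parameters concrete, here is a minimal, self-contained sketch of standard PPO math (illustrative only, not the library's internals): rl_gamma and rl_gae_lambda drive GAE advantage estimation, while rl_clip_range, rl_kl_coef, and rl_value_loss_coef weight the terms of the per-token loss. Note that some implementations fold the KL penalty into the reward rather than the loss.

```python
# Illustrative PPO math only -- not library code.

def gae_advantages(rewards, values, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one generated response."""
    values = values + [0.0]          # bootstrap value after the final token
    advantages, last_adv = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error (rl_gamma)
        last_adv = delta + gamma * gae_lambda * last_adv        # smoothing (rl_gae_lambda)
        advantages.append(last_adv)
    return list(reversed(advantages))

def ppo_token_loss(ratio, advantage, value_error, kl,
                   clip_range=0.2, kl_coef=0.1, value_loss_coef=1.0):
    """Clipped policy loss + weighted value loss + KL penalty for one token."""
    clipped_ratio = max(min(ratio, 1 + clip_range), 1 - clip_range)  # rl_clip_range
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    return policy_loss + value_loss_coef * value_error ** 2 + kl_coef * kl
```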
Training Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_num_ppo_epochs | --rl-num-ppo-epochs | 4 | PPO epochs per batch |
| rl_chunk_size | --rl-chunk-size | 128 | Training chunk size |
| rl_mini_batch_size | --rl-mini-batch-size | 8 | Mini-batch size |
| rl_optimize_device_cache | --rl-optimize-device-cache | True | Memory optimization |
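As a rough mental model of how these sizes relate (an assumption about the batching scheme, not confirmed library behavior): rollouts are processed in chunks of rl_chunk_size, and each chunk is optimized for rl_num_ppo_epochs passes split into mini-batches of rl_mini_batch_size.

```python
# Illustrative batching loop; the library's exact scheme may differ.
chunk_size, mini_batch_size, num_ppo_epochs = 128, 8, 4

def take_gradient_step(mini_batch):
    pass  # placeholder for one optimizer step on the PPO loss

def ppo_update(chunk):
    for _ in range(num_ppo_epochs):                      # rl_num_ppo_epochs passes per chunk
        for start in range(0, len(chunk), mini_batch_size):
            take_gradient_step(chunk[start:start + mini_batch_size])  # rl_mini_batch_size samples
```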
Generation Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_max_new_tokens | --rl-max-new-tokens | 128 | Max tokens to generate |
| rl_top_k | --rl-top-k | 50 | Top-k sampling |
| rl_top_p | --rl-top-p | 1.0 | Top-p (nucleus) sampling |
| rl_temperature | --rl-temperature | 1.0 | Generation temperature |
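These are standard sampling controls. Assuming the rollout model is a Hugging Face transformers causal LM (an assumption about the backend), the equivalent generate() call looks like this:

```python
# How the generation parameters map onto Hugging Face sampling kwargs
# (assumes a transformers backend; the actual rollout code may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Summarize the following article:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,       # sampling is required for exploration during PPO rollouts
    max_new_tokens=128,   # rl_max_new_tokens
    top_k=50,             # rl_top_k
    top_p=1.0,            # rl_top_p
    temperature=1.0,      # rl_temperature
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```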
Advanced Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_reward_fn | --rl-reward-fn | None | Reward function: default, length_penalty, correctness, custom |
| rl_multi_objective | --rl-multi-objective | False | Enable multi-objective rewards |
| rl_reward_weights | --rl-reward-weights | None | JSON weights for multi-objective |
| rl_env_type | --rl-env-type | None | RL environment type |
| rl_env_config | --rl-env-config | None | JSON environment config |
Data Format
PPO training uses prompts only (the model generates responses):
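For example, a prompts-only JSONL file could be produced like this (the "prompt" field name is an assumption; check the dataset reference for the exact schema):

```python
# Hypothetical prompts-only dataset -- the "prompt" field name is an assumption.
import json

prompts = [
    {"prompt": "Explain reinforcement learning from human feedback in two sentences."},
    {"prompt": "Write a polite reply declining a meeting invitation."},
]

with open("prompts.jsonl", "w") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")
```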
RL Environment Types
Three environment types are available:
| Environment | Description |
|---|---|
| text_generation | Standard text generation with reward scoring |
| multi_objective | Multiple reward components combined |
| preference_comparison | Compare generated responses |
Multi-Objective Rewards
Enable multiple reward signals:
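A sketch of what the weight configuration might look like, passed as JSON through rl_reward_weights (the objective names are placeholders, not a documented schema; only the parameter names come from the tables above):

```python
# Hypothetical multi-objective configuration -- objective names are placeholders.
import json

reward_weights = {"reward_model": 0.8, "length_penalty": 0.2}

multi_objective_config = {
    "rl_multi_objective": True,
    "rl_reward_weights": json.dumps(reward_weights),  # passed as a JSON string
}
print(multi_objective_config["rl_reward_weights"])
```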
Example: Full RLHF Pipeline
Step 1: Train Reward Model
Train a reward model to score responses (see Reward Modeling); its output path is what you pass to --rl-reward-model-path in Step 2.
Step 2: Run PPO Training
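A hedged sketch of the PPO run: the flag names come from the parameter tables above, but the executable itself is not named on this page, so it appears below as a placeholder.

```python
# Assemble the PPO run from documented flags; "<train-command>" is a placeholder
# for the project's actual training entry point, which this page does not name.
cmd = [
    "<train-command>",
    "--rl-reward-model-path", "./reward-model",  # reward model produced in Step 1
    "--rl-kl-coef", "0.1",
    "--rl-clip-range", "0.2",
    "--rl-num-ppo-epochs", "4",
    "--rl-max-new-tokens", "128",
]
print(" ".join(cmd))
```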
Best Practices
- Start with a good base model - Fine-tune with SFT before PPO
- Use a well-trained reward model - Quality of rewards determines PPO success
- Monitor KL divergence - A KL that grows too high means the policy is drifting too far from the original model
- Start with default hyperparameters - Adjust based on training dynamics
- Use small learning rates - PPO is sensitive to the learning rate (roughly 5e-6 to 1e-5)
Next Steps
- Reward Modeling - Train reward models
- DPO Training - Simpler alternative to PPO
- GRPO Training - RL with custom environments
- RL Module - Low-level RL building blocks