LLM Training
The aitraining llm command trains large language models, with support for multiple trainers and techniques.
Quick Start
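A minimal run using only the basic parameters documented below; the model, data path, and output name are placeholders to adjust for your setup:

```bash
# Minimal supervised fine-tune: one epoch with the default SFT trainer.
# Model, data path, and project name are placeholders.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/quickstart \
  --trainer sft \
  --epochs 1 \
  --batch-size 2
```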
Available Trainers
| Trainer | Description |
|---|---|
| default / sft / generic | Supervised fine-tuning |
| dpo | Direct Preference Optimization |
| orpo | Odds Ratio Preference Optimization |
| ppo | Proximal Policy Optimization |
| grpo | Group Relative Policy Optimization (custom environments) |
| reward | Reward model training |
| distillation | Knowledge distillation |
generic is an alias for default. All three (default, sft, generic) produce the same behavior.
Parameter Groups
Parameters are organized into logical groups:
Basic Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Base model to fine-tune | google/gemma-3-270m |
| --data-path | Path to training data | data |
| --project-name | Output directory name | project-name |
| --train-split | Training data split | train |
| --valid-split | Validation data split | None |
Always specify these parameters: While --model, --data-path, and --project-name have defaults, you should always set them explicitly for your use case. The --project-name parameter sets the output folder; use a path like --project-name ./models/my-experiment to control where the trained model is saved.
Training Configuration
| Parameter | Description | Default |
|---|---|---|
| --trainer | Training method | default |
| --epochs | Number of training epochs | 1 |
| --batch-size | Training batch size | 2 |
| --lr | Learning rate | 3e-5 |
| --mixed-precision | fp16/bf16/None | None |
| --gradient-accumulation | Accumulation steps | 4 |
| --warmup-ratio | Warmup ratio | 0.1 |
| --optimizer | Optimizer | adamw_torch |
| --scheduler | LR scheduler | linear |
| --weight-decay | Weight decay | 0.0 |
| --max-grad-norm | Max gradient norm | 1.0 |
| --seed | Random seed | 42 |
Checkpointing & Evaluation
| Parameter | Description | Default |
|---|---|---|
| --eval-strategy | When to evaluate (epoch, steps, no) | epoch |
| --save-strategy | When to save (epoch, steps, no) | epoch |
| --save-steps | Save every N steps (if save-strategy=steps) | 500 |
| --save-total-limit | Max checkpoints to keep | 1 |
| --logging-steps | Log every N steps (-1 for auto) | -1 |
| --resume-from-checkpoint | Resume from checkpoint path, or auto to detect latest | None |
Performance & Memory
| Parameter | Description | Default |
|---|---|---|
| --auto-find-batch-size | Automatically find optimal batch size | False |
| --disable-gradient-checkpointing | Disable memory optimization | False |
| --unsloth | Use Unsloth for faster training (SFT only, llama/mistral/gemma/qwen2) | False |
| --use-sharegpt-mapping | Use Unsloth’s ShareGPT mapping | False |
| --use-flash-attention-2 | Use Flash Attention 2 for faster training | False |
| --attn-implementation | Attention implementation (eager, sdpa, flash_attention_2) | None |
Unsloth Requirements: Unsloth only works with sft/default trainers and specific model architectures (llama, mistral, gemma, qwen2). See Unsloth Integration for details.
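For example, an Unsloth-accelerated SFT run might look like this (a sketch built only from the flags documented above):

```bash
# Unsloth speeds up SFT on supported architectures (llama, mistral, gemma, qwen2).
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/unsloth-sft \
  --trainer sft \
  --unsloth
```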
Backend & Distribution
| Parameter | Description | Default |
|---|---|---|
| --backend | Where to run (local, spaces) | local |
| --distributed-backend | Distribution backend (ddp, deepspeed) | None |
| --ddp-timeout | DDP/NCCL timeout in seconds | 7200 |
Multi-GPU Behavior: With multiple GPUs and --distributed-backend not set, DDP is used automatically. Set --distributed-backend deepspeed for DeepSpeed ZeRO-3 optimization. Training is launched via Accelerate.
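For example, a multi-GPU DeepSpeed run might look like this (a sketch using only the flags documented above):

```bash
# Multi-GPU training with DeepSpeed ZeRO-3, launched via Accelerate.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/deepspeed-run \
  --distributed-backend deepspeed \
  --mixed-precision bf16
```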
PEFT/LoRA Parameters
| Parameter | Description | Default |
|---|---|---|
| --peft | Enable LoRA training | False |
| --lora-r | LoRA rank | 16 |
| --lora-alpha | LoRA alpha | 32 |
| --lora-dropout | LoRA dropout | 0.05 |
| --target-modules | Modules to target | all-linear |
| --quantization | int4/int8 quantization | None |
| --merge-adapter | Merge LoRA after training | True |
Data Processing
| Parameter | Description | Default |
|---|---|---|
| --text-column | Text column name | text |
| --block-size | Max sequence length | -1 (model default) |
| --model-max-length | Maximum model input length | Auto-detect from model |
| --padding | Padding side (left or right) | right |
| --add-eos-token | Append EOS token | True |
| --chat-template | Chat template to use | Auto by trainer |
| --packing | Enable sequence packing (requires flash attention) | None |
| --auto-convert-dataset | Auto-detect and convert dataset format | False |
| --max-samples | Limit dataset size for testing | None |
| --save-processed-data | Save processed data: auto, local, hub, both, none | auto |
Chat Template Auto-Selection: SFT/DPO/ORPO/Reward trainers default to tokenizer (the model’s built-in template). Use --chat-template none for plain-text training.
Processed Data Saving: By default (auto), processed data is saved locally to {project}/data_processed/. If the source dataset was from the Hub, it’s also pushed as a private dataset. Original columns are renamed to _original_* to prevent conflicts.
Training Examples
SFT with LoRA
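A sketch combining the SFT trainer with the PEFT/LoRA flags documented above; the hyperparameter values are illustrative:

```bash
# SFT with LoRA adapters; int4 quantization reduces memory, and the
# adapter is merged back into the base model after training (the default).
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/sft-lora \
  --trainer sft \
  --peft \
  --lora-r 16 \
  --lora-alpha 32 \
  --lora-dropout 0.05 \
  --quantization int4 \
  --epochs 3
```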
DPO Training
For DPO, you must specify the column names for prompt, chosen, and rejected responses:
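A sketch of a DPO run. The three column flag names below are assumptions (they are not listed in the parameter tables above); confirm the exact names via View All Parameters:

```bash
# DPO on preference data with prompt / chosen / rejected columns.
# NOTE: the *-column flag names are assumptions; verify them with the help output.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/dpo-run \
  --trainer dpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected
```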
ORPO Training
ORPO combines SFT and preference optimization:
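A sketch reusing the same (assumed) column flags as the DPO example; because ORPO trains in a single stage, no separate reference model is configured:

```bash
# ORPO: SFT and preference optimization in one pass over preference data.
# NOTE: column flag names are the same assumption as in the DPO example.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/orpo-run \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected
```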
GRPO Training
Train with Group Relative Policy Optimization using your own reward environment. GRPO generates multiple completions per prompt, scores them via your environment (0-1), and optimizes the policy. See GRPO Training for environment interface details.
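A sketch of a GRPO invocation. The --environment flag is a hypothetical name for pointing at your reward environment; the real interface is described on the GRPO Training page:

```bash
# GRPO with a custom reward environment scoring completions in [0, 1].
# NOTE: --environment is a hypothetical flag name used for illustration.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/grpo-run \
  --trainer grpo \
  --environment my_reward_env.py
```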
Knowledge Distillation
Train a smaller model to mimic a larger one:
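A sketch of a distillation run. The --distill-teacher flag is an assumed name for selecting the teacher model (only the three distillation defaults below are documented); the student is the --model being trained:

```bash
# Knowledge distillation: a small student mimics a larger teacher.
# NOTE: --distill-teacher is a hypothetical flag name; the documented knobs
# are --distill-temperature, --distill-alpha, and --distill-max-teacher-length.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/distilled \
  --trainer distillation \
  --distill-teacher google/gemma-3-27b-it \
  --distill-temperature 3.0 \
  --distill-alpha 0.7
```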
Distillation defaults: --distill-temperature 3.0, --distill-alpha 0.7, --distill-max-teacher-length 512.
Logging & Monitoring
Weights & Biases (Default)
W&B logging with the LEET visualizer is enabled by default. The LEET visualizer shows real-time training metrics directly in your terminal.
TensorBoard
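To log to TensorBoard instead, an invocation along these lines should work; the --log flag is an assumption (it does not appear in the parameter tables above), so confirm it via View All Parameters:

```bash
# Switch logging from W&B to TensorBoard.
# NOTE: --log is an assumed flag name; verify it in the parameter listing.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name ./models/tb-run \
  --log tensorboard
```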
Push to Hugging Face Hub
Upload your trained model:
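A sketch using the Hub flags from the table below; the username and token are placeholders:

```bash
# Train, then push the result to the Hub as {username}/{project-name} (private).
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name my-experiment \
  --push-to-hub \
  --username your-hf-username \
  --token "$HF_TOKEN"
```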
The repository is created as private by default and named {username}/{project-name}.
Custom Repository Name or Organization
Use --repo-id to push to a specific repository, as in the sketch below. This is useful for:
- Pushing to an organization instead of your personal account
- Using a different repo name than your local project-name
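For example, to push under an organization (the org and model names are placeholders):

```bash
# Push to an organization repo instead of {username}/{project-name}.
aitraining llm \
  --model google/gemma-3-270m \
  --data-path data \
  --project-name my-experiment \
  --push-to-hub \
  --token "$HF_TOKEN" \
  --repo-id my-org/my-model-name
```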
| Parameter | Description | Default |
|---|---|---|
| --push-to-hub | Enable pushing to Hub | False |
| --hub-private / --no-hub-private | Create repo as private or public | True (private) |
| --username | HF username (for default repo naming) | None |
| --token | HF API token | None |
| --repo-id | Full repo ID (e.g., org/model-name) | {username}/{project-name} |
Advanced Options
Hyperparameter Sweeps
Enhanced Evaluation
View All Parameters
See all parameters for a specific trainer:
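The exact help invocation is an assumption based on common CLI conventions:

```bash
# List every flag for the llm command, including trainer-specific ones (assumed syntax).
aitraining llm --help
```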
Next Steps
- YAML Configs: Use configuration files
- DPO Training: Deep dive into DPO
- LoRA/PEFT: Efficient fine-tuning
- Distillation: Knowledge distillation
- GRPO Training: RL with custom environments