Batch Processing
Run multiple training experiments systematically.
Multiple Configs
Sequential Runs
Run different configs in sequence:
for config in configs/*.yaml; do
  echo "Running $config..."
  aitraining --config "$config"
done
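If one config fails, the shell loop above simply moves on without recording which run broke. Below is a Python sketch of the same sequential pattern that collects failures for a summary at the end; `run_all` and `build_cmd` are hypothetical helper names, not part of the aitraining CLI.

```python
import glob
import subprocess

def run_all(configs, build_cmd=lambda c: ["aitraining", "--config", c]):
    """Run each config in sequence; keep going on failure and
    return the configs whose run exited non-zero."""
    failures = []
    for config in configs:
        print(f"Running {config}...")
        if subprocess.run(build_cmd(config)).returncode != 0:
            failures.append(config)
    return failures

if __name__ == "__main__":
    failed = run_all(sorted(glob.glob("configs/*.yaml")))
    if failed:
        print("Failed:", ", ".join(failed))
```

Injecting `build_cmd` also makes the runner easy to test with a stub command before pointing it at real training jobs.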
Parallel Runs
Run on different GPUs simultaneously:
CUDA_VISIBLE_DEVICES=0 aitraining --config config1.yaml &
CUDA_VISIBLE_DEVICES=1 aitraining --config config2.yaml &
wait
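The same fan-out can be scripted once you have more than a couple of GPUs. The sketch below pins subprocess i to GPU i via CUDA_VISIBLE_DEVICES and reports each run's exit code; `launch_on_gpus` is a hypothetical helper, and the stub commands stand in for `aitraining --config ...` so the example runs without the CLI installed.

```python
import os
import subprocess
import sys

def launch_on_gpus(commands):
    """Start one subprocess per command, pinning command i to GPU i
    via CUDA_VISIBLE_DEVICES, then wait for all and return exit codes."""
    procs = []
    for gpu, cmd in enumerate(commands):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return [p.wait() for p in procs]

if __name__ == "__main__":
    # In practice each entry would be ["aitraining", "--config", "config1.yaml"], etc.
    stub = [sys.executable, "-c", "pass"]
    print("exit codes:", launch_on_gpus([stub, stub]))
```

Unlike a bare `wait`, the returned list tells you which run failed, not just that something did.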
Parameter Sweeps
Manual Sweep
for lr in 1e-5 2e-5 5e-5; do
  for bs in 4 8 16; do
    aitraining llm --train \
      --model google/gemma-3-270m \
      --data-path ./data \
      --project-name "exp-lr${lr}-bs${bs}" \
      --lr $lr \
      --batch-size $bs
  done
done
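For larger grids it can be easier to generate the command list in Python than to nest shell loops; the sketch below mirrors the sweep above with `itertools.product` and prints the commands as a dry run (the grid values are the same as the shell example; uncomment the `subprocess.run` line to actually launch).

```python
import itertools

# Same grid as the shell loop above.
learning_rates = ["1e-5", "2e-5", "5e-5"]
batch_sizes = [4, 8, 16]

commands = [
    ["aitraining", "llm", "--train",
     "--model", "google/gemma-3-270m",
     "--data-path", "./data",
     "--project-name", f"exp-lr{lr}-bs{bs}",
     "--lr", lr,
     "--batch-size", str(bs)]
    for lr, bs in itertools.product(learning_rates, batch_sizes)
]

for cmd in commands:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment (and import subprocess) to launch
```

Adding a third hyperparameter is then one extra list and one extra loop variable, instead of another level of shell nesting.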
Built-in Sweeps
Use the hyperparameter sweep feature:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name sweep-experiment \
  --use-sweep \
  --sweep-backend optuna \
  --sweep-n-trials 20
Experiment Scripts
Basic Script
#!/bin/bash
# experiments.sh
MODELS=(
  "google/gemma-3-270m"
  "google/gemma-2-2b"
)
TRAINERS=(
  "sft"
  "dpo"
)

for model in "${MODELS[@]}"; do
  for trainer in "${TRAINERS[@]}"; do
    name="$(basename "$model")-$trainer"
    aitraining llm --train \
      --model "$model" \
      --data-path ./data \
      --trainer "$trainer" \
      --project-name "$name"
  done
done
With Logging
#!/bin/bash
# run_experiments.sh
LOG_DIR="logs/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$LOG_DIR"
run_experiment() {
  local config=$1
  local name=$(basename "$config" .yaml)
  echo "[$(date)] Starting $name"
  aitraining --config "$config" 2>&1 | tee "$LOG_DIR/$name.log"
  echo "[$(date)] Finished $name"
}

for config in experiments/*.yaml; do
  run_experiment "$config"
done
echo "All experiments complete. Logs in $LOG_DIR"
Job Management
Background Jobs
# Start in background
nohup aitraining --config config.yaml > training.log 2>&1 &
echo $! > training.pid
# Check status
ps -p $(cat training.pid)
# Stop job
kill $(cat training.pid)
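`ps -p` answers the status question interactively; from a monitoring script, sending signal 0 gives the same liveness check programmatically on POSIX systems. `is_running` below is a hypothetical helper, not part of aitraining.

```python
import os

def is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX).
    Signal 0 performs the existence/permission check without
    actually delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

# With the PID file written above:
# pid = int(open("training.pid").read())
# print("training alive:", is_running(pid))
```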
tmux Sessions
# Create session
tmux new-session -d -s training
# Run training
tmux send-keys -t training "aitraining --config config.yaml" Enter
# Attach to see output
tmux attach -t training
# Detach: Ctrl+B, D
Results Collection
Aggregate Metrics
import json
from pathlib import Path
results = []
for exp_dir in Path("experiments").glob("*/"):
    # Training state is saved in trainer_state.json
    state_file = exp_dir / "trainer_state.json"
    if state_file.exists():
        with open(state_file) as f:
            state = json.load(f)
        results.append({
            "experiment": exp_dir.name,
            "best_metric": state.get("best_metric"),
            "global_step": state.get("global_step"),
            "epoch": state.get("epoch"),
        })

# Sort by best_metric (typically eval_loss)
results.sort(key=lambda x: x.get("best_metric") or float("inf"))

# Print best
print("Best experiment:", results[0]["experiment"])
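Beyond printing the winner, a CSV summary makes the whole batch easy to compare in a spreadsheet. The sketch below reuses the row shape from the collection loop above; `write_summary` is a hypothetical helper and the sample rows are synthetic.

```python
import csv

def write_summary(results, path):
    """Write one CSV row per experiment with the collected metrics."""
    fields = ["experiment", "best_metric", "global_step", "epoch"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(results)

# Synthetic rows in the shape produced by the collection loop above:
sample = [
    {"experiment": "exp-a", "best_metric": 0.42, "global_step": 1000, "epoch": 3.0},
    {"experiment": "exp-b", "best_metric": 0.57, "global_step": 1000, "epoch": 3.0},
]
write_summary(sample, "summary.csv")
```

In a real run you would pass the `results` list built above instead of `sample`.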
Compare with W&B
When using --log wandb, all experiments are tracked. Set the W&B project via environment variable:
# Set W&B project for all runs
export WANDB_PROJECT=my-experiments
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name exp-1 \
--log wandb
View comparisons in the W&B dashboard.
Next Steps
Pipeline Automation
Build training pipelines
Logging & Debugging
Monitor and debug training