Distributed Training
AITraining supports multi-GPU training through Accelerate, with optional DeepSpeed ZeRO-3 optimization for large models.
Requirements
| Component | Required | Install |
|---|---|---|
| Accelerate | Yes (included) | Included with AITraining |
| DeepSpeed | Optional | pip install deepspeed |
| Multiple GPUs | Yes | NVIDIA CUDA GPUs |
Distribution Backends
| Backend | Value | Description |
|---|---|---|
| DDP | ddp or None | PyTorch Distributed Data Parallel (default) |
| DeepSpeed | deepspeed | DeepSpeed ZeRO-3 with automatic sharding |
Quick Start
DDP (Default)
With multiple GPUs, DDP is used automatically; no backend flag is required.
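A minimal sketch, assuming a CLI entrypoint named aitraining (the command name and flags here are placeholders, not confirmed by this page):

```bash
# With two or more visible GPUs, DDP is selected with no extra flags.
# "aitraining", --model, and --data are illustrative placeholders.
aitraining --model my-model --data ./train.jsonl
```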
DeepSpeed
For large models, enable DeepSpeed ZeRO-3 explicitly.
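The --distributed-backend deepspeed flag is the switch documented on this page; the rest of the invocation is a placeholder as above:

```bash
# --distributed-backend deepspeed turns on ZeRO-3 sharding.
aitraining --model my-model --data ./train.jsonl --distributed-backend deepspeed
```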
Python API
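A sketch of what the Python entrypoint might look like; the import path, class, and field names are assumptions, and only the distributed_backend values mirror the CLI flag above:

```python
# Hypothetical Python API; names are illustrative, not AITraining's actual surface.
from aitraining import TrainingParams, train

params = TrainingParams(
    model="my-model",
    data_path="./train.jsonl",
    distributed_backend="deepspeed",  # "ddp" or None selects the default
)
train(params)
```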
YAML Configuration
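Likewise, a hypothetical YAML equivalent; the schema is assumed, and only the backend values ddp and deepspeed come from this page:

```yaml
# Hypothetical config; key names are illustrative.
model: my-model
data_path: ./train.jsonl
distributed_backend: deepspeed  # omit or use ddp for the default
```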
How It Works
Accelerate Launch
Training is launched through Accelerate:
- AITraining detects available GPUs
- Launches training via accelerate launch
- For DeepSpeed, adds --use_deepspeed and ZeRO-3 flags
- Logs accelerate env for debugging
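Roughly, the commands AITraining assembles look like the following (illustrative; train.py stands in for the internal entrypoint, and the GPU count is a placeholder):

```bash
# DDP: one process per GPU
accelerate launch --multi_gpu --num_processes 4 train.py

# DeepSpeed ZeRO-3: parameters, gradients, and optimizer state are sharded
accelerate launch --use_deepspeed --zero_stage 3 --num_processes 4 train.py
```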
DDP Settings
When using DDP:
- ddp_find_unused_parameters=False is set for performance
- Each GPU processes a portion of the batch
- Gradients are synchronized across GPUs
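The ddp_find_unused_parameters name matches Hugging Face's TrainingArguments; a minimal sketch of setting it directly, assuming that integration (AITraining sets it for you):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,     # each GPU sees its own slice of the global batch
    ddp_find_unused_parameters=False,  # skip the per-step unused-parameter scan
)
```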
DeepSpeed ZeRO-3
When using DeepSpeed:
- Model parameters are sharded across GPUs
- Uses --deepspeed_multinode_launcher standard for multi-node
- ZeRO-3 configuration is applied automatically
- Model saving uses accelerator.get_state_dict() with unwrapping
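The saving step above corresponds roughly to this Accelerate pattern (a sketch of the general technique, not AITraining's exact code):

```python
from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, dataloader are defined elsewhere
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# Consolidate the full state dict from the ZeRO-3 shards, then save it
# from the unwrapped model on the main process only.
state_dict = accelerator.get_state_dict(model)
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained("output/", state_dict=state_dict)
```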
Multi-Node Training
For multi-node DeepSpeed training, the --deepspeed_multinode_launcher standard flag is passed automatically.
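Under the hood, a two-node launch corresponds to something like the following (illustrative; IPs, ports, and counts are placeholders, and the command runs on every node with its own --machine_rank):

```bash
accelerate launch \
  --use_deepspeed --zero_stage 3 \
  --deepspeed_multinode_launcher standard \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip 10.0.0.1 --main_process_port 29500 \
  --num_processes 16 \
  train.py
```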
Task-Specific Behavior
LLM Training
- Default: DDP when multiple GPUs detected
- DeepSpeed: explicitly set --distributed-backend deepspeed
Seq2Seq and VLM
- Auto-selects DeepSpeed for many-GPU cases
- Uses multi-GPU DDP for PEFT + quantization + bf16 combinations
Checkpointing with DeepSpeed
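With ZeRO-3, weights are sharded across GPUs, so checkpoints are consolidated through accelerator.get_state_dict() before being written; see the saving sketch under DeepSpeed ZeRO-3 above.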
GPU Selection
Control which GPUs are used for training.
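The standard CUDA mechanism applies (assuming AITraining does not override it; the aitraining command is a placeholder as above):

```bash
# Restrict training to GPUs 0 and 2
CUDA_VISIBLE_DEVICES=0,2 aitraining --model my-model --data ./train.jsonl
```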
Troubleshooting
Check Accelerate Environment
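AITraining logs this at launch; you can also run it yourself when debugging:

```bash
# Prints Accelerate's view of your setup: versions, GPUs, default config
accelerate env
```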
Common Issues
| Issue | Solution |
|---|---|
| DeepSpeed not found | pip install deepspeed |
| NCCL errors | Check GPU connectivity and CUDA version |
| OOM errors | Reduce batch size or use DeepSpeed |
| Slow training | Ensure GPUs are on same PCIe bus |
Next Steps
- LoRA/PEFT: efficient fine-tuning
- Quantization: reduce memory usage