Documentation Index
Fetch the complete documentation index at: https://docs.monostate.ai/llms.txt
Use this file to discover all available pages before exploring further.
Quantization
Quantization reduces memory usage by using lower precision for model weights.
Quick Start
Python API
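The snippet below is a minimal sketch of enabling quantization from Python. The import path and field names (model, data_path, peft, quantization) assume an AutoTrain-style LLMTrainingParams; adjust them to your installation.

```python
# Sketch only: import path and field names are assumptions, not verified against this project.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",  # hypothetical base model
    data_path="my-dataset",           # hypothetical dataset path
    peft=True,                        # quantization requires LoRA/PEFT (see Best Practices)
    quantization="int4",              # None, "int8", or "int4" (see Quantization Options)
)
```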
Quantization Options
| Option | Memory Reduction | Quality |
|---|---|---|
| None | 0% | Best |
| int8 | ~50% | Very Good |
| int4 | ~75% | Good |
Supported Tasks
Quantization is available for:
| Task | Params Class | Notes |
|---|---|---|
| LLM | LLMTrainingParams | Full support |
| VLM | VLMTrainingParams | Full support |
| Seq2Seq | Seq2SeqParams | Full support |
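Quantization works the same way across the task params classes above. A minimal Seq2Seq sketch, with the import path and field names as assumptions:

```python
# Sketch only: import path and field names are assumptions.
from autotrain.trainers.seq2seq.params import Seq2SeqParams

params = Seq2SeqParams(
    model="google/flan-t5-base",  # hypothetical base model
    data_path="my-dataset",       # hypothetical dataset path
    peft=True,
    quantization="int8",
)
```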
4-bit (QLoRA)
Maximum memory savings:
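A minimal sketch of a 4-bit configuration, with field names assumed as above:

```python
# Sketch only: field names are assumptions.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",  # hypothetical base model
    data_path="my-dataset",
    peft=True,
    quantization="int4",  # ~75% memory reduction
)
```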
8-bit
Better quality, less savings:
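The same sketch with 8-bit quantization:

```python
# Sketch only: field names are assumptions.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",  # hypothetical base model
    data_path="my-dataset",
    peft=True,
    quantization="int8",  # ~50% memory reduction, better quality than int4
)
```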
Memory Requirements
Llama 3.2 8B
| Config | VRAM Required |
|---|---|
| Full precision | ~64 GB |
| LoRA (fp16) | ~18 GB |
| LoRA + 8bit | ~12 GB |
| LoRA + 4bit | ~8 GB |
Gemma 2 27B
| Config | VRAM Required |
|---|---|
| Full precision | ~108 GB |
| LoRA + 4bit | ~20 GB |
Best Practices
Use with LoRA
Quantization requires PEFT/LoRA to be enabled:
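A minimal sketch, assuming peft and the LoRA fields shown are the relevant parameters:

```python
# Sketch only: field names are assumptions.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",  # hypothetical base model
    data_path="my-dataset",
    peft=True,            # required when quantization is set
    lora_r=16,            # illustrative LoRA rank
    quantization="int4",
)
```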
Adjust Learning Rate
Quantized training often benefits from a higher learning rate than the default (3e-5):
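A sketch with a raised learning rate; the value shown is illustrative, not a recommendation from this guide:

```python
# Sketch only: field names and the learning-rate value are assumptions.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",
    data_path="my-dataset",
    peft=True,
    quantization="int4",
    lr=2e-4,  # higher than the 3e-5 default; tune for your task
)
```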
Use Flash Attention
Combine with Flash Attention for speed:
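A sketch assuming a use_flash_attention_2 flag on the params class (see the Flash Attention page for the exact option name):

```python
# Sketch only: the flash-attention flag name is an assumption.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",
    data_path="my-dataset",
    peft=True,
    quantization="int4",
    use_flash_attention_2=True,
)
```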
Inference with Quantized Models
Load quantized models for inference:
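A minimal sketch using transformers with bitsandbytes directly; the model name is a placeholder, and this may differ from any loading helper the project itself provides:

```python
# Sketch only: loads a fine-tuned model in 4-bit with plain transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-username/your-finetuned-model"  # hypothetical model path
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```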
Platform Requirements
Apple Silicon (MPS) Note
Quantization is not compatible with Apple Silicon MPS. When you use quantization on a Mac with M1/M2/M3:
- Training automatically falls back to CPU
- You’ll see a warning message explaining this
- For faster training on Mac, skip quantization and use LoRA alone
You can override this behavior with environment variables:
- AUTOTRAIN_DISABLE_MPS=1 - Force CPU training
- AUTOTRAIN_ENABLE_MPS=1 - Force MPS even with quantization (may crash)
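A small sketch of setting the override from Python before training starts (exporting the variable in your shell works equally well):

```python
import os

# Set before training starts so the trainer picks it up.
os.environ["AUTOTRAIN_DISABLE_MPS"] = "1"  # force CPU training on Apple Silicon
```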
Quality Considerations
Quantization does reduce quality slightly. For critical applications:
- Test on your specific task
- Compare with full-precision baseline
- Consider 8-bit if quality matters more than memory savings
Next Steps
- LoRA/PEFT: Efficient fine-tuning
- Flash Attention: Speed optimizations