Flash Attention
Flash Attention 2 provides significant speedups for transformer training by optimizing memory access patterns.
Requirements
- Linux with CUDA
- A GPU with compute capability SM 80 or higher (e.g., A100, H100)
- The flash-attn package installed
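The flash-attn project recommends installing PyTorch first and then building without build isolation; a typical install sequence looks like this:

```bash
# Install PyTorch first so flash-attn can compile against it
pip install torch
# Build and install Flash Attention 2 (recommended by the flash-attn project)
pip install flash-attn --no-build-isolation
```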
Quick Start
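As a minimal sketch, the documented flags from the Parameters table below are added to your training command (the command name here is a placeholder for your own entry point):

```bash
# Placeholder command; only the flags are taken from this page
your-training-command --use-flash-attention-2

# Or select an attention implementation explicitly
your-training-command --attn-implementation flash_attention_2
```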
Python API
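The framework's own Python entry points are not shown here; as a sketch of what the use_flash_attention_2 / attn_implementation parameters typically map to when a model is loaded with Hugging Face transformers:

```python
# Sketch assuming the model is loaded via Hugging Face transformers;
# attn_implementation accepts "eager", "sdpa", or "flash_attention_2".
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",               # any supported model from the table below
    torch_dtype=torch.bfloat16,              # flash-attn expects fp16 or bf16 weights
    attn_implementation="flash_attention_2",
)
```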
Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| use_flash_attention_2 | --use-flash-attention-2 | False | Enable Flash Attention 2 |
| attn_implementation | --attn-implementation | None | Override attention: eager, sdpa, flash_attention_2 |
Attention Implementation Options
| Option | Description |
|---|---|
| eager | Standard PyTorch attention (default for some models) |
| sdpa | Scaled Dot Product Attention (PyTorch 2.0+) |
| flash_attention_2 | Flash Attention 2 (fastest, requires flash-attn) |
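For example, on a machine where flash-attn is not installed, sdpa is a reasonable fallback (the command name is a placeholder; the flag is the documented one):

```bash
# Fall back to PyTorch's built-in scaled dot product attention
your-training-command --attn-implementation sdpa
```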
Model Compatibility
Supported Models
| Model Family | Flash Attention 2 | Notes |
|---|---|---|
| Llama | Yes | Full support |
| Mistral | Yes | Full support |
| Qwen | Yes | Full support |
| Phi | Yes | Full support |
| Gemma | No | Uses eager attention |
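If you load models directly with transformers, the Gemma exception can be handled with a small helper; this is an illustrative sketch, not a framework API, and follows the compatibility table above:

```python
# Illustrative helper: Gemma uses eager attention per the table above,
# the other listed families support Flash Attention 2.
import torch
from transformers import AutoModelForCausalLM

def load_with_best_attention(model_id: str):
    impl = "eager" if "gemma" in model_id.lower() else "flash_attention_2"
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    )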
With Quantization
Combine Flash Attention with quantization for maximum efficiency:
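A sketch using the Hugging Face transformers and bitsandbytes APIs directly; the framework-specific quantization flags are covered on the Quantization page, so only standard transformers arguments appear here:

```python
# 4-bit quantization (bitsandbytes) combined with Flash Attention 2
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)
```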
With Sequence Packing
Flash Attention enables efficient sequence packing:
Sequence packing requires Flash Attention to be enabled.
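The framework-specific packing flags are not shown here; as a sketch of the underlying mechanism, recent versions of transformers ship a padding-free collator that concatenates examples into one packed sequence and relies on Flash Attention 2 to keep them separated:

```python
# Sketch: padding-free batching in transformers, which requires Flash Attention 2
import torch
from transformers import AutoModelForCausalLM, DataCollatorWithFlattening

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # required for packed sequences
)
# Concatenates the examples in a batch into one sequence (no padding tokens)
collator = DataCollatorWithFlattening()
```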
Performance Benefits
| Configuration | Memory | Speed |
|---|---|---|
| Standard attention | Baseline | Baseline |
| SDPA | ~15% less | ~20% faster |
| Flash Attention 2 | ~40% less | ~2x faster |
Troubleshooting
Installation Errors
If pip install flash-attn fails:
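Common remedies (these are general flash-attn build fixes, not specific to this framework):

```bash
# flash-attn compiles against your installed torch, so install/upgrade torch first
pip install --upgrade pip torch
# Build without build isolation so the build sees your torch and CUDA toolkit
pip install flash-attn --no-build-isolation
# If compilation runs out of memory, limit the number of parallel build jobs
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```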
Runtime Errors
“Flash Attention is not available”
- Verify flash-attn is installed: python -c "import flash_attn"
- Ensure you’re on Linux with CUDA
- Check GPU compute capability (requires SM 80+, e.g., A100, H100)
- Some models (like Gemma) force eager attention
- Check model documentation for compatibility
Next Steps
- Quantization: Combine with memory optimization
- LoRA/PEFT: Efficient fine-tuning