Knowledge Distillation
Train smaller, faster models that mimic the behavior of larger teacher models.
What is Distillation?
Knowledge distillation transfers knowledge from a large “teacher” model to a smaller “student” model. The student learns to produce similar outputs to the teacher, gaining capabilities beyond what it could learn from data alone.
Quick Start
Python API
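The original quick-start snippet is not reproduced in this extract. As a minimal sketch, the documented options can be collected into a configuration like the one below; only the parameter names and defaults come from the Parameters table that follows, while the teacher model path and how the configuration is passed to the trainer are assumptions.

```python
# Minimal distillation configuration sketch.
# Parameter names and defaults are from the Parameters table below;
# the teacher model path is a placeholder.
distillation_config = {
    "use_distillation": True,
    "teacher_model": "path/to/teacher-model",  # required when use_distillation=True
    "distill_temperature": 3.0,                # softmax temperature, 2.0-4.0 recommended
    "distill_alpha": 0.7,                      # weight on the distillation loss
    "distill_max_teacher_length": 512,         # max tokens produced by the teacher
    "student_prompt_template": "{input}",      # default: pass the prompt through unchanged
}
```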
Parameters
| Parameter | Description | Default |
|---|---|---|
| `use_distillation` | Enable distillation | `False` |
| `teacher_model` | Path to teacher model | Required when `use_distillation=True` |
| `distill_temperature` | Softmax temperature (2.0-4.0 recommended) | `3.0` |
| `distill_alpha` | Distillation loss weight | `0.7` |
| `distill_max_teacher_length` | Max tokens for teacher | `512` |
| `teacher_prompt_template` | Template for teacher prompts | `None` |
| `student_prompt_template` | Template for student prompts | `"{input}"` |
Temperature
Higher temperature makes the teacher’s probability distribution softer, making it easier for the student to learn:
- 1.0: Normal probabilities
- 2.0-4.0: Softer, more teachable (recommended)
- >4.0: Very soft, may lose precision
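The softening effect is easy to see with a plain PyTorch softmax; this illustrates the concept only and is not the library's internal implementation.

```python
import torch
import torch.nn.functional as F

# A teacher's logits for four candidate tokens.
logits = torch.tensor([4.0, 2.0, 1.0, 0.5])

print(F.softmax(logits / 1.0, dim=-1))  # T=1.0: sharply peaked, "normal" probabilities
print(F.softmax(logits / 3.0, dim=-1))  # T=3.0: softer, exposes the relative ranking of other tokens
print(F.softmax(logits / 8.0, dim=-1))  # T>4.0: nearly uniform, little signal left
```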
Alpha
Controls balance between distillation and standard loss:
- 0.0: Only standard loss (no distillation)
- 0.5: Equal balance
- 0.7: Default (more weight on distillation)
- 1.0: Only distillation loss
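Conceptually, the two losses are blended as `alpha * distillation_loss + (1 - alpha) * standard_loss`. A minimal PyTorch sketch of this blend (an illustration of the idea, not the library's internal implementation):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
    # Standard cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between temperature-softened student and teacher distributions.
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # alpha=0.0 -> only standard loss, alpha=1.0 -> only distillation loss.
    return alpha * kd_loss + (1 - alpha) * ce_loss
```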
Prompt Templates
Customize how prompts are formatted for the teacher and student models. Use {input} as the placeholder for the actual prompt text.
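For example, the teacher can be given extra instructions while the student sees only the raw prompt. The template strings below are illustrative; only the parameter names and the `{input}` placeholder are documented.

```python
# Illustrative templates; "{input}" is replaced with the actual prompt text.
teacher_prompt_template = "You are an expert assistant. Answer thoroughly:\n{input}"
student_prompt_template = "{input}"

prompt = "How does pagination work in the API?"
print(teacher_prompt_template.format(input=prompt))
print(student_prompt_template.format(input=prompt))
```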
Data Format
Simple prompts work well for distillation:
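For example, prompt-only records are enough, since the teacher supplies the target behavior. The field name below is an assumption; adjust it to your dataset schema.

```python
# Illustrative prompt-only dataset; the "prompt" field name is an assumption.
dataset = [
    {"prompt": "Explain how API authentication works."},
    {"prompt": "Write an example request that creates a new user."},
    {"prompt": "What does a 429 response mean, and how should clients handle it?"},
]
```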
Best Practices
Choose Models Wisely
- Teacher should be significantly larger (4x+ parameters)
- Same architecture family often works best
- Teacher should be capable at the target task
Temperature Tuning
The recommended temperature range is 2.0-4.0; above 4.0 the distribution becomes very soft and the student may lose precision.
Training Duration
Distillation often benefits from longer training:
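For example, you might extend the schedule relative to a plain fine-tune while keeping the distillation settings fixed. The `epochs` and `learning_rate` names below are illustrative, not documented parameters.

```python
# Illustrative longer-run settings; only the distill_* keys are documented.
long_run_config = {
    "use_distillation": True,
    "teacher_model": "path/to/teacher-model",
    "distill_temperature": 3.0,
    "distill_alpha": 0.7,
    "epochs": 5,             # illustrative: more epochs than a typical plain fine-tune
    "learning_rate": 1e-4,   # illustrative value, not a documented parameter
}
```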
Example: API Assistant
Distill a large model’s API knowledge:
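A sketch of the configuration for this example, reusing the documented parameters; the model paths, dataset, and template text are illustrative assumptions.

```python
# Distill a large general model's API knowledge into a small assistant.
api_assistant_config = {
    "use_distillation": True,
    "teacher_model": "path/to/large-general-model",  # illustrative path
    "distill_temperature": 3.0,
    "distill_alpha": 0.7,
    "distill_max_teacher_length": 512,
    "teacher_prompt_template": "You are an API documentation expert.\n{input}",
    "student_prompt_template": "{input}",
}
```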
Comparison
Without Distillation
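The original comparison snippets are not shown in this extract. As a sketch, a plain fine-tune simply leaves distillation off, so the student learns only from the dataset itself:

```python
# Standard fine-tuning: no teacher involved.
baseline_config = {
    "use_distillation": False,
}
```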
With Distillation
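With distillation enabled, the student additionally learns from the teacher’s softened output distribution:

```python
# Distillation: the student also matches the teacher's softened outputs.
distilled_config = {
    "use_distillation": True,
    "teacher_model": "path/to/teacher-model",
    "distill_temperature": 3.0,
    "distill_alpha": 0.7,
}
```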
Use Cases
- Deployment: Create fast models for production
- Edge devices: Run on mobile/embedded systems
- Cost reduction: Lower inference costs
- Specialization: Focus a large model’s knowledge on a specific domain
Next Steps
- DPO Training: Preference optimization
- LoRA/PEFT: Efficient fine-tuning