Choosing the Right Model
The model you choose dramatically affects training time, quality, and hardware requirements. This guide helps you make the right choice.
Model Size vs Hardware
The golden rule: loading a model in 16-bit precision takes roughly 2x its parameter count in GB of memory. Training takes anywhere from slightly more than that (LoRA) to about 8x more (full fine-tuning, once gradients and optimizer states are included). For a 7B model, that means ~112GB for full training, ~16GB with LoRA, or ~6GB with LoRA + int4 quantization.
Quick Reference
| Your Hardware | Max Model Size | Recommended Models |
|---|---|---|
| MacBook Air M1 (8GB) | 500M - 1B | google/gemma-3-270m |
| MacBook Pro M2 (16GB) | 1B - 3B | google/gemma-2-2b, Llama-3.2-1B |
| MacBook Pro M3 Max (36-64GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| RTX 3060/3070 (8-12GB) | 1B - 3B | gemma-2-2b, Llama-3.2-3B |
| RTX 3090/4090 (24GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| A100 (40-80GB) | 30B - 70B | Llama-3.1-70B with quantization |
Memory Estimation Formula
- Full training (~16 bytes per parameter): 7B × 16 = ~112GB (needs multi-GPU)
- With LoRA (~2 bytes per parameter, plus ~2GB overhead): 7B × 2 + 2GB = ~16GB
- With LoRA + int4 (~0.5 bytes per parameter, plus ~2GB overhead): 7B × 0.5 + 2GB = ~6GB
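These rules of thumb are easy to turn into a quick calculator. A minimal sketch (the bytes-per-parameter values and the ~2GB overhead constant are just the approximations from the list above, not exact measurements):

```python
def estimate_vram_gb(params_billions: float, method: str = "full") -> float:
    """Rough VRAM estimate in GB, using the rules of thumb above."""
    bytes_per_param = {
        "full": 16.0,   # fp16 weights + gradients + Adam optimizer states
        "lora": 2.0,    # frozen fp16 weights, small trainable adapter
        "qlora": 0.5,   # int4-quantized frozen weights
    }[method]
    overhead_gb = 0.0 if method == "full" else 2.0  # adapter + activations
    return params_billions * bytes_per_param + overhead_gb

for method in ("full", "lora", "qlora"):
    print(f"7B with {method}: ~{estimate_vram_gb(7, method):.0f}GB")
# 7B with full: ~112GB, lora: ~16GB, qlora: ~6GB
```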
Base vs Instruction-Tuned Models
This is one of the most important decisions you’ll make.
Base Models (Pretrained)
Examples: google/gemma-2-2b, meta-llama/Llama-3.2-1B
What they are: Trained on raw text to predict the next word. They know language but don’t know how to be helpful.
When to use:
- You have lots of training data (10k+ examples)
- You want full control over the model’s behavior
- You’re training for a specific format (not chat)
- You want to create your own instruction style
Instruction-Tuned Models (IT/Instruct)
Examples: google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct
What they are: Base models that have already been trained to follow instructions and be helpful.
When to use:
- You have limited training data (100-5k examples)
- You want to refine existing helpful behavior
- You’re building a chatbot or assistant
- You want faster results with less data
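The difference also shows up in how you prompt the two kinds of model: a base model simply continues raw text, while an instruction-tuned model expects input wrapped in its chat template. A minimal sketch with Hugging Face transformers (gemma-2-2b-it is gated, so this assumes you have accepted its license and logged in with a Hugging Face token):

```python
from transformers import AutoTokenizer

# Base model: you hand it raw text and it predicts a continuation.
base_prompt = "The capital of France is"

# Instruction-tuned model: wrap the request in the model's chat template.
tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,  # append the marker for the model's reply
)
print(chat_prompt)  # wrapped in <start_of_turn>user ... <end_of_turn> markers
```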
Decision Matrix
| Situation | Use Base | Use Instruction-Tuned |
|---|---|---|
| Less than 1k examples | | ✓ |
| 1k - 10k examples | Depends | ✓ |
| 10k+ examples | ✓ | |
| Chat/assistant use case | | ✓ |
| Custom format (not chat) | ✓ | |
| Domain-specific (medical, legal) | ✓ | ✓ (either works) |
| Code generation | ✓ | |
| Creative writing | ✓ | ✓ (either works) |
Model Families
Google Gemma
Versions: Gemma 2, Gemma 3
| Model | Size | Best For |
|---|---|---|
| google/gemma-3-270m | 270M | Testing, learning, CPU/Apple Silicon |
| google/gemma-2-2b | 2B | Consumer GPUs, good quality/speed balance |
| google/gemma-2-9b | 9B | High quality on good hardware |
| google/gemma-2-27b | 27B | Best Gemma quality, needs serious hardware |
Add the -it suffix for the instruction-tuned versions (e.g., google/gemma-2-2b-it).
Meta Llama
Versions: Llama 3.1, Llama 3.2
| Model | Size | Best For |
|---|---|---|
| meta-llama/Llama-3.2-1B | 1B | Mobile, edge devices |
| meta-llama/Llama-3.2-3B | 3B | Consumer hardware |
| meta-llama/Llama-3.1-8B | 8B | General purpose, excellent quality |
| meta-llama/Llama-3.1-70B | 70B | Production quality, needs cloud GPU |
Mistral
| Model | Size | Best For |
|---|---|---|
| mistralai/Mistral-7B-v0.3 | 7B | Great quality/efficiency ratio |
| mistralai/Mixtral-8x7B | 8x7B | MoE architecture, fast inference |
Qwen (Alibaba)
| Model | Size | Best For |
|---|---|---|
| Qwen/Qwen2.5-0.5B | 500M | Ultra-small, edge devices |
| Qwen/Qwen2.5-3B | 3B | Balanced for consumer hardware |
| Qwen/Qwen2.5-7B | 7B | Excellent multilingual, especially Chinese |
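Every ID in these tables is a Hugging Face Hub identifier, so they all load the same way. A minimal sketch with transformers (Qwen2.5-0.5B is used here because it is small and ungated; gated families like Gemma and Llama additionally require accepting their license and authenticating):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # swap in any ID from the tables above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Fine-tuning works best when", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```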
Searching for Models
In the wizard, you can search for models and sort the results.
Sorting Options
| Option | When to Use |
|---|---|
| Trending | See what’s popular right now |
| Downloads | Most proven/used models |
| Likes | Community favorites |
| Recent | Newest releases |
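These sort orders mirror what the Hugging Face Hub API exposes, so you can preview the same listings outside the wizard. A sketch using huggingface_hub (mapping the wizard's labels onto Hub sort keys is an assumption on my part; "downloads", "likes", and "lastModified" are standard keys):

```python
from huggingface_hub import HfApi

api = HfApi()
# Downloads -> sort="downloads", Likes -> sort="likes",
# Recent -> sort="lastModified"; direction=-1 sorts descending.
for m in api.list_models(search="gemma", sort="downloads", direction=-1, limit=5):
    print(m.id, m.downloads)
```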
Tips for Choosing
Start small, scale up
Always start with a smaller model like gemma-3-270m. Get your pipeline working, verify your dataset is formatted correctly, then scale up to larger models.
Don't chase the biggest model
A well-trained 3B model often beats a poorly-trained 7B model. Focus on data quality first, then scale the model.
Match model to data
If you only have 500 examples, a 270M-1B model is plenty. Using a 7B model will just memorize your data instead of learning patterns.
Consider inference costs
If you’re deploying the model, remember: larger models cost more to run. A 1B model is roughly 7x cheaper to serve than a 7B model.
Try instruction-tuned first
Unless you have 10k+ high-quality examples, start with an instruction-tuned model. You’ll get better results faster.
Validating Your Choice
After you select a model, the wizard checks that it actually exists before continuing.
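If you want to run the same check yourself, huggingface_hub makes it a one-liner. A minimal sketch (the wizard's actual implementation may differ; this is just the obvious Hub call):

```python
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

def model_exists(model_id: str) -> bool:
    """Return True if model_id resolves on the Hugging Face Hub."""
    try:
        model_info(model_id)
        return True
    except RepositoryNotFoundError:
        return False

print(model_exists("google/gemma-2-2b"))       # True
print(model_exists("google/gemma-2-2b-typo"))  # False
```

Next Steps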
- Dataset Guide: Prepare your training data
- LoRA for Large Models: Train bigger models on limited hardware