
Evaluation Metrics

You can’t improve what you don’t measure. Here’s how to tell if your model is actually working.

Classification Metrics

Accuracy

The simplest metric - what percentage did you get right?
Accuracy = Correct Predictions / Total Predictions
Example: 90/100 correct = 90% accuracy.
Problem: Misleading with imbalanced data. If 95% of emails are not spam, a model that always says “not spam” gets 95% accuracy.
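
To see that pitfall concretely, here is a minimal sketch in plain Python (the labels and predictions are made up):
labels      = ["not spam"] * 95 + ["spam"] * 5
predictions = ["not spam"] * 100   # a model that always predicts "not spam"

correct = sum(p == y for p, y in zip(predictions, labels))
print(correct / len(labels))       # 0.95 accuracy, yet zero spam caught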

Precision & Recall

Precision: Of the ones you predicted positive, how many were actually positive?
Recall: Of all the actual positives, how many did you find?
Example for spam detection:
  • Precision: Of emails marked spam, how many were actually spam?
  • Recall: Of all spam emails, how many did you catch?

F1 Score

Combines precision and recall into one number.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use when you care about both false positives and false negatives equally.
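
A minimal sketch computing all three from counts of true positives, false positives, and false negatives (the labels below are made up):
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = spam, 0 = not spam
predictions = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))

precision = tp / (tp + fp)   # of predicted spam, how much was actually spam
recall    = tp / (tp + fn)   # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)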

Generation Metrics

Perplexity

How surprised the model is by the test data. Lower is better.
  • Good model: Perplexity = 10-50
  • Bad model: Perplexity = 100+
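
Perplexity is the exponential of the average cross-entropy loss on the evaluation set, so you can read it straight off a trainer's eval loss. A minimal sketch (the loss value is made up):
import math

eval_loss = 3.2                   # average cross-entropy per token, in nats
print(math.exp(eval_loss))        # perplexity ≈ 24.5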

BLEU Score

Compares generated text to reference text. Used for translation and summarization.
  • BLEU = 0: No overlap
  • BLEU = 1: Perfect match
  • BLEU > 0.3: Usually decent
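
A minimal sketch using the Hugging Face evaluate package (an assumption; any BLEU implementation works, and this one reports scores on the 0-1 scale used above):
import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is sitting on the mat"]],
)
print(result["bleu"])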

Human Evaluation

Sometimes the best metric is asking humans:
  • Is this response helpful?
  • Does this summary capture the main points?
  • Is this translation natural?

Loss Curves

Training Loss vs Validation Loss

Watch both during training:
Good pattern:
  • Both decrease
  • Stay close together
  • Plateau eventually
Overfitting:
  • Training loss keeps dropping
  • Validation loss increases
  • Gap widens
Underfitting:
  • Both stay high
  • Little improvement
  • Need more capacity or data
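
A minimal sketch of spotting the overfitting pattern programmatically (the loss histories are made up):
train_loss = [2.1, 1.4, 0.9, 0.6, 0.4, 0.3]
val_loss   = [2.2, 1.5, 1.1, 1.0, 1.2, 1.5]   # bottoms out, then rises

best = min(range(len(val_loss)), key=val_loss.__getitem__)
if val_loss[-1] > val_loss[best]:
    print(f"Validation loss bottomed out at epoch {best}; likely overfitting after that.")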

Task-Specific Metrics

Image Classification

  • Top-1 Accuracy: Correct class is the top prediction
  • Top-5 Accuracy: Correct class in top 5 predictions
  • Confusion Matrix: See which classes get confused
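
A minimal sketch of top-1 vs top-5 accuracy with NumPy (the logits and labels are random placeholders):
import numpy as np

logits = np.random.rand(8, 1000)          # 8 images, 1000 classes
labels = np.random.randint(0, 1000, 8)    # true class per image

top5 = np.argsort(-logits, axis=1)[:, :5]
top1_acc = float(np.mean(top5[:, 0] == labels))
top5_acc = float(np.mean([label in row for row, label in zip(top5, labels)]))
print(top1_acc, top5_acc)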

Object Detection

  • mAP (mean Average Precision): Overall detection quality
  • IoU (Intersection over Union): How well boxes overlap
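
A minimal sketch of IoU for two axis-aligned boxes in (x1, y1, x2, y2) format:
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14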

NER/Token Classification

  • Entity-level F1: Complete entities correct
  • Token-level accuracy: Individual tokens correct
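
A minimal sketch of entity-level F1 using the seqeval package (an assumption; it scores complete BIO-tagged entities rather than individual tokens):
from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(y_true, y_pred))   # only one of two entities found exactly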

Quick Reference

Task                       | Primary Metric | Good Score
Binary Classification     | F1 Score       | > 0.8
Multi-class Classification | Accuracy      | > 0.9
Generation                 | Perplexity    | < 50
Translation                | BLEU          | > 0.3
Summarization              | ROUGE         | > 0.4
Q&A                        | Exact Match   | > 0.7

Enhanced Evaluation in AITraining

AITraining supports enhanced evaluation with multiple built-in and custom metrics.

Enable Enhanced Evaluation

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data.jsonl \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,bleu"

Available Metrics

Metric     | Description
perplexity | Model uncertainty (lower is better)
bleu       | N-gram overlap with reference
rouge      | Recall-Oriented Understudy for Gisting Evaluation
accuracy   | Classification accuracy
f1         | F1 score for classification

Python API

from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./data.jsonl",
    project_name="my-model",

    use_enhanced_eval=True,
    eval_metrics=["perplexity", "bleu"],
)

Custom Metrics

Register custom metrics for specialized evaluation:
from autotrain.metrics import register_metric

@register_metric("my_custom_metric")
def compute_custom_metric(predictions, references):
    # Your custom scoring logic
    score = ...
    return {"my_custom_metric": score}

# Then use it in training
params = LLMTrainingParams(
    ...
    use_enhanced_eval=True,
    eval_metrics=["perplexity", "my_custom_metric"],
)
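
For instance, a hypothetical exact-match metric (the name and logic below are illustrative, not built into AITraining):
from autotrain.metrics import register_metric

@register_metric("exact_match")
def compute_exact_match(predictions, references):
    # Fraction of predictions that match their reference exactly
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return {"exact_match": matches / len(references)}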

Practical Tips

  1. Always use a validation set - Never evaluate on training data
  2. Consider the task - Accuracy isn’t always best
  3. Watch trends - Improving is more important than absolute numbers
  4. Multiple metrics - No single metric tells the whole story

Red Flags

  • Training accuracy 100%, validation 60% → Overfitting
  • All metrics stuck → Learning rate might be wrong
  • Metrics jumping around → Batch size too small
  • Perfect scores immediately → Data leak or bug

Rethinking AI Evaluation

Traditional benchmarks may not capture true intelligence. Our research explores new approaches to evaluating AI reasoning.

The Child Benchmark: A New Way to Test AGI

Why we should evaluate AI like we evaluate children’s development

Next Steps

Fine-tuning vs Full Training

Choose your approach

Hyperparameters

Optimize your settings