
Evaluation Metrics

You can’t improve what you don’t measure. Here’s how to tell if your model is actually working.

Classification Metrics

Accuracy

The simplest metric - what percentage did you get right?
Accuracy = Correct Predictions / Total Predictions
Example: 90/100 correct = 90% accuracy.
Problem: Misleading with imbalanced data. If 95% of emails are not spam, a model that always says “not spam” gets 95% accuracy.
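
To see that pitfall concretely, here is a minimal sketch in plain Python (the labels and predictions are made up):
labels      = ["not spam"] * 95 + ["spam"] * 5
predictions = ["not spam"] * 100   # a model that always predicts "not spam"

correct = sum(p == y for p, y in zip(predictions, labels))
print(correct / len(labels))       # 0.95 accuracy, yet zero spam caught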

Precision & Recall

Precision: Of the ones you predicted positive, how many were actually positive?
Recall: Of all the actual positives, how many did you find?
Example for spam detection:
  • Precision: Of emails marked spam, how many were actually spam?
  • Recall: Of all spam emails, how many did you catch?

F1 Score

Combines precision and recall into one number.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use when you care about both false positives and false negatives equally.
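
A minimal sketch computing all three from counts of true positives, false positives, and false negatives (the labels below are made up):
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = spam, 0 = not spam
predictions = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))

precision = tp / (tp + fp)   # of predicted spam, how much was actually spam
recall    = tp / (tp + fn)   # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)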

Generation Metrics

Perplexity

How surprised the model is by the test data. Lower is better.
  • Good model: Perplexity = 10-50
  • Bad model: Perplexity = 100+
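
Perplexity is the exponential of the average cross-entropy loss on the evaluation set, so you can read it straight off a trainer's eval loss. A minimal sketch (the loss value is made up):
import math

eval_loss = 3.2                   # average cross-entropy per token, in nats
print(math.exp(eval_loss))        # perplexity ≈ 24.5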

BLEU Score

Compares generated text to reference text. Used for translation and summarization.
  • BLEU = 0: No overlap
  • BLEU = 1: Perfect match
  • BLEU > 0.3: Usually decent
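
A minimal sketch using the Hugging Face evaluate package (an assumption; any BLEU implementation works, and this one reports scores on the 0-1 scale used above):
import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is sitting on the mat"]],
)
print(result["bleu"])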

Human Evaluation

Sometimes the best metric is asking humans:
  • Is this response helpful?
  • Does this summary capture the main points?
  • Is this translation natural?

Loss Curves

Training Loss vs Validation Loss

Watch both during training:
Good pattern:
  • Both decrease
  • Stay close together
  • Plateau eventually
Overfitting:
  • Training loss keeps dropping
  • Validation loss increases
  • Gap widens
Underfitting:
  • Both stay high
  • Little improvement
  • Need more capacity or data
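
A minimal sketch of spotting the overfitting pattern programmatically (the loss histories are made up):
train_loss = [2.1, 1.4, 0.9, 0.6, 0.4, 0.3]
val_loss   = [2.2, 1.5, 1.1, 1.0, 1.2, 1.5]   # bottoms out, then rises

best = min(range(len(val_loss)), key=val_loss.__getitem__)
if val_loss[-1] > val_loss[best]:
    print(f"Validation loss bottomed out at epoch {best}; likely overfitting after that.")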

Task-Specific Metrics

Image Classification

  • Top-1 Accuracy: Correct class is the top prediction
  • Top-5 Accuracy: Correct class in top 5 predictions
  • Confusion Matrix: See which classes get confused
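
A minimal sketch of top-1 vs top-5 accuracy with NumPy (the logits and labels are random placeholders):
import numpy as np

logits = np.random.rand(8, 1000)          # 8 images, 1000 classes
labels = np.random.randint(0, 1000, 8)    # true class per image

top5 = np.argsort(-logits, axis=1)[:, :5]
top1_acc = float(np.mean(top5[:, 0] == labels))
top5_acc = float(np.mean([label in row for row, label in zip(top5, labels)]))
print(top1_acc, top5_acc)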

Object Detection

  • mAP (mean Average Precision): Overall detection quality
  • IoU (Intersection over Union): How well boxes overlap
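
A minimal sketch of IoU for two axis-aligned boxes in (x1, y1, x2, y2) format:
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14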

NER/Token Classification

  • Entity-level F1: Complete entities correct
  • Token-level accuracy: Individual tokens correct
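
A minimal sketch of entity-level F1 using the seqeval package (an assumption; it scores complete BIO-tagged entities rather than individual tokens):
from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(y_true, y_pred))   # only one of two entities found exactly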

Quick Reference

Task                       | Primary Metric | Good Score
Binary Classification     | F1 Score       | > 0.8
Multi-class Classification | Accuracy      | > 0.9
Generation                 | Perplexity    | < 50
Translation                | BLEU          | > 0.3
Summarization              | ROUGE         | > 0.4
Q&A                        | Exact Match   | > 0.7

Enhanced Evaluation in AITraining

AITraining supports enhanced evaluation with multiple built-in and custom metrics.

Enable Enhanced Evaluation

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data.jsonl \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,bleu"

Available Metrics

Metric     | Description
perplexity | Model uncertainty (lower is better)
bleu       | N-gram overlap with reference
rouge      | Recall-Oriented Understudy for Gisting Evaluation
accuracy   | Classification accuracy
f1         | F1 score for classification

Python API

from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./data.jsonl",
    project_name="my-model",

    use_enhanced_eval=True,
    eval_metrics=["perplexity", "bleu"],
)

Custom Metrics

Register custom metrics for specialized evaluation:
from autotrain.metrics import register_metric

@register_metric("my_custom_metric")
def compute_custom_metric(predictions, references):
    # Your custom scoring logic
    score = ...
    return {"my_custom_metric": score}

# Then use it in training
params = LLMTrainingParams(
    ...
    use_enhanced_eval=True,
    eval_metrics=["perplexity", "my_custom_metric"],
)
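
For instance, a hypothetical exact-match metric (the name and logic below are illustrative, not built into AITraining):
from autotrain.metrics import register_metric

@register_metric("exact_match")
def compute_exact_match(predictions, references):
    # Fraction of predictions that match their reference exactly
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return {"exact_match": matches / len(references)}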

Practical Tips

  1. Always use a validation set - Never evaluate on training data
  2. Consider the task - Accuracy isn’t always best
  3. Watch trends - Improving is more important than absolute numbers
  4. Multiple metrics - No single metric tells the whole story

Red Flags

  • Training accuracy 100%, validation 60% → Overfitting
  • All metrics stuck → Learning rate might be wrong
  • Metrics jumping around → Batch size too small
  • Perfect scores immediately → Data leak or bug

Rethinking AI Evaluation

Traditional benchmarks may not capture true intelligence. Our research explores new approaches to evaluating AI reasoning.

The Child Benchmark: A New Way to Test AGI

Why we should evaluate AI like we evaluate children’s development

Next Steps

Fine-tuning vs Full Training

Choose your approach

Hyperparameters

Optimize your settings