TL;DR

Different AI tasks need different metrics. Classification uses accuracy, precision, recall, and F1; generation uses perplexity, BLEU, ROUGE, and human eval; retrieval uses precision@K, MAP, and NDCG. Choose metrics that match your business goals.

Classification metrics

Accuracy:

  • % of correct predictions
  • Good for: Balanced datasets
  • Misleading for: Imbalanced data

Precision:

  • Of positive predictions, % actually positive
  • Good for: Minimizing false positives (spam detection)

Recall (Sensitivity):

  • Of actual positives, % correctly identified
  • Good for: Catching all cases (disease detection)

F1 Score:

  • Harmonic mean of precision and recall
  • Good for: Imbalanced data
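
A minimal sketch of how these four metrics relate, using a hypothetical set of binary labels (no libraries required; scikit-learn's precision_recall_fscore_support does the same in practice):

  # Count the confusion-matrix cells from true vs. predicted labels
  y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model output

  tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
  fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
  fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
  tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

  accuracy  = (tp + tn) / len(y_true)                  # 0.75
  precision = tp / (tp + fp)                           # 0.75
  recall    = tp / (tp + fn)                           # 0.75
  f1 = 2 * precision * recall / (precision + recall)   # 0.75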

Language model metrics

Perplexity:

  • How "surprised" the model is by test data (exponential of the average cross-entropy)
  • Lower = better
  • Measures language modeling quality
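
A minimal sketch of the definition: perplexity is the exponential of the average negative log-likelihood the model assigns to the test tokens (the probabilities below are made up for illustration):

  import math

  # Hypothetical p(token | context) values the model assigned
  token_probs = [0.25, 0.10, 0.50, 0.05]

  avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
  perplexity = math.exp(avg_nll)   # ~6.3 here; lower = better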

BLEU:

  • N-gram overlap between generated text and reference translations
  • Used for: Translation quality
  • 0-1 scale, higher = better (often reported scaled to 0-100)
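
A minimal sketch using NLTK's sentence-level BLEU (assumes nltk is installed; smoothing keeps short sentences from scoring zero when a higher-order n-gram is missing):

  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

  reference = "the cat sat on the mat".split()
  candidate = "the cat is on the mat".split()

  score = sentence_bleu(
      [reference],   # BLEU accepts multiple references per sentence
      candidate,
      smoothing_function=SmoothingFunction().method1,
  )
  print(f"BLEU: {score:.3f}")   # 0-1, higher = better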

ROUGE:

  • N-gram and subsequence overlap between generated and reference text
  • Used for: Summarization
  • Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
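
A minimal sketch using the rouge-score package (pip install rouge-score); the reference and generated summaries are hypothetical:

  from rouge_score import rouge_scorer

  scorer = rouge_scorer.RougeScorer(
      ["rouge1", "rouge2", "rougeL"], use_stemmer=True
  )
  scores = scorer.score(
      "the cat sat on the mat",         # reference summary
      "a cat was sitting on the mat",   # generated summary
  )
  # Each variant reports precision, recall, and F-measure
  print(scores["rougeL"].fmeasure)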

Generation quality

Human evaluation:

  • People rate outputs
  • Most reliable but expensive
  • Criteria: Fluency, relevance, coherence

Coherence scores:

  • Automated semantic-similarity scoring
  • Approximates logical flow between sentences
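
One common way to approximate this, sketched here under the assumption that sentence-transformers is installed (all-MiniLM-L6-v2 is a published checkpoint), is to average the cosine similarity of adjacent sentence embeddings:

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")
  sentences = [
      "The team shipped the new model.",
      "Latency dropped by thirty percent.",
      "Users noticed the faster responses.",
  ]
  # Normalized embeddings make the dot product a cosine similarity
  emb = model.encode(sentences, normalize_embeddings=True)
  coherence = np.mean([emb[i] @ emb[i + 1] for i in range(len(emb) - 1)])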

Diversity:

  • How varied outputs are
  • Low diversity signals repetitive generation
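
A minimal sketch of distinct-n, one widely used diversity metric: the fraction of n-grams across all outputs that are unique (the sample outputs are hypothetical):

  def distinct_n(texts, n=2):
      ngrams = [
          tuple(tokens[i:i + n])
          for tokens in (t.split() for t in texts)
          for i in range(len(tokens) - n + 1)
      ]
      return len(set(ngrams)) / max(len(ngrams), 1)

  outputs = ["the movie was great", "the movie was fine", "loved the movie"]
  print(distinct_n(outputs, n=2))   # 0.625 here; closer to 1 = more varied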

Retrieval metrics

Precision@K:

  • Of top K results, % relevant

Recall@K:

  • Of all relevant items, % in top K
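
A minimal sketch of both cutoff metrics for a single query (the document IDs and relevance set are hypothetical):

  def precision_recall_at_k(ranked, relevant, k):
      hits = sum(1 for doc in ranked[:k] if doc in relevant)
      return hits / k, hits / len(relevant)

  ranked = ["d3", "d1", "d7", "d2", "d9"]   # system's ranked results
  relevant = {"d1", "d2", "d5"}             # ground-truth relevant docs
  p_at_3, r_at_3 = precision_recall_at_k(ranked, relevant, k=3)
  # p_at_3 = 1/3 (one of the top 3 is relevant)
  # r_at_3 = 1/3 (one of the 3 relevant docs appears in the top 3)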

MAP (Mean Average Precision):

  • Mean of per-query average precision (precision taken at each relevant result's rank)
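
A minimal sketch: compute average precision for each query, then take the mean across queries (the query data is hypothetical):

  def average_precision(ranked, relevant):
      hits, total = 0, 0.0
      for rank, doc in enumerate(ranked, start=1):
          if doc in relevant:
              hits += 1
              total += hits / rank   # precision at this relevant rank
      return total / len(relevant)

  queries = [
      (["d1", "d3", "d2"], {"d1", "d2"}),   # AP ≈ 0.83
      (["d4", "d5"], {"d5"}),               # AP = 0.50
  ]
  mean_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)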

NDCG (Normalized Discounted Cumulative Gain):

  • Considers ranking order
  • Higher-ranked relevant items weighted more
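
A minimal sketch of the standard formula: discounted cumulative gain over the actual ranking, normalized by the DCG of the ideal ordering (the graded relevance labels are hypothetical):

  import math

  def dcg(relevances):
      # Each result's gain is discounted by the log of its rank
      return sum(rel / math.log2(rank + 2)
                 for rank, rel in enumerate(relevances))

  ranking = [3, 2, 0, 1]   # relevance grades, in the order returned
  ideal = sorted(ranking, reverse=True)
  ndcg = dcg(ranking) / dcg(ideal)   # ~0.99 here; 1.0 = perfect ordering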

Choosing the right metric

For classification:

  • Balanced data → Accuracy
  • Imbalanced → F1, Precision, Recall
  • High-stakes → Weigh the costs of false positives vs. false negatives

For generation:

  • Translation → BLEU
  • Summarization → ROUGE
  • Open-ended → Human eval

For retrieval:

  • Search → NDCG
  • Recommendations → Precision@K

Beyond metrics

Common pitfalls:

  • Optimizing for a metric that doesn't match the goal
  • Ignoring imbalanced data
  • Relying on a single metric
  • Not testing on diverse data
  • Overfitting to test set

What's next

  • A/B Testing AI
  • Model Monitoring
  • Quality Assurance for AI