AI Evaluation Metrics: Measuring Model Quality
How do you know if your AI is good? Learn key metrics for evaluating classification, generation, and other AI tasks.
TL;DR
Different AI tasks need different metrics. Classification uses accuracy, precision, recall, and F1. Generation uses perplexity, BLEU, and human eval. Choose metrics that match your business goals.
Classification metrics
Accuracy:
- % of correct predictions
- Good for: Balanced datasets
- Misleading for: Imbalanced data
Precision:
- Of positive predictions, % actually positive
- Good for: Minimizing false positives (spam detection)
Recall (Sensitivity):
- Of actual positives, % correctly identified
- Good for: Catching all cases (disease detection)
F1 Score:
- Balance of precision and recall
- Good for: Imbalanced data
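A minimal sketch of these four metrics, computed from raw predictions (the labels below are made up for illustration). It shows why accuracy misleads on imbalanced data: a model that predicts "negative" for everything still scores 90% accuracy, while F1 drops to zero.

```python
# Sketch: classification metrics from true labels and predictions.
def classification_metrics(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Imbalanced toy data: 9 negatives, 1 positive; model predicts all negative.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc, f1)  # accuracy is 0.9, but F1 is 0.0 -- accuracy misleads here
```

In practice you would use a library such as scikit-learn rather than hand-rolling these, but the arithmetic is exactly this simple.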
Language model metrics
Perplexity:
- How "surprised" the model is by held-out text
- Lower = better
- Measures language modeling quality
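Perplexity is the exponential of the average negative log-likelihood the model assigns to the correct next token. A small sketch with made-up per-token probabilities shows the "lower = better" behavior:

```python
import math

# Perplexity from the probabilities a model assigned to each correct
# next token (probabilities here are invented for illustration).
def perplexity(token_probs):
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.95, 0.85]  # model rarely surprised
uncertain = [0.2, 0.1, 0.30, 0.15]  # model often surprised
print(perplexity(confident) < perplexity(uncertain))  # True: lower is better
```

A model that assigned probability 1.0 to every token would reach the floor of perplexity 1.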
BLEU:
- Compares generated text to reference
- Used for: Translation quality
- 0-1 scale (higher = better)
ROUGE:
- Overlap between generated and reference
- Used for: Summarization
- Variants: ROUGE-1, ROUGE-2, ROUGE-L
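Both BLEU and ROUGE boil down to n-gram overlap with a reference. Here is a minimal sketch of ROUGE-1 recall (unigram overlap); real implementations add details such as stemming, multiple references, and F-measure variants:

```python
from collections import Counter

# Sketch of ROUGE-1 recall: what fraction of the reference's unigrams
# appear in the generated text (clipped by count).
def rouge1_recall(generated, reference):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(gen[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print(rouge1_recall(generated, reference))  # 5 of 6 reference unigrams matched
```

ROUGE-2 repeats the same computation over bigrams; ROUGE-L uses the longest common subsequence instead of fixed n-grams.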
Generation quality
Human evaluation:
- People rate outputs
- Most reliable but expensive
- Criteria: Fluency, relevance, coherence
Coherence scores:
- Automated semantic similarity
- Measures logical flow
Diversity:
- How varied outputs are
- Avoids repetitive generation
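One common automatic proxy for diversity is distinct-n: the ratio of unique n-grams to total n-grams across a batch of outputs. A sketch (the sample outputs are invented):

```python
# Distinct-n diversity sketch: unique n-grams / total n-grams
# across a set of generated outputs.
def distinct_n(outputs, n=2):
    ngrams = []
    for text in outputs:
        toks = text.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran", "birds fly south"]
print(distinct_n(repetitive) < distinct_n(varied))  # True: varied scores higher
```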
Retrieval metrics
Precision@K:
- Of top K results, % relevant
Recall@K:
- Of all relevant items, % in top K
MAP (Mean Average Precision):
- Average precision across queries
NDCG:
- Considers ranking order
- Higher-ranked relevant items weighted more
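The retrieval metrics above can be sketched in a few lines. Here `ranked` is a list of relevance labels (1 = relevant, 0 = not) in the order the system returned results; the two example rankings are invented to show that NDCG rewards putting relevant items first while Precision@K ignores order within the top K:

```python
import math

def precision_at_k(ranked, k):
    # Fraction of the top-k results that are relevant.
    return sum(ranked[:k]) / k

def dcg_at_k(ranked, k):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked[:k]))

def ndcg_at_k(ranked, k):
    # Normalize by the DCG of the ideal (perfectly sorted) ranking.
    idcg = dcg_at_k(sorted(ranked, reverse=True), k)
    return dcg_at_k(ranked, k) / idcg if idcg else 0.0

good = [1, 1, 0, 1, 0]  # relevant items near the top
bad = [0, 0, 1, 1, 1]   # same items, worse ordering
print(precision_at_k(good, 3), precision_at_k(bad, 3))
print(ndcg_at_k(good, 5) > ndcg_at_k(bad, 5))  # True: NDCG rewards order
```

MAP extends this by averaging precision at each relevant position, then averaging again across queries.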
Choosing the right metric
For classification:
- Balanced data → Accuracy
- Imbalanced → F1, Precision, Recall
- High-stakes → Weigh error types by their cost (a missed disease is worse than a false alarm)
For generation:
- Translation → BLEU
- Summarization → ROUGE
- Open-ended → Human eval
For retrieval:
- Search → NDCG
- Recommendations → Precision@K
Beyond metrics
- Speed/latency
- Cost per prediction
- Model size
- User satisfaction
- Business impact (revenue, conversions)
Common pitfalls
- Optimizing for metric that doesn't match goal
- Ignoring imbalanced data
- Relying on single metric
- Not testing on diverse data
- Overfitting to test set
Key Terms Used in This Guide
Model
The trained AI system that contains all the patterns it learned from data. Think of it as the 'brain' that makes predictions or decisions.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Evaluation (Evals)
Systematically testing an AI system to measure how well it performs on specific tasks or criteria.
Related Guides
AI Workflows and Pipelines: Orchestrating Complex Tasks
Intermediate: Chain multiple AI steps together into workflows. Learn orchestration patterns, error handling, and tools for building AI pipelines.
Fine-Tuning Fundamentals: Customizing AI Models
Intermediate: Fine-tuning adapts pre-trained models to your specific use case. Learn when to fine-tune, how it works, and alternatives.
Retrieval Strategies for RAG Systems
Intermediate: RAG systems retrieve relevant context before generating responses. Learn retrieval strategies, ranking, and optimization techniques.