Evaluations 201: Golden Sets, Rubrics, and Automated Eval
Build rigorous evaluation systems for AI. Create golden datasets, define rubrics, automate testing, and measure improvements.
TL;DR
Evaluations are automated tests that measure AI quality. Build a golden dataset (test cases with expected outputs), define rubrics (scoring criteria), and run them continuously. Use LLM-as-judge for scalable evaluation, regression tests to catch failures, and custom metrics for domain-specific quality. Good evals are the difference between shipping confidently and crossing your fingers.
Why rigorous evaluation matters
You've deployed your AI feature. It works great in demos. Then production traffic hits and users report nonsensical answers, missed information, or inappropriate tone. What happened?
Manual spot-checks don't scale. You can't test every edge case by hand. You need systematic evaluation: automated tests that run on every code change, catching regressions before they reach users.
Without evals, you're flying blind. With them, you can iterate confidently, knowing exactly when you've improved or broken something.
Building golden datasets
A golden dataset is your test suite for AI: input examples paired with expected outputs or quality scores.
What makes a good golden dataset?
1. Representative: Covers real user scenarios, not just easy cases
2. Diverse: Edge cases, different input types, varying difficulty
3. Labeled: Every example has a ground truth (correct answer, quality score, or both)
4. Versioned: Track changes over time
Sourcing examples
Start with production data. Sample real user queries (remove PII first). These reflect actual usage patterns you need to handle.
Add adversarial cases. Jailbreak attempts, gibberish, hostile input. Test safety guardrails.
Include edge cases. Corner cases that failed before, ambiguous queries, multi-step reasoning.
Balance difficulty. Mix easy wins (to catch catastrophic failures) with hard problems (to measure improvement).
Aim for 50-200 examples as a minimum. Too few and you miss patterns; thousands are better, but start small.
Labeling strategies
Human labeling: Gold standard but expensive. Use for critical evaluations (safety, legal compliance).
AI-assisted labeling: Have an LLM propose labels, humans review. Faster and cheaper (see the sketch after this list).
Synthetic data: Generate examples programmatically. Good for edge cases (dates, numbers, formats) but lacks real-world messiness.
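As a rough sketch of the AI-assisted approach, the snippet below asks a model to draft a label that a human then accepts or overrides. The OpenAI client, model name, and label set are assumptions; swap in whatever provider and taxonomy you actually use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_label(text: str) -> str:
    """Ask the model for a draft label; a human reviews it afterwards."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, not a recommendation
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Label this support query as one of: password_reset, billing, shipping, other.\n\n"
                f"Query: {text}\n\nAnswer with the label only."
            ),
        }],
    )
    return response.choices[0].message.content.strip()

def label_with_review(text: str) -> str:
    # Human-in-the-loop: pressing Enter accepts the draft, typing overrides it.
    draft = propose_label(text)
    decision = input(f"{text!r} -> proposed: {draft}. Enter to accept, or type a correction: ")
    return decision.strip() or draft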
Example: Customer support bot
{
"id": "001",
"input": "How do I reset my password?",
"expected_topics": ["password_reset", "authentication"],
"expected_sentiment": "helpful",
"must_include": ["password reset link", "email"],
"must_not_include": ["billing", "refund"],
"quality_score": 5
}
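A minimal checker for cases shaped like this might look as follows. The field names match the example above; treating any missing or forbidden term as a failure is an assumption you can relax.
def check_case(output: str, case: dict) -> dict:
    """Score one model response against a golden case's must_include / must_not_include lists."""
    text = output.lower()
    missing = [t for t in case.get("must_include", []) if t.lower() not in text]
    forbidden = [t for t in case.get("must_not_include", []) if t.lower() in text]
    return {
        "id": case["id"],
        "missing": missing,        # required terms the response left out
        "forbidden": forbidden,    # off-topic terms the response should not contain
        "passed": not missing and not forbidden,
    }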
Versioning
Track dataset versions like code:
- v1.0: Initial 50 examples
- v1.1: Added 20 edge cases from production failures
- v2.0: Relabeled with new rubric
Use git or a dataset versioning tool (DVC, Weights & Biases datasets). Never evaluate on old versions without documenting it.
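One lightweight way to document which dataset a run used is to store a content hash alongside the results. This sketch uses only the standard library and assumes the golden set lives in a single JSON file; DVC or W&B give you the same traceability with more features.
import hashlib
import json

def dataset_fingerprint(path: str) -> dict:
    """Return a version stamp to record with every eval run."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"dataset_path": path, "sha256": digest[:12]}

# Save this next to the scores so every result is traceable to a dataset version.
print(json.dumps(dataset_fingerprint("datasets/golden_v1.json")))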
Defining evaluation rubrics
A rubric is your scoring criteria: how you judge quality.
Common rubric dimensions
Accuracy: Is the answer factually correct?
- Binary (right/wrong) or scalar (1-5)
- Example: "Did the bot correctly identify the customer's issue?"
Relevance: Does it answer the question?
- Example: User asks about refunds, bot talks about shipping
Completeness: Does it cover all aspects?
- Example: "Reset password" should mention both email link and backup codes
Safety: Does it avoid harmful, biased, or inappropriate content?
- Red lines: toxicity, PII leakage, jailbreak success
Style: Does it match desired tone, length, format?
- Example: "Friendly but professional, under 100 words"
Latency: Did it respond fast enough?
- Example: 95th percentile under 2 seconds
Rubric example: Code generation
metrics:
- name: correctness
type: binary
criteria: "Code runs without errors and produces correct output"
- name: best_practices
type: scale_1_5
criteria: "Follows language conventions, handles errors, includes comments"
- name: security
type: binary
criteria: "No SQL injection, XSS, or other vulnerabilities"
- name: efficiency
type: scale_1_5
criteria: "Appropriate algorithm choice, no obvious performance issues"
Multi-dimensional scoring
Don't rely on a single score. Track multiple dimensions:
{
"accuracy": 4, # Got the right answer
"relevance": 5, # Directly addressed the question
"completeness": 3, # Missed one edge case
"safety": 5, # No issues
"style": 4, # Slightly verbose
"overall": 4.2 # Weighted average
}
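The overall figure is just a weighted average of the dimension scores. A small helper keeps the weights explicit; the weights below are illustrative, not a recommendation.
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted average across rubric dimensions."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

scores = {"accuracy": 4, "relevance": 5, "completeness": 3, "safety": 5, "style": 4}
weights = {"accuracy": 0.3, "relevance": 0.25, "completeness": 0.2, "safety": 0.15, "style": 0.1}
print(round(overall_score(scores, weights), 1))  # 4.2 with these example weights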
Evaluation strategies
1. Exact match
Does the output exactly match the expected answer?
When to use: Structured output (JSON, code, classifications), fact lookup
Example:
def exact_match(output, expected):
return output.strip() == expected.strip()
Pros: Simple, unambiguous
Cons: Brittle; minor variations fail even if semantically correct
2. Semantic similarity
How close is the meaning?
Method: Embed both output and expected answer, measure cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output, expected):
    # Embed both texts and compare with cosine similarity (closer to 1 = more similar)
    emb1 = model.encode(output)
    emb2 = model.encode(expected)
    return util.cos_sim(emb1, emb2).item()
Pros: Handles paraphrasing, focuses on meaning
Cons: Can miss factual errors if phrasing is similar
3. LLM-as-judge
Use an LLM to evaluate another LLM's output.
When to use: Open-ended text, complex quality criteria, nuanced evaluation
Example prompt:
You are evaluating an AI assistant's answer.
Question: {question}
Answer: {answer}
Expected topics: {expected_topics}
Rate the answer on:
1. Accuracy (1-5): Is it factually correct?
2. Relevance (1-5): Does it answer the question?
3. Completeness (1-5): Does it cover all necessary points?
Output as JSON:
{
"accuracy": <score>,
"relevance": <score>,
"completeness": <score>,
"reasoning": "<brief explanation>"
}
Pros: Flexible, understands nuance, scales better than human eval
Cons: Can be biased, inconsistent, expensive
Tips:
- Use strong models (GPT-4, Claude) as judges
- Provide clear rubrics in the prompt
- Validate with human spot-checks
- Use temperature=0 for consistency
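Wiring the prompt above into code might look like the sketch below. It assumes an OpenAI-style client and judge model name, and keeps JSON parsing deliberately minimal; add retries and validation for production use.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.
Question: {question}
Answer: {answer}
Expected topics: {expected_topics}
Rate accuracy, relevance, and completeness from 1 to 5.
Output JSON with keys: accuracy, relevance, completeness, reasoning."""

def llm_as_judge(question, answer, expected_topics, model="gpt-4o"):  # model name is an assumption
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, expected_topics=expected_topics)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    scores = json.loads(response.choices[0].message.content)  # assumes the judge returns bare JSON
    scores["overall"] = (scores["accuracy"] + scores["relevance"] + scores["completeness"]) / 3
    return scores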
4. Custom metrics
Domain-specific quality measures.
Example: Customer support
import re

def eval_support_response(output, context):
    metrics = {}
    # Check if numbered solution steps are included
    metrics['has_solution'] = bool(re.search(r'\d+\.', output))
    # Check tone (get_sentiment is a placeholder for your sentiment analyzer)
    metrics['tone'] = get_sentiment(output)  # 'positive', 'neutral', 'negative'
    # Check for required elements
    required = ['ticket number', 'email', 'resolution']
    metrics['completeness'] = sum(term in output.lower() for term in required) / len(required)
    # Check length (too short = unhelpful, too long = confusing)
    word_count = len(output.split())
    metrics['length_appropriate'] = 50 <= word_count <= 200
    return metrics
5. Traditional NLP metrics
For summarization, translation, generation tasks.
BLEU: Measures n-gram overlap (originally for translation)
ROUGE: Measures recall of n-grams (originally for summarization)
F1: Precision and recall for classification tasks
Example:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])
scores = scorer.score(reference_summary, generated_summary)
Caveat: These correlate poorly with human judgment for creative text. Use as weak signals, not primary metrics.
Automating evaluations
Manual evals don't scale. Automate them to run continuously.
Evaluation pipeline
# eval_pipeline.py
import json
from typing import List, Dict
def load_golden_dataset(path: str) -> List[Dict]:
"""Load test cases from JSON"""
with open(path) as f:
return json.load(f)
def run_model(input: str) -> str:
"""Your AI system"""
# Replace with your actual model call
return your_ai_system.generate(input)
def evaluate_case(case: Dict) -> Dict:
"""Run one test case"""
output = run_model(case['input'])
# Run evaluators
scores = {
'exact_match': exact_match(output, case.get('expected_output')),
'semantic_sim': semantic_similarity(output, case.get('expected_output')),
'llm_judge': llm_as_judge(case['input'], output, case.get('rubric')),
'latency_ms': measure_latency(case['input']),
}
return {
'case_id': case['id'],
'input': case['input'],
'output': output,
'scores': scores,
'passed': scores['llm_judge']['overall'] >= 3.5
}
def run_eval_suite(dataset_path: str) -> Dict:
"""Run full evaluation"""
cases = load_golden_dataset(dataset_path)
results = [evaluate_case(case) for case in cases]
# Aggregate metrics
pass_rate = sum(r['passed'] for r in results) / len(results)
avg_scores = {
'pass_rate': pass_rate,
'avg_latency': sum(r['scores']['latency_ms'] for r in results) / len(results),
# ... other aggregations
}
return {
'results': results,
'summary': avg_scores
}
if __name__ == '__main__':
results = run_eval_suite('datasets/golden_v1.json')
# Save results
with open('eval_results.json', 'w') as f:
json.dump(results, f, indent=2)
# Print summary
print(f"Pass rate: {results['summary']['pass_rate']:.1%}")
CI/CD integration
Run evals automatically on every code change.
Example: GitHub Actions
name: Run Evaluations
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run evaluations
run: python eval_pipeline.py
- name: Check pass threshold
run: |
pass_rate=$(jq '.summary.pass_rate' eval_results.json)
if (( $(echo "$pass_rate < 0.85" | bc -l) )); then
echo "Pass rate below threshold: $pass_rate"
exit 1
fi
- name: Upload results
uses: actions/upload-artifact@v3
with:
name: eval-results
path: eval_results.json
Regression testing
Track performance over time. Flag when quality drops.
# regression_check.py
def check_regressions(current_results, baseline_results, threshold=0.05):
"""Alert if scores drop significantly"""
regressions = []
for metric in ['accuracy', 'relevance', 'completeness']:
current = current_results['summary'][metric]
baseline = baseline_results['summary'][metric]
if current < baseline - threshold:
regressions.append({
'metric': metric,
'current': current,
'baseline': baseline,
'delta': current - baseline
})
return regressions
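A usage sketch, assuming both runs were saved by the pipeline above and that the summary includes per-dimension averages (the pipeline sketch only aggregates pass rate and latency, so extend it accordingly). The baseline filename is an assumption.
import json
import sys

with open("eval_results.json") as f:
    current = json.load(f)
with open("baseline_results.json") as f:  # assumed location of the last accepted run
    baseline = json.load(f)

regressions = check_regressions(current, baseline, threshold=0.05)
if regressions:
    for r in regressions:
        print(f"REGRESSION {r['metric']}: {r['baseline']:.2f} -> {r['current']:.2f} ({r['delta']:+.2f})")
    sys.exit(1)  # non-zero exit fails CI so the drop is investigated before merge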
Debugging failures
Evals catch problems. Now what?
Error analysis workflow
- Group failures by type: Extract common patterns
- Prioritize by impact: How many users hit this? How bad is it?
- Root cause: Is it the prompt? Retrieval? Model? Data?
- Fix and re-eval: Validate the fix doesn't break other cases
Failure categorization example
def categorize_failures(results):
categories = {
'factual_errors': [],
'irrelevant': [],
'incomplete': [],
'safety_issues': [],
'formatting': [],
'other': []
}
for result in results:
if not result['passed']:
# Categorize based on which dimension failed
scores = result['scores']['llm_judge']
if scores['accuracy'] < 3:
categories['factual_errors'].append(result)
elif scores['relevance'] < 3:
categories['irrelevant'].append(result)
elif scores['completeness'] < 3:
categories['incomplete'].append(result)
# ... etc
return categories
Balancing cost, speed, and quality
Evaluation trade-offs:
Human eval:
- Quality: High
- Speed: Slow (hours/days)
- Cost: Expensive ($10-50/hour)
- When: Critical launches, safety validation
LLM-as-judge:
- Quality: Good (with tuning)
- Speed: Medium (seconds/case)
- Cost: Medium ($0.01-0.10/case)
- When: Continuous testing, most production evals
Automated metrics:
- Quality: Weak signal
- Speed: Fast (milliseconds)
- Cost: Cheap (negligible)
- When: Quick feedback, unit tests
Hybrid approach:
- Automated metrics for fast feedback (every commit)
- LLM-as-judge for nightly runs (full test suite)
- Human eval for releases (sample of cases)
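One way to implement the tiers is to gate the expensive evaluators behind an environment variable so the same pipeline serves commits, nightly runs, and release checks. The variable name and tier labels below are assumptions.
import os

EVAL_TIER = os.environ.get("EVAL_TIER", "commit")  # commit | nightly | release

def evaluators_for_tier(tier: str):
    """Cheap metrics always run; LLM-as-judge and human review are opt-in per tier."""
    evaluators = ["exact_match", "semantic_similarity", "custom_metrics"]
    if tier in ("nightly", "release"):
        evaluators.append("llm_as_judge")
    if tier == "release":
        evaluators.append("human_review_sample")
    return evaluators

print(evaluators_for_tier(EVAL_TIER))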
Tools and frameworks
Weights & Biases: Track eval runs, compare experiments, visualize metrics over time. Great for ML teams.
LangSmith: LangChain's eval platform. Built-in LLM-as-judge, dataset management, tracing.
Braintrust: Evals-as-code platform. Version datasets, compare runs, CI/CD integration.
PromptFoo: Open-source CLI for prompt testing. Supports custom metrics, LLM-as-judge.
Ragas: Python library for RAG evaluation. Measures retrieval quality, answer faithfulness, relevance.
Custom pipelines: For full control, build your own. Easier than you think: just test cases + scoring logic + storage.
Practical example: Content moderation
Goal: Evaluate a toxicity filter for user comments.
Golden dataset:
[
{"text": "This is a great post!", "label": "safe"},
{"text": "You're an idiot", "label": "toxic"},
{"text": "I disagree with your point", "label": "safe"},
{"text": "[violent threat]", "label": "toxic"}
]
Rubric:
- Precision: What % of flagged comments are actually toxic?
- Recall: What % of toxic comments are caught?
- F1: Harmonic mean of precision and recall
Eval code:
from sklearn.metrics import precision_recall_fscore_support
def eval_moderation(model, dataset):
predictions = [model.predict(case['text']) for case in dataset]
labels = [case['label'] for case in dataset]
precision, recall, f1, _ = precision_recall_fscore_support(
labels, predictions, average='binary', pos_label='toxic'
)
return {
'precision': precision,
'recall': recall,
'f1': f1
}
CI integration: Run on every model update. Alert if F1 drops below 0.90.
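The alerting step can be a few lines at the end of the eval script; the 0.90 threshold comes from the rubric above, and `model` and `dataset` are the objects passed to eval_moderation earlier.
import sys

results = eval_moderation(model, dataset)
if results["f1"] < 0.90:
    print(f"F1 {results['f1']:.3f} is below the 0.90 threshold; blocking the update.")
    sys.exit(1)  # non-zero exit fails the CI job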
Use responsibly
- Don't overfit to evals: High eval scores don't guarantee good user experience
- Audit for bias: Check if evals fairly represent all user groups
- Version everything: Dataset, rubric, code, model
- Human spot-checks: Regularly validate that LLM-as-judge aligns with human judgment
- Monitor production: Evals are pre-flight checks, not substitutes for live monitoring
What's next?
- Evaluating AI Answers: Learn to spot hallucinations and check accuracy
- Retrieval 201: Evaluate and improve RAG retrieval quality
- Prompting 201: Techniques for improving prompt robustness
- Guardrails and Policy: Enforce safety constraints in production
Related Guides
- Advanced AI Evaluation Frameworks (Advanced): Build comprehensive evaluation systems: automated testing, human-in-the-loop, LLM-as-judge, and continuous monitoring.
- Evaluating AI Answers (Hallucinations, Checks, and Evidence) (Intermediate): How to spot when AI gets it wrong. Practical techniques to verify accuracy, detect hallucinations, and build trust in AI outputs.
- Advanced Prompt Optimization (Advanced): Systematically optimize prompts: automated testing, genetic algorithms, prompt compression, and performance tuning.