TL;DR

Evaluations are automated tests that measure AI quality. Build a golden dataset (test cases with expected outputs), define rubrics (scoring criteria), and run them continuously. Use LLM-as-judge for scalable evaluation, regression tests to catch failures, and custom metrics for domain-specific quality. Good evals are the difference between shipping confidently and crossing your fingers.

Why rigorous evaluation matters

You've deployed your AI feature. It works great in demos. Then production traffic hits and users report nonsensical answers, missed information, or inappropriate tone. What happened?

Manual spot-checks don't scale. You can't test every edge case by hand. You need systematic evaluation—automated tests that run on every code change, catching regressions before they reach users.

Without evals, you're flying blind. With them, you can iterate confidently, knowing exactly when you've improved or broken something.

Building golden datasets

A golden dataset is your test suite for AI: input examples paired with expected outputs or quality scores.

What makes a good golden dataset?

1. Representative: Covers real user scenarios, not just easy cases
2. Diverse: Edge cases, different input types, varying difficulty
3. Labeled: Every example has a ground truth (correct answer, quality score, or both)
4. Versioned: Track changes over time

Sourcing examples

Start with production data. Sample real user queries (remove PII first). These reflect actual usage patterns you need to handle.

Add adversarial cases. Jailbreak attempts, gibberish, hostile input. Test safety guardrails.

Include edge cases. Corner cases that failed before, ambiguous queries, multi-step reasoning.

Balance difficulty. Mix easy wins (to catch catastrophic failures) with hard problems (to measure improvement).

Aim for at least 50-200 examples. Too few and you miss failure patterns; thousands are better, but start small and grow.
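
A minimal sketch of how a first dataset might be assembled under these guidelines. The helpers load_production_queries and scrub_pii are hypothetical placeholders for your own logging and PII-removal steps:

import random

def build_golden_dataset(production_queries, adversarial_cases, edge_cases, n_production=100):
    """Assemble a first golden dataset: mostly real traffic, plus hard cases."""
    sampled = random.sample(production_queries, min(n_production, len(production_queries)))
    # scrub_pii is a placeholder for your PII-removal step
    cases = [{"input": scrub_pii(q), "source": "production"} for q in sampled]
    cases += [{"input": q, "source": "adversarial"} for q in adversarial_cases]
    cases += [{"input": q, "source": "edge_case"} for q in edge_cases]
    for i, case in enumerate(cases):
        case["id"] = f"{i:03d}"  # stable IDs so labels can be attached later
    return cases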

Labeling strategies

Human labeling: Gold standard but expensive. Use for critical evaluations (safety, legal compliance).

AI-assisted labeling: Have an LLM propose labels, humans review. Faster and cheaper.

Synthetic data: Generate examples programmatically. Good for edge cases (dates, numbers, formats) but lacks real-world messiness.
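
For the AI-assisted path, one lightweight workflow is to have the LLM draft labels into a review file that a human then approves or corrects. A sketch, where propose_label is a hypothetical callable wrapping your LLM call:

import csv

def draft_labels_for_review(cases, propose_label, out_path="labels_for_review.csv"):
    """Write LLM-proposed labels to a CSV that a human reviewer approves or corrects."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "input", "proposed_label", "approved_label"])
        writer.writeheader()
        for case in cases:
            writer.writerow({
                "id": case["id"],
                "input": case["input"],
                "proposed_label": propose_label(case["input"]),  # LLM draft
                "approved_label": "",  # filled in by the human reviewer
            })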

Example: Customer support bot

{
  "id": "001",
  "input": "How do I reset my password?",
  "expected_topics": ["password_reset", "authentication"],
  "expected_sentiment": "helpful",
  "must_include": ["password reset link", "email"],
  "must_not_include": ["billing", "refund"],
  "quality_score": 5
}
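
A simple check against the must_include / must_not_include fields of a case like this might look as follows (a sketch, not a full scorer):

def check_required_phrases(output: str, case: dict) -> dict:
    """Check an output against a golden case's phrase constraints."""
    text = output.lower()
    missing = [p for p in case.get("must_include", []) if p.lower() not in text]
    forbidden = [p for p in case.get("must_not_include", []) if p.lower() in text]
    return {
        "missing_phrases": missing,
        "forbidden_phrases": forbidden,
        "passed": not missing and not forbidden,
    }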

Versioning

Track dataset versions like code:

  • v1.0: Initial 50 examples
  • v1.1: Added 20 edge cases from production failures
  • v2.0: Relabeled with new rubric

Use git or a dataset versioning tool (DVC, Weights & Biases datasets). Never evaluate on old versions without documenting it.
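
One lightweight option is to keep a small manifest next to each dataset file and record its version in every eval run (the fields here are illustrative):

{
  "dataset_version": "1.1",
  "created": "2024-06-01",
  "num_examples": 70,
  "changes": "Added 20 edge cases from production failures",
  "rubric_version": "1.0"
}

Storing the dataset_version alongside your eval results makes it obvious later which dataset a given run was scored against.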

Defining evaluation rubrics

A rubric is your scoring criteria: how you judge quality.

Common rubric dimensions

Accuracy: Is the answer factually correct?

  • Binary (right/wrong) or scalar (1-5)
  • Example: "Did the bot correctly identify the customer's issue?"

Relevance: Does it answer the question?

  • Example failure: user asks about refunds, bot talks about shipping

Completeness: Does it cover all aspects?

  • Example: "Reset password" should mention both email link and backup codes

Safety: Does it avoid harmful, biased, or inappropriate content?

  • Red lines: toxicity, PII leakage, jailbreak success

Style: Does it match desired tone, length, format?

  • Example: "Friendly but professional, under 100 words"

Latency: Did it respond fast enough?

  • Example: 95th percentile under 2 seconds

Rubric example: Code generation

metrics:
  - name: correctness
    type: binary
    criteria: "Code runs without errors and produces correct output"

  - name: best_practices
    type: scale_1_5
    criteria: "Follows language conventions, handles errors, includes comments"

  - name: security
    type: binary
    criteria: "No SQL injection, XSS, or other vulnerabilities"

  - name: efficiency
    type: scale_1_5
    criteria: "Appropriate algorithm choice, no obvious performance issues"

Multi-dimensional scoring

Don't rely on a single score. Track multiple dimensions:

{
  "accuracy": 4,      # Got the right answer
  "relevance": 5,     # Directly addressed the question
  "completeness": 3,  # Missed one edge case
  "safety": 5,        # No issues
  "style": 4,         # Slightly verbose
  "overall": 4.2      # Weighted average
}
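
The overall score is just a weighted average of the dimensions. A sketch, where the weights are arbitrary and should reflect what matters for your product:

def overall_score(scores: dict, weights: dict) -> float:
    """Weighted average across rubric dimensions."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

weights = {"accuracy": 0.3, "relevance": 0.25, "completeness": 0.2, "safety": 0.15, "style": 0.1}
scores = {"accuracy": 4, "relevance": 5, "completeness": 3, "safety": 5, "style": 4}
print(round(overall_score(scores, weights), 1))  # 4.2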

Evaluation strategies

1. Exact match

Does the output exactly match the expected answer?

When to use: Structured output (JSON, code, classifications), fact lookup

Example:

def exact_match(output, expected):
    return output.strip() == expected.strip()

Pros: Simple, unambiguous
Cons: Brittle—minor variations fail even if semantically correct

2. Semantic similarity

How close is the meaning?

Method: Embed both output and expected answer, measure cosine similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output, expected):
    emb1 = model.encode(output, convert_to_tensor=True)
    emb2 = model.encode(expected, convert_to_tensor=True)
    return util.cos_sim(emb1, emb2).item()  # cosine similarity; close to 1 means similar meaning

Pros: Handles paraphrasing, focuses on meaning
Cons: Can miss factual errors if phrasing is similar

3. LLM-as-judge

Use an LLM to evaluate another LLM's output.

When to use: Open-ended text, complex quality criteria, nuanced evaluation

Example prompt:

You are evaluating an AI assistant's answer.

Question: {question}
Answer: {answer}
Expected topics: {expected_topics}

Rate the answer on:
1. Accuracy (1-5): Is it factually correct?
2. Relevance (1-5): Does it answer the question?
3. Completeness (1-5): Does it cover all necessary points?

Output as JSON:
{
  "accuracy": <score>,
  "relevance": <score>,
  "completeness": <score>,
  "reasoning": "<brief explanation>"
}

Pros: Flexible, understands nuance, scales better than human eval
Cons: Can be biased, inconsistent, expensive

Tips:

  • Use strong models (GPT-4, Claude) as judges
  • Provide clear rubrics in the prompt
  • Validate with human spot-checks
  • Use temperature=0 for consistency
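
Putting the tips together, a judge call might look like this. A sketch using the OpenAI Python client and JSON mode; adapt the model name and prompt to your setup:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

Question: {question}
Answer: {answer}
Expected topics: {expected_topics}

Rate the answer on accuracy, relevance, and completeness (1-5 each).
Output as JSON with keys: accuracy, relevance, completeness, reasoning."""

def llm_as_judge(question, answer, expected_topics, model="gpt-4o"):
    """Score one answer with an LLM judge; temperature=0 for repeatability."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, expected_topics=expected_topics
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)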

4. Custom metrics

Domain-specific quality measures.

Example: Customer support

import re

def eval_support_response(output, context):
    metrics = {}

    # Check if numbered solution steps are included
    metrics['has_solution'] = bool(re.search(r'\d+\.', output))

    # Check tone (get_sentiment is a placeholder for your sentiment analyzer)
    metrics['tone'] = get_sentiment(output)  # 'positive', 'neutral', 'negative'

    # Check for required elements
    required = ['ticket number', 'email', 'resolution']
    metrics['completeness'] = sum(term in output.lower() for term in required) / len(required)

    # Check length (too short = unhelpful, too long = confusing)
    word_count = len(output.split())
    metrics['length_appropriate'] = 50 <= word_count <= 200

    return metrics

5. Traditional NLP metrics

For summarization, translation, generation tasks.

BLEU: Measures n-gram overlap (originally for translation)
ROUGE: Measures recall of n-grams (originally for summarization)
F1: Harmonic mean of precision and recall for classification tasks

Example:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])
scores = scorer.score(reference_summary, generated_summary)

Caveat: These correlate poorly with human judgment for creative text. Use as weak signals, not primary metrics.

Automating evaluations

Manual evals don't scale. Automate them to run continuously.

Evaluation pipeline

# eval_pipeline.py

import json
from typing import List, Dict

def load_golden_dataset(path: str) -> List[Dict]:
    """Load test cases from JSON"""
    with open(path) as f:
        return json.load(f)

def run_model(prompt: str) -> str:
    """Your AI system (replace with your actual model call)"""
    return your_ai_system.generate(prompt)  # placeholder

def evaluate_case(case: Dict) -> Dict:
    """Run one test case"""
    output = run_model(case['input'])

    # Run evaluators
    scores = {
        'exact_match': exact_match(output, case.get('expected_output', '')),
        'semantic_sim': semantic_similarity(output, case.get('expected_output', '')),
        'llm_judge': llm_as_judge(case['input'], output, case.get('rubric')),
        'latency_ms': measure_latency(case['input']),
    }

    return {
        'case_id': case['id'],
        'input': case['input'],
        'output': output,
        'scores': scores,
        'passed': scores['llm_judge']['overall'] >= 3.5
    }

def run_eval_suite(dataset_path: str) -> Dict:
    """Run full evaluation"""
    cases = load_golden_dataset(dataset_path)
    results = [evaluate_case(case) for case in cases]

    # Aggregate metrics
    pass_rate = sum(r['passed'] for r in results) / len(results)
    avg_scores = {
        'pass_rate': pass_rate,
        'avg_latency': sum(r['scores']['latency_ms'] for r in results) / len(results),
        # ... other aggregations
    }

    return {
        'results': results,
        'summary': avg_scores
    }

if __name__ == '__main__':
    results = run_eval_suite('datasets/golden_v1.json')

    # Save results
    with open('eval_results.json', 'w') as f:
        json.dump(results, f, indent=2)

    # Print summary
    print(f"Pass rate: {results['summary']['pass_rate']:.1%}")

CI/CD integration

Run evals automatically on every code change.

Example: GitHub Actions

name: Run Evaluations

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluations
        run: python eval_pipeline.py
      - name: Check pass threshold
        run: |
          pass_rate=$(jq '.summary.pass_rate' eval_results.json)
          if (( $(echo "$pass_rate < 0.85" | bc -l) )); then
            echo "Pass rate below threshold: $pass_rate"
            exit 1
          fi
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: eval-results
          path: eval_results.json

Regression testing

Track performance over time. Flag when quality drops.

# regression_check.py

def check_regressions(current_results, baseline_results, threshold=0.05):
    """Alert if averaged judge scores drop significantly versus a saved baseline"""
    regressions = []

    for metric in ['accuracy', 'relevance', 'completeness']:
        current = current_results['summary'][metric]
        baseline = baseline_results['summary'][metric]

        if current < baseline - threshold:
            regressions.append({
                'metric': metric,
                'current': current,
                'baseline': baseline,
                'delta': current - baseline
            })

    return regressions
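
A typical usage pattern is to compare the latest run against a committed baseline file and fail CI on any regression. A sketch; the file paths are illustrative:

import json
import sys

if __name__ == '__main__':
    with open('baselines/eval_results_baseline.json') as f:
        baseline = json.load(f)
    with open('eval_results.json') as f:
        current = json.load(f)

    regressions = check_regressions(current, baseline)
    if regressions:
        print(f"Found {len(regressions)} regression(s): {regressions}")
        sys.exit(1)  # fail the CI job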

Debugging failures

Evals catch problems. Now what?

Error analysis workflow

  1. Group failures by type: Extract common patterns
  2. Prioritize by impact: How many users hit this? How bad is it?
  3. Root cause: Is it the prompt? Retrieval? Model? Data?
  4. Fix and re-eval: Validate the fix doesn't break other cases

Failure categorization example

def categorize_failures(results):
    categories = {
        'factual_errors': [],
        'irrelevant': [],
        'incomplete': [],
        'safety_issues': [],
        'formatting': [],
        'other': []
    }

    for result in results:
        if not result['passed']:
            # Categorize based on which dimension failed
            scores = result['scores']['llm_judge']
            if scores['accuracy'] < 3:
                categories['factual_errors'].append(result)
            elif scores['relevance'] < 3:
                categories['irrelevant'].append(result)
            elif scores['completeness'] < 3:
                categories['incomplete'].append(result)
            # ... etc

    return categories

Balancing cost, speed, and quality

Evaluation trade-offs:

Human eval:

  • Quality: High
  • Speed: Slow (hours/days)
  • Cost: Expensive ($10-50/hour)
  • When: Critical launches, safety validation

LLM-as-judge:

  • Quality: Good (with tuning)
  • Speed: Medium (seconds/case)
  • Cost: Medium ($0.01-0.10/case)
  • When: Continuous testing, most production evals

Automated metrics:

  • Quality: Weak signal
  • Speed: Fast (milliseconds)
  • Cost: Cheap (negligible)
  • When: Quick feedback, unit tests

Hybrid approach:

  1. Automated metrics for fast feedback (every commit)
  2. LLM-as-judge for nightly runs (full test suite)
  3. Human eval for releases (sample of cases)
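
One way to encode that tiering is a small config your eval runner reads, choosing evaluators by trigger. A sketch with illustrative names and dataset paths:

# Illustrative tier config for an eval runner
EVAL_TIERS = {
    "commit":  {"evaluators": ["exact_match", "semantic_similarity"],
                "dataset": "datasets/golden_smoke.json"},   # fast, cheap signals on every push
    "nightly": {"evaluators": ["exact_match", "semantic_similarity", "llm_as_judge"],
                "dataset": "datasets/golden_v1.json"},       # full suite with LLM-as-judge
    "release": {"evaluators": ["llm_as_judge"],
                "human_review_sample": 50},                   # sample sent for human review
}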

Tools and frameworks

Weights & Biases: Track eval runs, compare experiments, visualize metrics over time. Great for ML teams.

LangSmith: LangChain's eval platform. Built-in LLM-as-judge, dataset management, tracing.

Braintrust: Evals-as-code platform. Version datasets, compare runs, CI/CD integration.

PromptFoo: Open-source CLI for prompt testing. Supports custom metrics, LLM-as-judge.

Ragas: Python library for RAG evaluation. Measures retrieval quality, answer faithfulness, relevance.

Custom pipelines: For full control, build your own. Easier than you think—just test cases + scoring logic + storage.

Practical example: Content moderation

Goal: Evaluate a toxicity filter for user comments.

Golden dataset:

[
  {"text": "This is a great post!", "label": "safe"},
  {"text": "You're an idiot", "label": "toxic"},
  {"text": "I disagree with your point", "label": "safe"},
  {"text": "[violent threat]", "label": "toxic"}
]

Rubric:

  • Precision: What % of flagged comments are actually toxic?
  • Recall: What % of toxic comments are caught?
  • F1: Harmonic mean of precision and recall

Eval code:

from sklearn.metrics import precision_recall_fscore_support

def eval_moderation(model, dataset):
    predictions = [model.predict(case['text']) for case in dataset]
    labels = [case['label'] for case in dataset]

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary', pos_label='toxic'
    )

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

CI integration: Run on every model update. Alert if F1 drops below 0.90.
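
The alerting piece can be as simple as a threshold check in the eval script. A sketch that reuses load_golden_dataset from the pipeline above; the dataset path is illustrative and `model` stands in for your deployed classifier:

if __name__ == '__main__':
    dataset = load_golden_dataset('datasets/moderation_v1.json')  # illustrative path
    metrics = eval_moderation(model, dataset)  # `model` is your deployed classifier
    print(metrics)
    if metrics['f1'] < 0.90:
        raise SystemExit(f"F1 {metrics['f1']:.3f} is below the 0.90 threshold")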

Use responsibly

  • Don't overfit to evals: High eval scores don't guarantee good user experience
  • Audit for bias: Check if evals fairly represent all user groups
  • Version everything: Dataset, rubric, code, model
  • Human spot-checks: Regularly validate that LLM-as-judge aligns with human judgment
  • Monitor production: Evals are pre-flight checks, not substitutes for live monitoring

What's next?

  • Evaluating AI Answers: Learn to spot hallucinations and check accuracy
  • Retrieval 201: Evaluate and improve RAG retrieval quality
  • Prompting 201: Techniques for improving prompt robustness
  • Guardrails and Policy: Enforce safety constraints in production