TL;DR

Advanced evaluation combines automated metrics, LLM-based judging, human review, and production monitoring. Build comprehensive test suites covering accuracy, safety, and edge cases.

Evaluation dimensions

Accuracy: Correctness of outputs
Relevance: On-topic, addresses query
Coherence: Logical, well-structured
Safety: No harmful content
Groundedness: Based on provided context
Consistency: Similar inputs → similar outputs
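
One way to make these dimensions operational is to record a structured score per response. A minimal sketch in Python; the dataclass, field names, and 1-5 scale are illustrative, not tied to any particular framework:

    from dataclasses import dataclass

    @dataclass
    class EvalScores:
        """Per-response scores (1-5) for each evaluation dimension."""
        accuracy: int
        relevance: int
        coherence: int
        safety: int
        groundedness: int
        consistency: int

        def passes(self, threshold: int = 4) -> bool:
            # A response passes only if every dimension meets the threshold.
            return all(
                score >= threshold
                for score in (self.accuracy, self.relevance, self.coherence,
                              self.safety, self.groundedness, self.consistency)
            )

    # Example: record scores for one response and apply the overall gate.
    scores = EvalScores(accuracy=5, relevance=4, coherence=5,
                        safety=5, groundedness=4, consistency=4)
    print(scores.passes())  # True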

LLM-as-judge

Use a strong model (e.g., GPT-4) to evaluate a weaker model's outputs:

  • Define rubric
  • Provide reference answers (optional)
  • LLM scores responses (1-5 or pass/fail)

Advantages: Scalable, cheaper than human review, captures nuance
Limitations: Judge-model biases (e.g., favoring verbose or similarly styled answers), imperfect correlation with human judgment
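
A minimal LLM-as-judge sketch, assuming the openai Python client (v1+) and a hypothetical rubric; the model name, rubric wording, and JSON schema are illustrative:

    import json
    from openai import OpenAI  # assumes the openai>=1.0 client

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RUBRIC = (
        "Score the RESPONSE to the QUESTION from 1 (poor) to 5 (excellent) on accuracy "
        "(factual correctness) and groundedness (every claim supported by the CONTEXT). "
        'Return JSON: {"accuracy": <1-5>, "groundedness": <1-5>, "rationale": "<one sentence>"}'
    )

    def judge(question: str, context: str, response: str, model: str = "gpt-4o") -> dict:
        """Ask a strong model to grade a response against the rubric."""
        completion = client.chat.completions.create(
            model=model,
            temperature=0,  # keep grading as deterministic as possible
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": (
                    f"QUESTION:\n{question}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}"
                )},
            ],
        )
        # A production judge should enforce JSON output or retry on parse errors.
        return json.loads(completion.choices[0].message.content)

    # Example (requires an API key):
    # judge("What is the capital of France?", "France's capital is Paris.", "Paris.")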

Human evaluation

When needed:

  • Initial benchmarking
  • Periodic audits
  • Edge cases
  • Sensitive applications

Best practices:

  • Clear rubrics
  • Multiple raters
  • Inter-rater agreement
  • Representative samples
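
For inter-rater agreement with two raters, Cohen's kappa is one common metric; a sketch using scikit-learn, with illustrative pass/fail labels:

    from sklearn.metrics import cohen_kappa_score

    # Pass/fail labels from two raters on the same ten responses (toy data).
    rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # 0.47 for this toy data

Values above roughly 0.6 are usually read as substantial agreement; low kappa signals the rubric needs tightening before the labels are trusted.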

Automated test suites

Build datasets with expected outputs:

  • Golden test sets
  • Regression tests
  • Adversarial examples
  • Edge cases

Run the suite on every model or prompt change and track metrics over time.
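
A minimal regression harness, assuming a golden_set.jsonl file with "input" and "expected" fields and a generate() wrapper around your model (both are placeholders):

    import json

    def generate(prompt: str) -> str:
        """Placeholder for your model call (API request, local inference, etc.)."""
        raise NotImplementedError

    def exact_match(output: str, expected: str) -> bool:
        # Simplest possible check; swap in semantic similarity or an LLM judge for free-form outputs.
        return output.strip().lower() == expected.strip().lower()

    def run_golden_suite(path: str = "golden_set.jsonl") -> float:
        """Run every golden case and return the pass rate."""
        with open(path, encoding="utf-8") as f:
            cases = [json.loads(line) for line in f]
        passed = 0
        for case in cases:
            output = generate(case["input"])
            ok = exact_match(output, case["expected"])
            passed += ok
            if not ok:
                print(f"FAIL: {case['input']!r} -> {output!r} (expected {case['expected']!r})")
        pass_rate = passed / len(cases)
        print(f"{passed}/{len(cases)} passed ({pass_rate:.0%})")
        return pass_rate

Wire this into CI and fail the build if the pass rate drops below the current baseline.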

A/B testing in production

Compare models or prompts on real traffic:

  • Randomly assign users or sessions to variants
  • Track business and quality metrics
  • Wait for statistical significance before declaring a winner
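
A sketch of deterministic assignment plus a two-proportion z-test on a success metric such as thumbs-up rate; the bucketing scheme, variant names, and counts are illustrative:

    import hashlib
    import math

    def assign_variant(user_id: str, variants=("model_a", "model_b")) -> str:
        """Deterministic assignment: the same user always sees the same variant."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(variants)
        return variants[bucket]

    def two_proportion_z_test(success_a: int, total_a: int, success_b: int, total_b: int) -> float:
        """Two-sided p-value for a difference in success rates between variants."""
        p_a, p_b = success_a / total_a, success_b / total_b
        p_pool = (success_a + success_b) / (total_a + total_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
        z = (p_a - p_b) / se
        return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - normal CDF of |z|)

    # Illustrative counts: variant B's thumbs-up rate looks higher; check significance.
    p_value = two_proportion_z_test(success_a=480, total_a=1000, success_b=535, total_b=1000)
    print(f"p = {p_value:.3f}")

Decide only after reaching a pre-planned sample size; peeking at p-values as data arrives inflates false positives.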

Continuous monitoring

  • Sample production outputs
  • Automated scoring
  • Alert on degradation
  • Human review of failures
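
A monitoring-loop sketch: sample a fraction of production responses, score them automatically, and alert when a rolling pass rate degrades. The sample rate, window, threshold, and alert hook are all assumptions:

    import random
    from collections import deque

    SAMPLE_RATE = 0.05                 # score ~5% of production traffic
    ALERT_THRESHOLD = 0.90             # alert if the rolling pass rate drops below 90%
    recent_scores = deque(maxlen=500)  # rolling window of automated pass/fail results

    def automated_score(prompt: str, output: str) -> bool:
        """Placeholder: an LLM judge, heuristic checks, or both."""
        raise NotImplementedError

    def send_alert(message: str) -> None:
        """Placeholder: page on-call, post to a channel, open a ticket."""
        print(f"ALERT: {message}")

    def on_production_response(prompt: str, output: str) -> None:
        """Hook this into the serving path or a log consumer for each response."""
        if random.random() > SAMPLE_RATE:
            return
        recent_scores.append(automated_score(prompt, output))
        pass_rate = sum(recent_scores) / len(recent_scores)
        if len(recent_scores) >= 100 and pass_rate < ALERT_THRESHOLD:
            send_alert(f"Quality degraded: rolling pass rate {pass_rate:.1%}")
            # Route the failing samples to human review from here.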

Evaluation frameworks

  • HELM: Holistic Evaluation of Language Models (Stanford)
  • OpenAI Evals: Open-source evaluation framework
  • LangSmith: Evaluation and tracing from the LangChain team
  • Custom: Build on your own infrastructure