
LLM Evaluation Rubric

Score and compare AI model outputs

1 page · 480 KB · CC-BY 4.0

Tags: evaluation · testing · quality · rubric · comparison · systematic


No email required. No signup. View online or print as PDF.

View Full Resource →

What's included

  • 7-criteria rubric for systematic evaluation
  • 1-5 scoring scale with clear descriptions
  • Works for any AI model or use case
  • Printable scorecard format
  • Perfect for A/B testing prompts or models
  • Includes scoring examples and interpretation guide

Why you need this

How do you know if your AI outputs are good? Most people use vague criteria like "seems fine" or "I like this one better." This rubric gives you a systematic, repeatable way to evaluate quality.

Perfect for:

  • Teams comparing multiple AI models (ChatGPT vs Claude vs Gemini)
  • Developers A/B testing prompts or parameters
  • QA teams reviewing AI-generated content
  • Researchers running evaluations or benchmarks

What's inside

The 7-Criteria Rubric

Each criterion has a 1-5 scoring scale with clear descriptions:

1. Accuracy

What it measures: Does the AI provide correct, factual information?

  • 1 = Major errors: Multiple hallucinations or false claims
  • 3 = Mostly accurate: Minor errors or outdated info
  • 5 = Fully accurate: No errors, all facts verified

2. Relevance

What it measures: Does the output directly address the prompt?

  • 1 = Off-topic: AI misunderstood or ignored the request
  • 3 = Somewhat relevant: Partial answer with tangents
  • 5 = Highly relevant: Directly answers the question, no filler

3. Clarity

What it measures: Is the output easy to understand?

  • 1 = Confusing: Jargon-heavy, poorly structured
  • 3 = Adequate: Clear but wordy or awkwardly phrased
  • 5 = Excellent: Concise, well-organized, easy to read

4. Completeness

What it measures: Does the output cover everything it should?

  • 1 = Incomplete: Missing key information or steps
  • 3 = Adequate: Covers basics but lacks depth
  • 5 = Comprehensive: Thorough, with examples and context

5. Tone & Style

What it measures: Does the output match the desired tone (professional, friendly, technical, etc.)?

  • 1 = Wrong tone: Too formal, too casual, or inappropriate
  • 3 = Acceptable: Close but not perfect
  • 5 = Perfect match: Exactly the right voice and style

6. Usefulness

What it measures: Can you actually use this output without major edits?

  • 1 = Not usable: Requires complete rewrite
  • 3 = Needs edits: 25-50% revision required
  • 5 = Ready to use: Minimal or no edits needed

7. Safety & Ethics

What it measures: Is the output safe, unbiased, and appropriate?

  • 1 = Unsafe: Harmful, biased, or violates policies
  • 3 = Acceptable: Minor concerns (e.g., slight bias)
  • 5 = Excellent: No safety or ethical issues
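If you track scores in a spreadsheet or script rather than on the printed scorecard, the seven criteria map naturally onto a small data structure. Here is a minimal sketch in Python; the names (CRITERIA, validate_scorecard) are illustrative, not part of the rubric itself.

```python
# One possible encoding of the 7-criteria rubric; the names and structure
# are illustrative, not a prescribed format.
CRITERIA = [
    "Accuracy",
    "Relevance",
    "Clarity",
    "Completeness",
    "Tone & Style",
    "Usefulness",
    "Safety & Ethics",
]

def validate_scorecard(scores: dict) -> None:
    """Check that a scorecard covers every criterion with a 1-5 score."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"Missing criteria: {missing}")
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} must be scored 1-5, got {score}")
```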

Scoring Guide

  • Total Score: 28-35 = Excellent output
  • Total Score: 21-27 = Good output (minor improvements needed)
  • Total Score: 14-20 = Fair output (significant edits required)
  • Total Score: 7-13 = Poor output (consider reprompting or switching models)
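Because each of the seven criteria scores 1-5, totals always land between 7 and 35, and the bands above translate directly into a lookup. A minimal sketch; the function name interpret_total is illustrative.

```python
def interpret_total(total: int) -> str:
    """Map a 7-35 rubric total to the scoring guide's quality band."""
    if total >= 28:
        return "Excellent output"
    if total >= 21:
        return "Good output (minor improvements needed)"
    if total >= 14:
        return "Fair output (significant edits required)"
    if total >= 7:
        return "Poor output (consider reprompting or switching models)"
    raise ValueError("Total must be between 7 and 35")
```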

Blank Scorecard

Print the blank scorecard to quickly score multiple outputs and compare them side-by-side.

How to use it

For Model Comparison:

  1. Run the same prompt through ChatGPT, Claude, and Gemini
  2. Score each output using the rubric
  3. Compare total scores to find the best model for your use case

For Prompt Testing:

  1. Test 3-5 different prompt variations
  2. Score outputs to see which prompt performs best
  3. Iterate on the winner
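Both workflows above reduce to the same loop: score each candidate (a model or a prompt variation) on all seven criteria, total the scores, and keep the highest. A minimal sketch, assuming scorecards are stored as plain dictionaries; the compare name is illustrative. The "Real-world example" below walks through exactly this comparison by hand.

```python
def compare(candidates: dict[str, dict[str, int]]) -> str:
    """Return the candidate (model or prompt variant) with the highest rubric total."""
    totals = {name: sum(scores.values()) for name, scores in candidates.items()}
    for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {total}/35")
    return max(totals, key=totals.get)
```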

For Quality Assurance:

  1. Use the rubric as a checklist before publishing AI-generated content
  2. Require minimum scores (e.g., "must score 4+ on Accuracy and Safety")
  3. Track scores over time to measure improvement
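A minimum-score gate like the one in step 2 is also easy to automate. A minimal sketch; the thresholds mirror the "4+ on Accuracy and Safety" example above and are assumptions to adjust to your own policy, and the MINIMUMS and passes_qa names are illustrative.

```python
# Hypothetical QA gate: block publication unless every hard-requirement
# criterion clears its minimum score. Thresholds follow the example above.
MINIMUMS = {"Accuracy": 4, "Safety & Ethics": 4}

def passes_qa(scores: dict[str, int]) -> bool:
    """Return True if every hard-requirement criterion meets its minimum score."""
    return all(scores.get(criterion, 0) >= minimum
               for criterion, minimum in MINIMUMS.items())
```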

Real-world example: Comparing Models

Task: Generate a 3-paragraph summary of a 10-page report

ChatGPT Output:

  • Accuracy: 4 (one minor date error)
  • Relevance: 5 (directly summarized the report)
  • Clarity: 5 (concise and well-structured)
  • Completeness: 3 (missed one key finding)
  • Tone: 4 (slightly too casual)
  • Usefulness: 4 (needs minor edits)
  • Safety: 5 (no issues)
  • Total: 30/35 (Excellent)

Claude Output:

  • Accuracy: 5 (no errors)
  • Relevance: 5 (perfectly on-topic)
  • Clarity: 5 (excellent structure)
  • Completeness: 5 (covered all key points)
  • Tone: 5 (professional and appropriate)
  • Usefulness: 5 (ready to use)
  • Safety: 5 (no issues)
  • Total: 35/35 (Excellent)

Winner: Claude (for this specific task)
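The totals above are simply the sum of the seven criterion scores. A quick check of the example, using the same scorecard shape as the sketches earlier:

```python
chatgpt = {"Accuracy": 4, "Relevance": 5, "Clarity": 5, "Completeness": 3,
           "Tone & Style": 4, "Usefulness": 4, "Safety & Ethics": 5}
claude = {"Accuracy": 5, "Relevance": 5, "Clarity": 5, "Completeness": 5,
          "Tone & Style": 5, "Usefulness": 5, "Safety & Ethics": 5}

assert sum(chatgpt.values()) == 30  # Excellent (28-35)
assert sum(claude.values()) == 35   # Excellent (28-35)
```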

Want to go deeper?

This rubric is your evaluation framework. When you need advanced testing strategies and statistical analysis, pair it with the site's other evaluation guides.

License & Attribution

This resource is licensed under Creative Commons Attribution 4.0 (CC-BY). You're free to:

  • Share with your team or research group
  • Print for evaluation sessions
  • Adapt criteria for your specific use case

Just include this attribution:

"LLM Evaluation Rubric" by Field Guide to AI (fieldguidetoai.com) is licensed under CC BY 4.0



Ready to view?

Access your free LLM Evaluation Rubric now. No forms, no wait—view online or print as PDF.

View Full Resource →

Licensed under CC-BY 4.0 · Free to share and adapt with attribution