Why you need this
How do you know if your AI outputs are good? Most people use vague criteria like "seems fine" or "I like this one better." This rubric gives you a systematic, repeatable way to evaluate quality.
Perfect for:
- Teams comparing multiple AI models (ChatGPT vs Claude vs Gemini)
- Developers A/B testing prompts or parameters
- QA teams reviewing AI-generated content
- Researchers running evaluations or benchmarks
What's inside
The 7-Criteria Rubric
Each criterion has a 1-5 scoring scale with clear descriptions:
1. Accuracy
What it measures: Does the AI provide correct, factual information?
- 1 = Major errors: Multiple hallucinations or false claims
- 3 = Mostly accurate: Minor errors or outdated info
- 5 = Fully accurate: No errors, all facts verified
2. Relevance
What it measures: Does the output directly address the prompt?
- 1 = Off-topic: AI misunderstood or ignored the request
- 3 = Somewhat relevant: Partial answer with tangents
- 5 = Highly relevant: Directly answers the question, no filler
3. Clarity
What it measures: Is the output easy to understand?
- 1 = Confusing: Jargon-heavy, poorly structured
- 3 = Adequate: Clear but wordy or awkwardly phrased
- 5 = Excellent: Concise, well-organized, easy to read
4. Completeness
What it measures: Does the output cover everything it should?
- 1 = Incomplete: Missing key information or steps
- 3 = Adequate: Covers basics but lacks depth
- 5 = Comprehensive: Thorough, with examples and context
5. Tone & Style
What it measures: Does the output match the desired tone (professional, friendly, technical, etc.)?
- 1 = Wrong tone: Too formal, too casual, or inappropriate
- 3 = Acceptable: Close but not perfect
- 5 = Perfect match: Exactly the right voice and style
6. Usefulness
What it measures: Can you actually use this output without major edits?
- 1 = Not usable: Requires complete rewrite
- 3 = Needs edits: 25-50% revision required
- 5 = Ready to use: Minimal or no edits needed
7. Safety & Ethics
What it measures: Is the output safe, unbiased, and appropriate?
- 1 = Unsafe: Harmful, biased, or violates policies
- 3 = Acceptable: Minor concerns (e.g., slight bias)
- 5 = Excellent: No safety or ethical issues
Scoring Guide
- Total Score: 28-35 = Excellent output
- Total Score: 21-27 = Good output (minor improvements needed)
- Total Score: 14-20 = Fair output (significant edits required)
- Total Score: 7-13 = Poor output (consider reprompting or switching models)
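If you keep scores in a script or spreadsheet, totaling and banding follow directly from the scales above. Here's a minimal Python sketch; the Scorecard class and its field names are just one way to lay it out, and the example values reuse the ChatGPT scores from the worked comparison further down:

```python
from dataclasses import dataclass, fields

@dataclass
class Scorecard:
    """One AI output, rated 1-5 on each of the seven criteria."""
    accuracy: int
    relevance: int
    clarity: int
    completeness: int
    tone_and_style: int
    usefulness: int
    safety_and_ethics: int

    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))

    def band(self) -> str:
        t = self.total()
        if t >= 28:
            return "Excellent"
        if t >= 21:
            return "Good (minor improvements needed)"
        if t >= 14:
            return "Fair (significant edits required)"
        return "Poor (consider reprompting or switching models)"

# The ChatGPT scores from the model-comparison example below
card = Scorecard(4, 5, 5, 3, 4, 4, 5)
print(card.total(), card.band())  # 30 Excellent
```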
Blank Scorecard
Print the blank scorecard to quickly score multiple outputs and compare them side by side.
How to use it
For Model Comparison:
- Run the same prompt through ChatGPT, Claude, and Gemini
- Score each output using the rubric
- Compare total scores to find the best model for your use case
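Here's a sketch of that comparison step in plain Python: total the seven 1-5 ratings per model and rank them. The ChatGPT and Claude rows mirror the worked example below; the Gemini row is a hypothetical placeholder. The same pattern works for comparing prompt variations.

```python
# Seven ratings per candidate, in rubric order:
# Accuracy, Relevance, Clarity, Completeness, Tone & Style, Usefulness, Safety & Ethics
scores = {
    "ChatGPT": [4, 5, 5, 3, 4, 4, 5],   # from the worked example below
    "Claude":  [5, 5, 5, 5, 5, 5, 5],   # from the worked example below
    "Gemini":  [4, 4, 5, 4, 5, 4, 5],   # hypothetical placeholder values
}

totals = {model: sum(ratings) for model, ratings in scores.items()}
for model, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {total}/35")
print("Winner:", max(totals, key=totals.get))
```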
For Prompt Testing:
- Test 3-5 different prompt variations
- Score outputs to see which prompt performs best
- Iterate on the winner
For Quality Assurance:
- Use the rubric as a checklist before publishing AI-generated content
- Require minimum scores (e.g., "must score 4+ on Accuracy and Safety")
- Track scores over time to measure improvement
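For QA gating, here's a minimal sketch that assumes scores are kept as a criterion-to-rating dict. The per-criterion floors follow the example rule above (4+ on Accuracy and Safety); the total-score floor of 28 (the "Excellent" band) is an added assumption you can loosen or drop.

```python
FLOORS = {"Accuracy": 4, "Safety & Ethics": 4}  # example rule: "must score 4+ on Accuracy and Safety"
MIN_TOTAL = 28                                  # assumption: require the "Excellent" band overall

def passes_qa(scores: dict) -> bool:
    """scores maps each criterion name to its 1-5 rating."""
    meets_floors = all(scores.get(criterion, 0) >= floor for criterion, floor in FLOORS.items())
    return meets_floors and sum(scores.values()) >= MIN_TOTAL

example = {"Accuracy": 4, "Relevance": 5, "Clarity": 5, "Completeness": 3,
           "Tone & Style": 4, "Usefulness": 4, "Safety & Ethics": 5}
print(passes_qa(example))  # True: both floors met and the total is 30
```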
Real-world example: Comparing Models
Task: Generate a 3-paragraph summary of a 10-page report
ChatGPT Output:
- Accuracy: 4 (one minor date error)
- Relevance: 5 (directly summarized the report)
- Clarity: 5 (concise and well-structured)
- Completeness: 3 (missed one key finding)
- Tone: 4 (slightly too casual)
- Usefulness: 4 (needs minor edits)
- Safety: 5 (no issues)
- Total: 30/35 (Excellent)
Claude Output:
- Accuracy: 5 (no errors)
- Relevance: 5 (perfectly on-topic)
- Clarity: 5 (excellent structure)
- Completeness: 5 (covered all key points)
- Tone: 5 (professional and appropriate)
- Usefulness: 5 (ready to use)
- Safety: 5 (no issues)
- Total: 35/35 (Excellent)
Winner: Claude (for this specific task)
Want to go deeper?
This rubric is your evaluation framework. For advanced testing strategies and statistical analysis:
- Prompt Engineering Basics → Improve your prompts before evaluating
- AI Safety Basics → Understanding safety and ethics criteria
- Glossary: Hallucination → Measuring accuracy issues
License & Attribution
This resource is licensed under Creative Commons Attribution 4.0 (CC BY 4.0). You're free to:
- Share with your team or research group
- Print for evaluation sessions
- Adapt criteria for your specific use case
Just include this attribution:
"LLM Evaluation Rubric" by Field Guide to AI (fieldguidetoai.com) is licensed under CC BY 4.0
Download now
Click below for instant access. No signup required.