Why you need this
How do you know if your AI outputs are good? Most people use vague criteria like "seems fine" or "I like this one better." This rubric gives you a systematic, repeatable way to evaluate quality.
Perfect for:
- Teams comparing multiple AI models (ChatGPT vs Claude vs Gemini)
- Developers A/B testing prompts or parameters
- QA teams reviewing AI-generated content
- Researchers running evaluations or benchmarks
What's inside
The 7-Criteria Rubric
Each criterion has a 1-5 scoring scale with clear descriptions:
1. Accuracy
What it measures: Does the AI provide correct, factual information?
- 1 = Major errors: Multiple hallucinations or false claims
- 3 = Mostly accurate: Minor errors or outdated info
- 5 = Fully accurate: No errors, all facts verified
2. Relevance
What it measures: Does the output directly address the prompt?
- 1 = Off-topic: AI misunderstood or ignored the request
- 3 = Somewhat relevant: Partial answer with tangents
- 5 = Highly relevant: Directly answers the question, no filler
3. Clarity
What it measures: Is the output easy to understand?
- 1 = Confusing: Jargon-heavy, poorly structured
- 3 = Adequate: Clear but wordy or awkwardly phrased
- 5 = Excellent: Concise, well-organized, easy to read
4. Completeness
What it measures: Does the output cover everything it should?
- 1 = Incomplete: Missing key information or steps
- 3 = Adequate: Covers basics but lacks depth
- 5 = Comprehensive: Thorough, with examples and context
5. Tone & Style
What it measures: Does the output match the desired tone (professional, friendly, technical, etc.)?
- 1 = Wrong tone: Too formal, too casual, or inappropriate
- 3 = Acceptable: Close but not perfect
- 5 = Perfect match: Exactly the right voice and style
6. Usefulness
What it measures: Can you actually use this output without major edits?
- 1 = Not usable: Requires complete rewrite
- 3 = Needs edits: 25-50% revision required
- 5 = Ready to use: Minimal or no edits needed
7. Safety & Ethics
What it measures: Is the output safe, unbiased, and appropriate?
- 1 = Unsafe: Harmful, biased, or violates policies
- 3 = Acceptable: Minor concerns (e.g., slight bias)
- 5 = Excellent: No safety or ethical issues
Scoring Guide
- Total Score: 28-35 = Excellent output
- Total Score: 21-27 = Good output (minor improvements needed)
- Total Score: 14-20 = Fair output (significant edits required)
- Total Score: 7-13 = Poor output (consider reprompting or switching models)
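If you keep scores in a script or spreadsheet, totaling and banding follow directly from the scales above. Here's a minimal Python sketch; the Scorecard class and its field names are just one way to lay it out, and the example values reuse the ChatGPT scores from the worked comparison further down:

```python
from dataclasses import dataclass, fields

@dataclass
class Scorecard:
    """One AI output, rated 1-5 on each of the seven criteria."""
    accuracy: int
    relevance: int
    clarity: int
    completeness: int
    tone_and_style: int
    usefulness: int
    safety_and_ethics: int

    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))

    def band(self) -> str:
        t = self.total()
        if t >= 28:
            return "Excellent"
        if t >= 21:
            return "Good (minor improvements needed)"
        if t >= 14:
            return "Fair (significant edits required)"
        return "Poor (consider reprompting or switching models)"

# The ChatGPT scores from the model-comparison example below
card = Scorecard(4, 5, 5, 3, 4, 4, 5)
print(card.total(), card.band())  # 30 Excellent
```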
Blank Scorecard
Print the blank scorecard to quickly score multiple outputs and compare them side by side.
How to use it
For Model Comparison:
- Run the same prompt through ChatGPT, Claude, and Gemini
- Score each output using the rubric
- Compare total scores to find the best model for your use case
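Here's a sketch of that comparison step in plain Python: total the seven 1-5 ratings per model and rank them. The ChatGPT and Claude rows mirror the worked example below; the Gemini row is a hypothetical placeholder. The same pattern works for comparing prompt variations.

```python
# Seven ratings per candidate, in rubric order:
# Accuracy, Relevance, Clarity, Completeness, Tone & Style, Usefulness, Safety & Ethics
scores = {
    "ChatGPT": [4, 5, 5, 3, 4, 4, 5],   # from the worked example below
    "Claude":  [5, 5, 5, 5, 5, 5, 5],   # from the worked example below
    "Gemini":  [4, 4, 5, 4, 5, 4, 5],   # hypothetical placeholder values
}

totals = {model: sum(ratings) for model, ratings in scores.items()}
for model, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {total}/35")
print("Winner:", max(totals, key=totals.get))
```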
For Prompt Testing:
- Test 3-5 different prompt variations
- Score outputs to see which prompt performs best
- Iterate on the winner
For Quality Assurance:
- Use the rubric as a checklist before publishing AI-generated content
- Require minimum scores (e.g., "must score 4+ on Accuracy and Safety")
- Track scores over time to measure improvement
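For QA gating, here's a minimal sketch that assumes scores are kept as a criterion-to-rating dict. The per-criterion floors follow the example rule above (4+ on Accuracy and Safety); the total-score floor of 28 (the "Excellent" band) is an added assumption you can loosen or drop.

```python
FLOORS = {"Accuracy": 4, "Safety & Ethics": 4}  # example rule: "must score 4+ on Accuracy and Safety"
MIN_TOTAL = 28                                  # assumption: require the "Excellent" band overall

def passes_qa(scores: dict) -> bool:
    """scores maps each criterion name to its 1-5 rating."""
    meets_floors = all(scores.get(criterion, 0) >= floor for criterion, floor in FLOORS.items())
    return meets_floors and sum(scores.values()) >= MIN_TOTAL

example = {"Accuracy": 4, "Relevance": 5, "Clarity": 5, "Completeness": 3,
           "Tone & Style": 4, "Usefulness": 4, "Safety & Ethics": 5}
print(passes_qa(example))  # True: both floors met and the total is 30
```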
Real-world example: Comparing Models
Task: Generate a 3-paragraph summary of a 10-page report
ChatGPT Output:
- Accuracy: 4 (one minor date error)
- Relevance: 5 (directly summarized the report)
- Clarity: 5 (concise and well-structured)
- Completeness: 3 (missed one key finding)
- Tone: 4 (slightly too casual)
- Usefulness: 4 (needs minor edits)
- Safety: 5 (no issues)
- Total: 30/35 (Excellent)
Claude Output:
- Accuracy: 5 (no errors)
- Relevance: 5 (perfectly on-topic)
- Clarity: 5 (excellent structure)
- Completeness: 5 (covered all key points)
- Tone: 5 (professional and appropriate)
- Usefulness: 5 (ready to use)
- Safety: 5 (no issues)
- Total: 35/35 (Excellent)
Winner: Claude (for this specific task)
Want to go deeper?
This rubric is your evaluation framework. For advanced testing strategies and statistical analysis:
- Prompt Engineering Basics → Improve your prompts before evaluating
- AI Safety Basics → Understanding safety and ethics criteria
- Glossary: Hallucination → Measuring accuracy issues
License & Attribution
This resource is licensed under Creative Commons Attribution 4.0 (CC BY 4.0). You're free to:
- Share with your team or research group
- Print for evaluation sessions
- Adapt criteria for your specific use case
Just include this attribution:
"LLM Evaluation Rubric" by Field Guide to AI (fieldguidetoai.com) is licensed under CC BY 4.0
Download now
Click below for instant access. No signup required.