Evaluation (Evals)
Also known as: Evals, Model Evaluation, Testing
In one sentence
Systematically testing an AI system to measure how well it performs on specific tasks, criteria, or safety requirements.
Explain like I'm 12
Like giving an AI a report card: running lots of tests to see if it gives good answers, stays safe, and does what you want. Just like exams at school, evals show where the AI is strong and where it needs to improve.
In context
Evaluations are essential throughout the AI lifecycle. Before deploying a chatbot, a company might run hundreds of test conversations to check accuracy, tone, and safety. Common eval types include automated benchmarks (standardised tests like MMLU that compare models), human evaluation (people rate outputs for quality), and A/B testing (comparing two model versions with real users). Tools like OpenAI's Evals framework and LangSmith help teams run evals at scale. Companies building AI products typically run evals after every model update to catch regressions.
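The simplest automated eval is a "golden set": a list of prompts with known correct answers, graded by exact match. Below is a minimal sketch of that idea in Python; `fake_model` and the three test cases are hypothetical stand-ins for a real model call and a real dataset, not part of any particular framework.

```python
def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (e.g. an API request).
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")

# A tiny illustrative golden set: prompts paired with expected answers.
golden_set = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Largest planet?", "expected": "Jupiter"},
]

def run_eval(model, cases):
    """Return the fraction of cases where the model's output exactly
    matches the expected answer (exact-match grading)."""
    passed = sum(1 for c in cases if model(c["prompt"]) == c["expected"])
    return passed / len(cases)

score = run_eval(fake_model, golden_set)
print(f"exact-match accuracy: {score:.0%}")  # 2 of 3 cases pass
```

Real eval suites replace exact match with fuzzier grading (keyword checks, human ratings, or a second model as judge), but the loop is the same: run every case, score each output, and track the aggregate number across model updates to catch regressions.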
Related Guides
Learn more about Evaluation (Evals) in these guides:
What Are AI Evals? Understanding AI Evaluation (Beginner, 7 min read)
Learn what AI evaluations (evals) are, why they matter, and how companies test AI systems to make sure they work correctly and safely.

AI Safety Testing Basics: Finding Problems Before Users Do (Intermediate, 10 min read)
Learn how to test AI systems for safety issues. From prompt injection to bias detection: practical testing approaches that help catch problems before deployment.

Evaluations 201: Golden Sets, Rubrics, and Automated Eval (Advanced, 14 min read)
Build rigorous evaluation systems for AI. Create golden datasets, define rubrics, automate testing, and measure improvements.