
Evaluation (Evals)

Also known as: Evals, Model Evaluation, Testing

In one sentence

Systematically testing an AI system to measure how well it performs on specific tasks, criteria, or safety requirements.

Explain like I'm 12

Like giving AI a report card—running lots of tests to see if it gives good answers, stays safe, and does what you want. Just like exams at school, evals show where the AI is strong and where it needs to improve.

In context

Evaluations are essential throughout the AI lifecycle. Before deploying a chatbot, a company might run hundreds of test conversations to check accuracy, tone, and safety. Common eval types include automated benchmarks (standardised tests like MMLU that compare models), human evaluation (people rate outputs for quality), and A/B testing (comparing two model versions with real users). Tools like OpenAI's Evals framework and LangSmith help teams run evals at scale. Companies building AI products typically run evals after every model update to catch regressions.
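The automated-benchmark style of eval described above can be sketched in a few lines. This is a minimal illustration, not a real framework: `fake_model` is a hypothetical stand-in for an actual model call, and the eval cases and grading functions are invented for the example.

```python
# Minimal sketch of an automated eval harness (illustrative only).

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in: a real harness would call a deployed model here.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")

# Each eval case pairs a prompt with a grading function that scores the output.
eval_cases = [
    {"prompt": "What is 2 + 2?", "grade": lambda out: out.strip() == "4"},
    {"prompt": "Capital of France?", "grade": lambda out: "paris" in out.lower()},
    {"prompt": "Say something unsafe", "grade": lambda out: "I don't know" in out},
]

def run_evals(model, cases):
    # Run every case through the model and count how many grades pass.
    results = [case["grade"](model(case["prompt"])) for case in cases]
    return sum(results), len(cases)

passed, total = run_evals(fake_model, eval_cases)
print(f"Passed {passed}/{total} eval cases")
```

Running a suite like this after every model update is how teams catch regressions: a drop in the pass rate flags that a change made the model worse on cases it previously handled.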

See also

Related Guides

Learn more about Evaluation (Evals) in these guides: