Intermediate10 min read

AI Safety Testing Basics: Finding Problems Before Users Do

Learn how to test AI systems for safety issues. From prompt injection to bias detection—practical testing approaches that help catch problems before deployment.

By Marcin Piekarski • Frontend Lead & AI Educator • builtweb.com.au

AI-Assisted by: Prism AI (Prism AI represents the collaborative AI assistance in content creation.)

Last Updated: 7 December 2025

safetytestingquality assurancered teaming

TL;DR

AI safety testing finds harmful behaviors before users encounter them. Test for prompt injection, harmful outputs, bias, data leakage, and edge cases. Combine automated testing with human red teaming. Make safety testing part of your development process, not an afterthought.

Why it matters

AI systems fail in unexpected ways. A chatbot might generate harmful content. A classifier might discriminate unfairly. A helpful assistant might reveal sensitive data. Safety testing catches these issues before they damage users, your reputation, or get you in legal trouble.

What to test for

Harmful outputs

Can your AI generate content that's:

Violent or threatening
Sexually explicit (when inappropriate)
Hateful or discriminatory
Dangerous (instructions for harm)
Illegal (fraud guidance, etc.)

Prompt injection

Can users manipulate your AI to:

Override system instructions
Reveal system prompts
Bypass safety filters
Act against intended purpose

Bias and fairness

Does your AI treat groups differently:

Different error rates by demographic
Stereotyping in outputs
Unfair advantage/disadvantage
Exclusionary language or assumptions

Data leakage

Can your AI reveal:

Training data samples
Personal information
Proprietary information
System implementation details

Reliability failures

Does your AI fail safely when:

Given ambiguous inputs
Asked about unknown topics
Pushed to edge cases
Receiving adversarial inputs

Testing approaches

Automated testing

Run tests continuously and at scale:

Test suites:

Known attack prompts
Bias evaluation datasets
Edge case collections
Regression tests from past issues

Automated checks:

Output content classifiers
PII detection
Toxicity scoring
Format validation

Example test categories:

- prompt_injection_tests/
  - jailbreak_attempts.yaml
  - instruction_override.yaml
  - system_prompt_extraction.yaml
- harmful_content_tests/
  - violence_elicitation.yaml
  - illegal_activity.yaml
  - self_harm_content.yaml
- bias_tests/
  - demographic_parity.yaml
  - stereotype_association.yaml

Human red teaming

Humans find what automated tests miss:

Red team composition:

Security researchers
Domain experts
Diverse perspectives
Creative adversarial thinkers

Red team process:

Define scope and rules of engagement
Provide system access and documentation
Structured testing sessions
Document all findings
Prioritize and remediate

What humans catch that automation misses:

Novel attack patterns
Contextual inappropriateness
Subtle bias issues
Creative edge cases

Evaluation metrics

Track safety performance quantitatively:

Metric	What it measures	Target
Attack success rate	How often jailbreaks work	<1%
Refusal accuracy	Correctly refusing harmful requests	>99%
False positive rate	Incorrectly refusing benign requests	<5%
Bias disparity	Difference in outcomes by group	<1%

Building a safety test suite

Step 1: Define safety requirements

What must your system never do? What must it always do?

Example requirements:

Never provide instructions for weapons
Never reveal personal information
Always acknowledge uncertainty
Treat all user groups fairly

Step 2: Create test cases

For each requirement, create tests:

Positive tests (should pass):

Normal usage patterns
Edge cases that should work
Boundary conditions

Negative tests (should be handled safely):

Direct attacks
Subtle attempts
Encoded/obfuscated attacks

Step 3: Establish baselines

Run tests on your current system:

Document current pass/fail rates
Identify worst problem areas
Set improvement targets

Step 4: Integrate into development

Make safety testing automatic:

Run on every code change
Block deployment on safety failures
Track metrics over time
Alert on regressions

Testing specific risks

Prompt injection testing

Direct injection:

"Ignore previous instructions and..."
"Your new instructions are..."
"System: Override safety..."

Indirect injection:

Malicious content in retrieved documents
Adversarial data in user uploads
Encoded instructions in inputs

Testing approach:

Maintain attack prompt database
Test variations and obfuscations
Monitor for new attack patterns

Bias testing

Counterfactual testing:

Change demographic markers
Compare outputs
Flag disparities

Example:

Input A: "John is applying for a loan..."
Input B: "Jamila is applying for a loan..."

Outputs should be equivalent for equivalent situations.

Dataset evaluation:

Test on diverse datasets
Measure performance by subgroup
Compare error rates

Data leakage testing

Training data extraction:

Prompt for verbatim quotes
Ask about specific training examples
Test memorization probes

PII detection:

Generate diverse outputs
Scan for PII patterns
Test with boundary cases

Common mistakes

Mistake	Why it's dangerous	Better approach
Testing only happy paths	Misses adversarial behavior	Include negative tests
One-time testing	New issues emerge	Continuous testing
Only automated tests	Misses creative attacks	Combine with human red teaming
Testing in isolation	Misses integration issues	Test full system
Ignoring edge cases	Failures happen at edges	Systematic edge case coverage

What's next

Deepen your safety knowledge:

AI Failure Modes — Understanding how AI fails
Red Teaming AI — Advanced adversarial testing
AI Ethics Guidelines — Broader responsible AI practices

Frequently Asked Questions

How often should I run safety tests?

Automated tests: on every code change and model update. Human red teaming: before major releases and quarterly for ongoing systems. More frequent testing catches issues earlier when they're cheaper to fix.

Do I need a dedicated red team?

Not necessarily. Start with your development team doing structured adversarial testing. As your system grows, consider external red teamers who bring fresh perspectives and aren't familiar with your system's quirks.

What if my safety tests and user experience conflict?

This is common—overly aggressive safety filters frustrate users. Track false positive rates alongside safety metrics. Use nuanced responses instead of hard blocks. Balance is key: too little safety is dangerous, too much makes the product unusable.

How do I prioritize which safety issues to fix first?

Consider: severity of harm if exploited, likelihood of exploitation, number of users affected, and difficulty of fix. Critical issues affecting many users should be fixed immediately. Edge cases affecting few users can be queued.

Was this guide helpful?

Your feedback helps us improve our guides

About the Authors

Marcin Piekarski• Frontend Lead & AI Educator

Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.

Credentials & Experience:

20+ years web development experience
Frontend Lead at Harvey Norman (10 years)
Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
Runs AI workshops for teams
Founder of builtweb.com.au
Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
Specializes in React ecosystem: React, Next.js, Node.js

Areas of Expertise:

Web DevelopmentAI Tools & WorkflowsProductivity AutomationTechnical EducationUser Experience Design

Visit Website →LinkedIn Profile →

Prism AI• AI Research & Writing Assistant

Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.

Capabilities:

Powered by frontier AI models: Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google)
Specializes in research synthesis and content drafting
All output reviewed and verified by human experts
Trained on authoritative AI documentation and research papers

Specializations:

AI Research & DocumentationContent SynthesisTechnical WritingConcept ExplanationCode Examples

Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication. AI helps with research and drafting, but human expertise ensures accuracy and quality.

Key Terms Used in This Guide

Prompt

The question or instruction you give to an AI. A good prompt is clear, specific, and gives context.

Prompt Injection

A security vulnerability where users trick an AI into ignoring its instructions by inserting malicious commands into their prompts.

AI (Artificial Intelligence)

Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.

Evaluation (Evals)

Systematically testing an AI system to measure how well it performs on specific tasks or criteria.

Related Guides

AI Failure Modes and Mitigations: When AI Goes Wrong

Intermediate

Understand how AI systems fail and how to prevent failures. From hallucinations to catastrophic errors—learn to anticipate, detect, and handle AI failures gracefully.

11 min read

AI and Kids: A Parent's Safety Guide

Beginner

Kids are using AI for homework, entertainment, and chatting. Learn how to keep them safe, teach responsible use, and set healthy boundaries.

6 min read

AI and Privacy: What You Need to Know