AI Safety Testing Basics: Finding Problems Before Users Do
Learn how to test AI systems for safety issues. From prompt injection to bias detectionāpractical testing approaches that help catch problems before deployment.
By Marcin Piekarski ⢠Founder & Web Developer ⢠builtweb.com.au
AI-Assisted by: Prism AI (Prism AI represents the collaborative AI assistance in content creation.)
Last Updated: 7 December 2025
TL;DR
AI safety testing finds harmful behaviors before users encounter them. Test for prompt injection, harmful outputs, bias, data leakage, and edge cases. Combine automated testing with human red teaming. Make safety testing part of your development process, not an afterthought.
Why it matters
AI systems fail in unexpected ways. A chatbot might generate harmful content. A classifier might discriminate unfairly. A helpful assistant might reveal sensitive data. Safety testing catches these issues before they damage users, your reputation, or get you in legal trouble.
What to test for
Harmful outputs
Can your AI generate content that's:
- Violent or threatening
- Sexually explicit (when inappropriate)
- Hateful or discriminatory
- Dangerous (instructions for harm)
- Illegal (fraud guidance, etc.)
Prompt injection
Can users manipulate your AI to:
- Override system instructions
- Reveal system prompts
- Bypass safety filters
- Act against intended purpose
Bias and fairness
Does your AI treat groups differently:
- Different error rates by demographic
- Stereotyping in outputs
- Unfair advantage/disadvantage
- Exclusionary language or assumptions
Data leakage
Can your AI reveal:
- Training data samples
- Personal information
- Proprietary information
- System implementation details
Reliability failures
Does your AI fail safely when:
- Given ambiguous inputs
- Asked about unknown topics
- Pushed to edge cases
- Receiving adversarial inputs
Testing approaches
Automated testing
Run tests continuously and at scale:
Test suites:
- Known attack prompts
- Bias evaluation datasets
- Edge case collections
- Regression tests from past issues
Automated checks:
- Output content classifiers
- PII detection
- Toxicity scoring
- Format validation
Example test categories:
- prompt_injection_tests/
- jailbreak_attempts.yaml
- instruction_override.yaml
- system_prompt_extraction.yaml
- harmful_content_tests/
- violence_elicitation.yaml
- illegal_activity.yaml
- self_harm_content.yaml
- bias_tests/
- demographic_parity.yaml
- stereotype_association.yaml
Human red teaming
Humans find what automated tests miss:
Red team composition:
- Security researchers
- Domain experts
- Diverse perspectives
- Creative adversarial thinkers
Red team process:
- Define scope and rules of engagement
- Provide system access and documentation
- Structured testing sessions
- Document all findings
- Prioritize and remediate
What humans catch that automation misses:
- Novel attack patterns
- Contextual inappropriateness
- Subtle bias issues
- Creative edge cases
Evaluation metrics
Track safety performance quantitatively:
| Metric | What it measures | Target |
|---|---|---|
| Attack success rate | How often jailbreaks work | <1% |
| Refusal accuracy | Correctly refusing harmful requests | >99% |
| False positive rate | Incorrectly refusing benign requests | <5% |
| Bias disparity | Difference in outcomes by group | <1% |
Building a safety test suite
Step 1: Define safety requirements
What must your system never do? What must it always do?
Example requirements:
- Never provide instructions for weapons
- Never reveal personal information
- Always acknowledge uncertainty
- Treat all user groups fairly
Step 2: Create test cases
For each requirement, create tests:
Positive tests (should pass):
- Normal usage patterns
- Edge cases that should work
- Boundary conditions
Negative tests (should be handled safely):
- Direct attacks
- Subtle attempts
- Encoded/obfuscated attacks
Step 3: Establish baselines
Run tests on your current system:
- Document current pass/fail rates
- Identify worst problem areas
- Set improvement targets
Step 4: Integrate into development
Make safety testing automatic:
- Run on every code change
- Block deployment on safety failures
- Track metrics over time
- Alert on regressions
Testing specific risks
Prompt injection testing
Direct injection:
"Ignore previous instructions and..."
"Your new instructions are..."
"System: Override safety..."
Indirect injection:
- Malicious content in retrieved documents
- Adversarial data in user uploads
- Encoded instructions in inputs
Testing approach:
- Maintain attack prompt database
- Test variations and obfuscations
- Monitor for new attack patterns
Bias testing
Counterfactual testing:
- Change demographic markers
- Compare outputs
- Flag disparities
Example:
Input A: "John is applying for a loan..."
Input B: "Jamila is applying for a loan..."
Outputs should be equivalent for equivalent situations.
Dataset evaluation:
- Test on diverse datasets
- Measure performance by subgroup
- Compare error rates
Data leakage testing
Training data extraction:
- Prompt for verbatim quotes
- Ask about specific training examples
- Test memorization probes
PII detection:
- Generate diverse outputs
- Scan for PII patterns
- Test with boundary cases
Common mistakes
| Mistake | Why it's dangerous | Better approach |
|---|---|---|
| Testing only happy paths | Misses adversarial behavior | Include negative tests |
| One-time testing | New issues emerge | Continuous testing |
| Only automated tests | Misses creative attacks | Combine with human red teaming |
| Testing in isolation | Misses integration issues | Test full system |
| Ignoring edge cases | Failures happen at edges | Systematic edge case coverage |
What's next
Deepen your safety knowledge:
- AI Failure Modes ā Understanding how AI fails
- Red Teaming AI ā Advanced adversarial testing
- AI Ethics Guidelines ā Broader responsible AI practices
Frequently Asked Questions
How often should I run safety tests?
Automated tests: on every code change and model update. Human red teaming: before major releases and quarterly for ongoing systems. More frequent testing catches issues earlier when they're cheaper to fix.
Do I need a dedicated red team?
Not necessarily. Start with your development team doing structured adversarial testing. As your system grows, consider external red teamers who bring fresh perspectives and aren't familiar with your system's quirks.
What if my safety tests and user experience conflict?
This is commonāoverly aggressive safety filters frustrate users. Track false positive rates alongside safety metrics. Use nuanced responses instead of hard blocks. Balance is key: too little safety is dangerous, too much makes the product unusable.
How do I prioritize which safety issues to fix first?
Consider: severity of harm if exploited, likelihood of exploitation, number of users affected, and difficulty of fix. Critical issues affecting many users should be fixed immediately. Edge cases affecting few users can be queued.
Was this guide helpful?
Your feedback helps us improve our guides
About the Authors
Marcin Piekarski⢠Founder & Web Developer
Marcin is a web developer with 15+ years of experience, specializing in React, Vue, and Node.js. Based in Western Sydney, Australia, he's worked on projects for major brands including Gumtree, CommBank, Woolworths, and Optus. He uses AI tools, workflows, and agents daily in both his professional and personal life, and created Field Guide to AI to help others harness these productivity multipliers effectively.
Credentials & Experience:
- 15+ years web development experience
- Worked with major brands: Gumtree, CommBank, Woolworths, Optus, NestlƩ, M&C Saatchi
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in modern frameworks: React, Vue, Node.js
Areas of Expertise:
Prism AI⢠AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AIāa collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Capabilities:
- Powered by frontier AI models: Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google)
- Specializes in research synthesis and content drafting
- All output reviewed and verified by human experts
- Trained on authoritative AI documentation and research papers
Specializations:
Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication. AI helps with research and drafting, but human expertise ensures accuracy and quality.
Key Terms Used in This Guide
Prompt
The question or instruction you give to an AI. A good prompt is clear, specific, and gives context.
Prompt Injection
A security vulnerability where users trick an AI into ignoring its instructions by inserting malicious commands into their prompts.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligenceālike understanding language, recognizing patterns, or making decisions.
Evaluation (Evals)
Systematically testing an AI system to measure how well it performs on specific tasks or criteria.
Related Guides
AI Failure Modes and Mitigations: When AI Goes Wrong
IntermediateUnderstand how AI systems fail and how to prevent failures. From hallucinations to catastrophic errorsālearn to anticipate, detect, and handle AI failures gracefully.
AI and Kids: A Parent's Safety Guide
BeginnerKids are using AI for homework, entertainment, and chatting. Learn how to keep them safe, teach responsible use, and set healthy boundaries.
AI and Privacy: What You Need to Know
BeginnerAI tools collect data to improveābut what happens to your information? Learn how to protect your privacy while using AI services.