TL;DR

Adversarial attacks fool AI models with small, often imperceptible input perturbations. Defend with adversarial training, input validation, ensemble methods, and monitoring for anomalous inputs.

Attack types

Adversarial examples: Slightly modified inputs that cause misclassification (a sketch of crafting one follows this list)
Data poisoning: Malicious data injected into the training set to corrupt the learned model
Model inversion: Reconstructing sensitive training data from a model's outputs
Backdoor attacks: Hidden trigger patterns planted during training that activate specific behaviors at inference time
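
To make the first item concrete, here is a minimal sketch of the fast gradient sign method (FGSM) for crafting an adversarial example. It assumes a PyTorch classifier over inputs scaled to [0, 1]; the function name and epsilon value are illustrative, not from any particular library.

    import torch
    import torch.nn.functional as F

    def fgsm_example(model, x, y, epsilon=0.03):
        # Fast gradient sign method: perturb the input in the direction
        # that increases the loss, bounded by epsilon per pixel.
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        # Keep the perturbed input in the valid pixel range.
        return x_adv.clamp(0.0, 1.0).detach()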

Defenses

Adversarial training: Train on adversarial examples generated during training (see the training-step sketch after this list)
Input preprocessing: Detect and remove perturbations before inference
Ensemble methods: Multiple models are harder to fool simultaneously
Randomization: Add noise at inference time to disrupt gradient-based attacks
Certified defenses: Provable robustness guarantees within a bounded perturbation
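
A minimal sketch of one adversarial-training step, assuming the fgsm_example helper and imports from the earlier sketch; each batch is perturbed with the current model before the weight update. In practice the adversarial batch is often mixed with the clean one.

    def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
        # Generate adversarial examples from the current model state,
        # then update the model on them.
        x_adv = fgsm_example(model, x, y, epsilon)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()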

Testing robustness

  • Generate adversarial examples against your own model
  • Measure the attack success rate at a fixed perturbation budget (a measurement sketch follows this list)
  • Test across different attack methods (e.g. FGSM, PGD, black-box)
  • Red teaming: have humans probe for failure modes
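
A sketch of measuring attack success rate, again assuming the fgsm_example helper and imports above; "success" here means an input the model classified correctly becomes misclassified after perturbation.

    def attack_success_rate(model, loader, epsilon=0.03):
        # Fraction of correctly classified inputs that the attack flips
        # to a wrong prediction at the given perturbation budget.
        flipped, total = 0, 0
        for x, y in loader:
            with torch.no_grad():
                correct = model(x).argmax(dim=1) == y
            if correct.sum().item() == 0:
                continue
            x_adv = fgsm_example(model, x[correct], y[correct], epsilon)
            with torch.no_grad():
                adv_pred = model(x_adv).argmax(dim=1)
            flipped += (adv_pred != y[correct]).sum().item()
            total += correct.sum().item()
        return flipped / max(total, 1)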

For LLMs