TL;DR

Interpretability techniques explain AI decisions: attention visualization (what the model attends to), feature importance (which inputs matter most), LIME/SHAP (local, per-prediction explanations), and probing (what the model has learned internally).

Why interpretability matters

  • Debug model failures
  • Build trust with users and stakeholders
  • Meet regulatory and compliance requirements
  • Detect bias in model behavior
  • Guide model improvements

Techniques

Attention visualization: See which tokens the model attends to when producing an output
Feature importance: Measure which inputs most influence the model's output
LIME: Explain an individual prediction by fitting a simple local surrogate model around it
SHAP: Game-theoretic feature attribution based on Shapley values (see the first sketch below)
Probing classifiers: Train simple classifiers on internal representations to test what the model has learned (see the second sketch below)
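
A minimal SHAP sketch, assuming scikit-learn's diabetes regression dataset and a random-forest model (both are illustrative stand-ins, not requirements of the library):

```python
# SHAP sketch: attribute a tree model's predictions to input features.
# The dataset and model below are illustrative assumptions.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley-value attributions efficiently for tree models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])  # shape: (100, n_features)

# Local explanation: per-feature contribution to the first prediction
print(dict(zip(X.columns, shap_values[0])))

# Global view: which features drive predictions across the sample
shap.summary_plot(shap_values, X.iloc[:100])
```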

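And a probing-classifier sketch, assuming a frozen GPT-2 encoder from Hugging Face transformers and a toy "mentions an animal" property over hand-written sentences (the model, sentences, and property are all illustrative assumptions):

```python
# Probing sketch: test whether a frozen model's hidden states encode a property.
# GPT-2, the sentences, and the "mentions an animal" label are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tok = AutoTokenizer.from_pretrained("gpt2")
encoder = AutoModel.from_pretrained("gpt2")
encoder.eval()

sentences = ["The cat sleeps.", "Stocks fell sharply.", "A dog barked loudly.",
             "The meeting ran late.", "An owl hooted at night.", "Rain is expected."]
labels = [1, 0, 1, 0, 1, 0]  # 1 = mentions an animal (the probed property)

def embed(text):
    # Mean-pool the final layer's hidden states into one vector per sentence
    with torch.no_grad():
        out = encoder(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()

X = [embed(s) for s in sentences]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.33, random_state=0, stratify=labels
)

# The probe itself: a simple linear classifier on frozen representations.
# Above-chance test accuracy suggests the property is linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```
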
For language models

  • Attention maps
  • Token influence scores
  • Layer-wise analysis
  • Logit lens (decode each layer's hidden state into vocabulary predictions; see the sketch below)
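
A minimal logit-lens sketch, assuming GPT-2 via Hugging Face transformers (the model and prompt are illustrative): each layer's hidden state is pushed through the final layer norm and the unembedding matrix to see what the model would predict at that depth.

```python
# Logit lens sketch: decode each layer's hidden state into a next-token guess.
# Assumes GPT-2; other causal LMs expose different layer-norm / head attribute names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states = (embedding output, layer 1, ..., layer 12) for GPT-2 small
for i, h in enumerate(out.hidden_states):
    # Apply the final layer norm and the LM head to the last token position
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    print(f"layer {i:2d} -> {tok.decode(logits.argmax(dim=-1))!r}")
```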

Challenges

  • Large models have millions to billions of parameters, making their internals hard to inspect directly
  • Post-hoc explanations can look plausible without faithfully reflecting what the model actually computes
  • There is often a trade-off between predictive accuracy and interpretability

Tools

  • Transformers Interpret
  • Captum (PyTorch; see the sketch after this list)
  • SHAP library
  • BertViz (attention visualization)
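
As a taste of Captum, a minimal Integrated Gradients sketch on a toy PyTorch classifier (the two-layer network, input, and baseline are illustrative stand-ins; the same call pattern applies to any differentiable model):

```python
# Captum sketch: Integrated Gradients attributions for a toy PyTorch model.
# The network, input, and zero baseline below are illustrative assumptions.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

x = torch.rand(1, 4)          # one sample with 4 features
baseline = torch.zeros(1, 4)  # reference input representing "absence of signal"

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    x, baselines=baseline, target=1, return_convergence_delta=True
)
print("per-feature attributions toward class 1:", attributions)
print("convergence delta (approximation error):", delta)
```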