TL;DR

Active learning selects the most informative unlabeled examples for labeling, cutting annotation cost. By querying examples the model is uncertain about, samples that diversify coverage of the data distribution, or likely errors, you improve the model with far fewer labels.

Strategies

Uncertainty sampling: Label the examples the model is least confident about (scoring sketch below)
Query-by-committee: Train several models and label the examples they disagree on most (vote-entropy sketch below)
Expected model change: Label the examples whose labels would change the model parameters the most
Diversity sampling: Label examples spread across the input distribution so coverage is not limited to the decision boundary
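
The first two strategies are easy to make concrete. Below is a minimal sketch of the three standard uncertainty scores (least confidence, margin, entropy), assuming only a classifier that exposes predict_proba; the function names are illustrative, not from any particular library.

  import numpy as np

  def least_confidence(probs: np.ndarray) -> np.ndarray:
      """Higher score = less confidence in the top class. probs has shape (n_samples, n_classes)."""
      return 1.0 - probs.max(axis=1)

  def margin_score(probs: np.ndarray) -> np.ndarray:
      """Higher score = smaller gap between the top two class probabilities."""
      top2 = np.partition(probs, -2, axis=1)
      return 1.0 - (top2[:, -1] - top2[:, -2])

  def entropy_score(probs: np.ndarray) -> np.ndarray:
      """Higher score = predictive distribution closer to uniform."""
      return -(probs * np.log(probs + 1e-12)).sum(axis=1)

  # Usage: probs = model.predict_proba(X_pool), then query the K highest-scoring
  # examples, e.g. np.argsort(entropy_score(probs))[-K:].

Query-by-committee can be sketched the same way: bootstrap-train a few copies of the base model and score each pool example by the entropy of the committee's votes (higher = more disagreement). Again, all names here are illustrative.

  import numpy as np
  from sklearn.base import clone
  from sklearn.utils import resample

  def train_committee(base_model, X_lab, y_lab, n_members=5, seed=0):
      """Bootstrap-train n_members clones of base_model on the labeled set."""
      committee = []
      for i in range(n_members):
          Xb, yb = resample(X_lab, y_lab, random_state=seed + i)
          committee.append(clone(base_model).fit(Xb, yb))
      return committee

  def vote_entropy(committee, X_pool, n_classes):
      """Disagreement score per pool example; assumes integer class labels 0..n_classes-1."""
      votes = np.stack([m.predict(X_pool) for m in committee], axis=1)  # (n_samples, n_members)
      scores = np.zeros(len(X_pool))
      for c in range(n_classes):
          frac = (votes == c).mean(axis=1)       # fraction of members voting for class c
          scores -= frac * np.log(frac + 1e-12)  # entropy of the vote distribution
      return scores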

Implementation

  1. Train initial model on small labeled set
  2. Score unlabeled data by informativeness
  3. Select top K examples
  4. Get labels (human or automated)
  5. Retrain model
  6. Repeat until the labeling budget is spent or validation performance plateaus (see the sketch after this list)
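
A minimal end-to-end sketch of that loop with scikit-learn, using entropy as the informativeness score. X_seed/y_seed (the initial labeled set), X_pool (the unlabeled pool), and label_fn (the human or automated oracle) are assumed inputs, not part of any specific platform.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def active_learning_loop(X_seed, y_seed, X_pool, label_fn, k=20, rounds=10):
      """Query the k most uncertain pool examples each round, label them, and retrain."""
      X_lab, y_lab = X_seed.copy(), y_seed.copy()
      model = LogisticRegression(max_iter=1000)
      for _ in range(rounds):
          if len(X_pool) == 0:
              break
          model.fit(X_lab, y_lab)                                # steps 1 and 5: (re)train
          probs = model.predict_proba(X_pool)
          scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # step 2: entropy informativeness
          query_idx = np.argsort(scores)[-k:]                    # step 3: top-k most informative
          new_labels = label_fn(X_pool[query_idx])               # step 4: human or automated labels
          X_lab = np.vstack([X_lab, X_pool[query_idx]])
          y_lab = np.concatenate([y_lab, new_labels])
          X_pool = np.delete(X_pool, query_idx, axis=0)          # drop queried examples from the pool
      return model.fit(X_lab, y_lab)                             # final fit includes the last batch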

Benefits

  • Labeling cost reductions of 50-90% versus random sampling are commonly reported
  • Reaches a useful model with fewer labels and fewer annotation rounds
  • Focuses human effort on the hard, ambiguous cases

Tools

  • modAL (modular active learning framework for Python, built on scikit-learn)
  • Prodigy (annotation tool with built-in active learning)
  • Custom implementation (as sketched above)
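
For reference, a minimal pool-based sketch with modAL's ActiveLearner, following its documented query/teach API; treat the exact signatures as an assumption to verify against the version you install, and note that the synthetic data only stands in for a real labeled seed set and unlabeled pool.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from modAL.models import ActiveLearner
  from modAL.uncertainty import uncertainty_sampling

  # Synthetic stand-in data: 20 labeled seed examples, the rest is the unlabeled pool.
  X, y = make_classification(n_samples=500, n_features=10, random_state=0)
  X_seed, y_seed, X_pool, y_pool = X[:20], y[:20], X[20:], y[20:]

  learner = ActiveLearner(
      estimator=RandomForestClassifier(random_state=0),
      query_strategy=uncertainty_sampling,   # modAL's least-confidence strategy
      X_training=X_seed, y_training=y_seed,
  )

  for _ in range(10):                        # ten query/label/retrain rounds
      query_idx, _ = learner.query(X_pool)
      # In practice a human labels the queried example; the held-back label stands in here.
      learner.teach(X_pool[query_idx], y_pool[query_idx])
      X_pool = np.delete(X_pool, query_idx, axis=0)
      y_pool = np.delete(y_pool, query_idx)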