Active Learning: Smart Data Labeling
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
Active learning is a strategy where your AI model picks the most useful examples for humans to label, instead of labeling everything randomly. This can cut labeling costs by 50-90%, turning a $100K labeling project into a $20K one while getting the same (or better) model performance.
Why it matters
Labeling data is one of the biggest bottlenecks in machine learning. If you're building a model to detect defective products on a factory line, you might have 500,000 images, but getting a human expert to label each one costs real money and real time. At $0.20 per label, that's $100,000. Active learning flips the script: instead of labeling everything, your model tells you which 100,000 images are actually worth labeling. You spend $20,000, and your model learns just as well, sometimes even better, because it focused on the examples that mattered.
This isn't just theory. Companies building real ML products use active learning to ship faster with smaller budgets. If you're working with limited annotation resources (and most teams are), active learning is one of the highest-leverage techniques you can adopt.
How active learning works, step by step
Think of active learning like studying for an exam with a tutor. A bad study strategy is reading every page of every textbook cover to cover. A good study strategy is taking a practice test, identifying the topics you got wrong, and studying those. Active learning works the same way for AI models.
Here's the cycle:
1. Start small. Train an initial model on a small labeled dataset, maybe 1-5% of your total data. The model won't be great, but it doesn't need to be.
2. Score the unlabeled data. Run your model against all the unlabeled examples and score each one by how "useful" it would be to label. This is where the strategy matters (more on that below).
3. Select the best batch. Pick the top K examples, usually a few hundred to a few thousand, based on your scoring.
4. Get labels. Send those examples to human annotators (or domain experts, or a labeling service).
5. Retrain. Add the newly labeled examples to your training set and retrain the model.
6. Repeat. Go back to step 2. Each cycle, your model gets better and more efficient at picking what to label next.
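The cycle above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the "model" is just a learned threshold on a synthetic 1-D task, and every name here is invented for the example.

```python
import random

# Toy active learning loop: the true label is 1 when x > 0.5, and the
# "model" is just a learned decision threshold on that 1-D feature.
random.seed(0)
pool = [random.random() for _ in range(1000)]   # unlabeled pool
labeled = {0.05: 0, 0.95: 1}                    # tiny seed set (step 1)

def train(labeled):
    # "Training": place the threshold midway between the highest example
    # labeled 0 and the lowest example labeled 1.
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    return (max(zeros) + min(ones)) / 2

def uncertainty(threshold, x):
    # Closer to the decision threshold = less certain (step 2).
    return -abs(x - threshold)

for cycle in range(5):                          # step 6: repeat
    threshold = train(labeled)                  # step 5: (re)train
    unlabeled = [x for x in pool if x not in labeled]
    # Step 3: select the top-K most uncertain examples.
    batch = sorted(unlabeled, key=lambda x: uncertainty(threshold, x),
                   reverse=True)[:20]
    # Step 4: "annotate" using the true rule (a human would do this part).
    for x in batch:
        labeled[x] = int(x > 0.5)

print(round(train(labeled), 4))  # the threshold settles near the true 0.5
```

After five cycles the model has spent its whole labeling budget on the examples nearest the decision boundary, which is exactly where the threshold gets refined.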
Most teams run 5-15 cycles before reaching satisfactory performance, and each cycle gives you a model you can evaluate to decide whether to keep going or stop.
Choosing a sampling strategy
The "scoring" in step 2 is where the magic happens. There are several strategies, and the right choice depends on your problem.
Uncertainty sampling
How it works: Label the examples your model is least confident about. If your spam classifier gives an email a 51% chance of being spam, that's a high-value example to label because the model genuinely doesn't know.
Best for: Classification problems where you want to sharpen decision boundaries. This is the most popular strategy because it's simple and effective.
Example: A medical imaging model is 98% sure one scan is healthy but only 52% sure about another. Uncertainty sampling picks the 52% scan first, because that's where the model needs help.
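A minimal sketch of this scoring, using made-up spam probabilities and a least-confidence score:

```python
# Hypothetical predicted P(spam) for five emails from an initial model.
probs = {"a": 0.98, "b": 0.51, "c": 0.07, "d": 0.62, "e": 0.30}

def uncertainty(p):
    # Least-confidence score for a binary classifier: 0.5 means the model
    # is maximally unsure, 0.0 or 1.0 means it is fully confident.
    return 1 - max(p, 1 - p)

# The 51% email ranks first: the model genuinely doesn't know.
ranked = sorted(probs, key=lambda k: uncertainty(probs[k]), reverse=True)
```

For multi-class problems the same idea generalizes (score by the top predicted probability, the margin between the top two, or the entropy of the full distribution).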
Diversity sampling
How it works: Select examples that are spread across the full range of your data, making sure you don't just label variations of the same thing.
Best for: Problems where your data has many distinct clusters or categories. If uncertainty sampling keeps picking examples from one tricky region, diversity sampling makes sure you cover the whole landscape.
Example: You're building a plant disease classifier. Uncertainty sampling might keep picking blurry leaf photos (hard for any model). Diversity sampling ensures you also label examples from underrepresented plant species.
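One common way to implement this is greedy farthest-point selection: repeatedly pick the example farthest from everything you've already chosen. A small sketch with invented 2-D feature vectors, three of which are near-duplicates:

```python
import math

# Hypothetical 2-D feature vectors for unlabeled examples; the first three
# are near-duplicates clustered around (0, 0).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (10.0, 0.0)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def diverse_batch(points, k):
    # Greedy farthest-point selection: each pick maximizes the distance
    # to the nearest already-selected example.
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen

batch = diverse_batch(points, 3)  # skips the near-duplicates
```

With k=3 this selects one point from the duplicate cluster plus the two outliers, instead of three copies of the same thing.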
Query-by-committee
How it works: Train several models (the "committee") on your current data. When they disagree on an unlabeled example, that's a high-value label.
Best for: Situations where you can train multiple models cheaply. The disagreement signal is often stronger than single-model uncertainty.
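A common disagreement measure is vote entropy. A small sketch, assuming a three-model committee with made-up votes on a spam task:

```python
import math
from collections import Counter

# Hypothetical votes from a committee of three models on four unlabeled
# examples.
votes = {
    "x1": ["spam", "spam", "spam"],   # full agreement: low labeling value
    "x2": ["spam", "ham", "spam"],
    "x3": ["spam", "ham", "ham"],
    "x4": ["ham", "spam", "ham"],
}

def vote_entropy(labels):
    # Higher entropy means more committee disagreement.
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Rank by disagreement; the unanimous example ends up last.
ranked = sorted(votes, key=lambda k: vote_entropy(votes[k]), reverse=True)
```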
Combined approaches
In practice, the best results often come from combining strategies. A common approach is to use uncertainty sampling to find confusing examples, then apply diversity filtering to make sure you're not labeling 500 nearly identical confusing examples. This combination often outperforms either strategy on its own.
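Roughly, the combination looks like this. The pool, features, and gap threshold below are all invented for illustration:

```python
# Sketch of uncertainty sampling followed by a diversity filter. The pool is
# hypothetical: (id, 1-D feature, predicted P(positive)); three uncertain
# examples are near-duplicates in feature space, one sits far away.
pool = [
    ("a", 0.00, 0.51), ("b", 0.02, 0.52), ("c", 0.04, 0.50),
    ("d", 9.00, 0.55), ("e", 5.00, 0.97),
]

# Step 1 (uncertainty): keep candidates whose predictions are closest to 0.5.
candidates = sorted(pool, key=lambda r: abs(r[2] - 0.5))[:4]

def diversity_filter(cands, min_gap=1.0):
    # Step 2 (diversity): greedily keep candidates at least min_gap apart
    # in feature space, so we don't label near-identical examples.
    kept = []
    for item in cands:
        if all(abs(item[1] - k[1]) >= min_gap for k in kept):
            kept.append(item)
    return kept

batch = diversity_filter(candidates)  # one of the duplicates, plus the outlier
```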
When active learning helps (and when it doesn't)
Active learning shines when:
- Labeling is expensive (medical experts, legal review, specialized knowledge)
- You have a large pool of unlabeled data (100K+ examples)
- Your model needs to distinguish between subtle differences
- Your budget is fixed and you need the most value per dollar
Active learning is less useful when:
- Labeling is cheap and fast (basic sentiment analysis with crowd workers)
- Your dataset is small enough to label entirely (under 5,000 examples)
- Your data is highly uniform with no hard cases to focus on
- You need labels for other purposes beyond model training (compliance, auditing)
Common mistakes
Labeling only uncertain examples. If you only label the hard cases, your model might develop a skewed view of the data. Always mix in some random examples (10-20% of each batch) to keep the model grounded in the overall data distribution.
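One way to do this mixing, with illustrative names and an assumed 20% random fraction:

```python
import random

# Sketch of mixing exploration into each batch. "uncertain_ranked" is assumed
# to be the pool already sorted by an uncertainty score; names are invented.
random.seed(1)
uncertain_ranked = [f"u{i}" for i in range(100)]
rest_of_pool = [f"r{i}" for i in range(1000)]

def build_batch(ranked, pool, size=50, random_frac=0.2):
    # 80% most-uncertain examples plus 20% random draws from the wider
    # pool, so the model stays grounded in the overall data distribution.
    n_random = int(size * random_frac)
    return ranked[: size - n_random] + random.sample(pool, n_random)

batch = build_batch(uncertain_ranked, rest_of_pool)
```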
Not having a quality baseline. Without a held-out test set labeled independently, you can't tell if active learning is actually helping. Always set aside a random test set before you start.
Batches that are too small. Labeling 10 examples, retraining, and labeling 10 more is inefficient because of retraining costs. Batches of 200-1,000 usually hit the sweet spot between efficiency and information gain.
Ignoring annotator disagreement. When human labelers disagree on an example, that's valuable signal. Those examples might be genuinely ambiguous, and your model needs to handle them gracefully. Don't just take the majority vote and move on.
Stopping too early. Active learning has diminishing returns, but many teams stop before those returns fully diminish. Track your model's performance on the test set after each cycle and set a clear stopping criterion (such as "stop when accuracy improves less than 0.5% per cycle").
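A stopping criterion like that is only a few lines of code. The accuracy history below is invented:

```python
# Hypothetical held-out accuracy after each active learning cycle.
history = [0.71, 0.78, 0.83, 0.855, 0.859, 0.861]

def should_stop(history, min_gain=0.005):
    # Stop once the last cycle improved accuracy by less than min_gain
    # (the "less than 0.5% per cycle" rule from the text).
    return len(history) >= 2 and history[-1] - history[-2] < min_gain
```

A smoother variant averages the gain over the last few cycles, so one noisy evaluation doesn't end the project prematurely.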
What's next?
Active learning connects to several related concepts worth exploring:
- AI Evaluation Metrics — Measuring whether your model is actually improving after each active learning cycle
- Data Labeling Best Practices — How to run annotation projects efficiently
- Fine-Tuning Fundamentals — The training process that active learning feeds into
Frequently Asked Questions
How much money can active learning actually save?
Typical savings are 50-90% of labeling costs. A project that would cost $100K to label randomly might need only $10-20K with active learning. The exact savings depend on your data: the more redundant your unlabeled pool, the more active learning helps.
Do I need a lot of labeled data to start active learning?
No. You start with a small seed set, often just 100-500 labeled examples. The whole point is to be strategic about which additional examples get labeled. The seed set just needs to be large enough to train an initial (rough) model.
Can I use active learning with deep learning models?
Yes, though it requires some care. Deep learning models can be overconfident in their predictions, making uncertainty estimates unreliable. Techniques like Monte Carlo dropout or ensembles help produce better uncertainty scores for deep learning active learning loops.
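For instance, you can run several stochastic forward passes per example and use the spread of the predictions as the uncertainty score. The probabilities below are invented for illustration:

```python
import statistics

# Hypothetical P(healthy) from five stochastic forward passes (e.g. Monte
# Carlo dropout) for two scans; the spread across passes is the signal.
passes = {
    "scan_1": [0.97, 0.98, 0.96, 0.99, 0.97],  # stable: confident
    "scan_2": [0.30, 0.70, 0.55, 0.20, 0.80],  # unstable: uncertain
}

# Use the standard deviation across passes as an uncertainty score.
scores = {k: statistics.pstdev(v) for k, v in passes.items()}
most_uncertain = max(scores, key=scores.get)
```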
What tools support active learning workflows?
Prodigy (by the spaCy team) combines annotation with active learning built in. Label Studio has active learning integrations. For custom setups, frameworks like modAL (Python) provide the building blocks. Many teams build custom pipelines on top of their existing ML infrastructure.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Training
The process of feeding large amounts of data to an AI system so it learns patterns, relationships, and rules, enabling it to make predictions or generate output.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Machine Learning (ML)
A branch of artificial intelligence where computers learn patterns from data and improve at tasks through experience, rather than following explicitly programmed rules.
Related Guides
- Continual Learning: Models That Keep Learning (Advanced, 6 min read) — Train models on new data without forgetting old knowledge. Continual learning strategies for evolving AI systems.
- Machine Learning Fundamentals: How Machines Learn from Data (Beginner, 11 min read) — Understand the basics of machine learning. From training to inference—a practical introduction to how ML systems work without deep math or coding.
- Supervised vs Unsupervised Learning: When to Use Which (Beginner, 9 min read) — Understand the difference between supervised and unsupervised learning. Learn when to use each approach with practical examples and decision frameworks.