Training Data Quality: Garbage In, Garbage Out
AI quality depends on training data quality. Learn what makes good training data, common issues, and how to evaluate it.
TL;DR
AI is only as good as its training data. High-quality data is accurate, representative, diverse, and properly labeled. Poor data leads to biased, inaccurate, or brittle models.
What is training data?
Definition:
The dataset used to teach an AI model patterns and relationships.
For language models:
- Billions of words from books, websites, articles
- Quality varies widely
For image models:
- Millions of labeled images
- "Cat," "dog," "car," etc.
For supervised learning:
- Input-output pairs
- Model learns the mapping
Characteristics of good training data
Accurate:
- Labels are correct
- Information is factual
- Few errors and little noise
Representative:
- Covers the real-world distribution
- Not skewed toward one subset
- Matches deployment conditions
Diverse:
- Multiple perspectives, styles, examples
- Different demographics, contexts
- Avoids narrow patterns
Sufficient volume:
- Enough examples to learn patterns
- More data usually improves performance
- Diminishing returns after a point
Properly labeled:
- Clear, consistent annotations
- Labeled by experts when needed
- Quality over quantity
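One way to put a number on "consistent annotations" is to have two annotators label the same items and measure their agreement. The sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name and the sample labels are illustrative assumptions, not part of any particular labeling tool.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same six images.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog", "cat", "dog"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

A kappa near 1 suggests the labeling guidelines are being applied consistently; a value near 0 means agreement is no better than chance, which usually points to unclear guidelines.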
Common data quality issues
Bias:
- Over-representation of certain groups
- Reflects societal biases
- Results in unfair predictions
Noise:
- Mislabeled examples
- Random errors
- Degrades performance
Duplication:
- Same data appears multiple times
- Model memorizes repeated examples instead of generalizing
- Inflates metrics when duplicates appear in both training and test sets
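A minimal sketch of duplicate detection, assuming text examples and treating two texts as duplicates when they match after lowercasing and whitespace normalization. The helper names are hypothetical, and near-duplicates (paraphrases, small edits) would need fuzzier matching than this.

```python
import hashlib

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

def leaked_examples(train_texts, test_texts):
    """Return test examples whose normalized form also appears in training."""
    train_keys = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_keys]

# Hypothetical toy data.
train = ["The cat sat.", "the  cat sat.", "Dogs bark."]
test = ["Dogs  bark.", "Birds sing."]
unique_train = deduplicate(train)
leaks = leaked_examples(train, test)
print(unique_train)  # ['The cat sat.', 'Dogs bark.']
print(leaks)         # ['Dogs  bark.']
```

Checking for overlap between the training and test sets matters as much as deduplicating within one set, because a test example the model has already seen makes the reported score look better than it is.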
Imbalance:
- One class dominates (e.g., 99% negative, 1% positive)
- Model may learn to ignore the minority class
- Needs rebalancing techniques such as resampling or class weighting
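To make the 99%/1% case concrete, the sketch below shows why plain accuracy is misleading on imbalanced data and computes inverse-frequency class weights, one common rebalancing technique. The function names are illustrative, and the weighting formula (total examples divided by number of classes times the class count) is one reasonable choice among several.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

# Hypothetical dataset: 99 negatives, 1 positive.
labels = ["negative"] * 99 + ["positive"] * 1
baseline = majority_baseline_accuracy(labels)
weights = inverse_frequency_weights(labels)
print(baseline)             # 0.99
print(weights["positive"])  # 50.0
```

A model that never predicts the positive class still scores 99% accuracy here, which is why per-class metrics and weighting (or resampling) are needed to make minority-class errors count.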
Outdated data:
- The world changes, but a fixed dataset doesn't
- Model gives stale answers
- Needs periodic updates
Real-world examples
Good data:
- Wikipedia for general knowledge
- Curated books for language
- Expert-labeled medical images
Bad data:
- Scraped social media (often full of spam and bias)
- Unreviewed auto-generated labels (error-prone)
- Data from a single source (not representative)
How to evaluate training data
Questions to ask:
- Where did it come from?
- How was it labeled?
- Is it representative?
- How old is it?
- What biases might it contain?
Red flags:
- Unknown provenance
- Automatic labeling without review
- Single demographic or source
- Unfiltered internet data
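The question "Is it representative?" can be partly checked with numbers when you know some attribute (source, language, demographic group) for both the training data and a sample of real usage. The sketch below compares the two distributions using total variation distance, where 0 means identical proportions and 1 means no overlap. The attribute values and the idea of a "deployment sample" are assumptions for illustration.

```python
from collections import Counter

def proportions(values):
    """Share of each distinct value in the list."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}

def total_variation_distance(train_values, deploy_values):
    """Half the sum of absolute differences between the two distributions."""
    p = proportions(train_values)
    q = proportions(deploy_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical source tags for training data and for real-world traffic.
train_sources = ["news"] * 8 + ["forum"] * 2
deploy_sources = ["news"] * 5 + ["forum"] * 5
gap = total_variation_distance(train_sources, deploy_sources)
print(round(gap, 2))  # 0.3
```

A large gap on an attribute that matters for the task is a signal to collect more data from the under-represented part, not proof of a problem on its own.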
Data augmentation
- Creating variations of existing data
- Examples: Rotate images, paraphrase text
- Increases diversity without new collection
- Helps with small datasets
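As a rough illustration of image augmentation, the sketch below treats an image as a grid of pixel values (a list of rows) and produces rotated and flipped variants that keep the original label. Real pipelines typically use an image library; the plain-list representation and function names here are simplifying assumptions.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid of pixels left to right."""
    return [row[::-1] for row in image]

def augment(image, label):
    """Return the original plus two variants, all with the same label."""
    variants = [image, rotate_90(image), flip_horizontal(image)]
    return [(variant, label) for variant in variants]

# Hypothetical 2x2 "image" labeled as a cat.
image = [[1, 2],
         [3, 4]]
examples = augment(image, "cat")
print(examples[1][0])  # [[3, 1], [4, 2]]
print(examples[2][0])  # [[2, 1], [4, 3]]
```

Augmentations are only safe when they preserve the label: a rotated cat is still a cat, but a rotated handwritten "6" can become a "9".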
Data cleaning
Steps:
- Remove duplicates
- Fix mislabeled examples
- Filter low-quality samples
- Balance class distribution
- Standardize formats
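Most of the steps above can be sketched as a small pipeline over labeled text records. Everything here is an assumption for illustration: the record shape (`text` and `label` keys), the "too few words" quality rule, and balancing by downsampling to the smallest class. Fixing mislabeled examples is left out because it generally needs human review.

```python
import random
from collections import defaultdict

def standardize(record):
    """Collapse whitespace in the text and normalize the label's case."""
    return {
        "text": " ".join(record["text"].split()),
        "label": record["label"].strip().lower(),
    }

def clean(records, min_words=3, seed=0):
    standardized = [standardize(r) for r in records]
    # Filter low-quality samples (hypothetical rule: too few words).
    filtered = [r for r in standardized if len(r["text"].split()) >= min_words]
    # Remove duplicates, keeping the first occurrence of each text.
    seen, unique = set(), []
    for r in filtered:
        key = r["text"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Balance classes by downsampling each one to the smallest class size.
    by_label = defaultdict(list)
    for r in unique:
        by_label[r["label"]].append(r)
    if not by_label:
        return []
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced

# Hypothetical raw records.
records = [
    {"text": "Great   product, works well", "label": " Positive "},
    {"text": "great product, works well", "label": "positive"},
    {"text": "Bad", "label": "negative"},
    {"text": "Stopped working after a week", "label": "Negative"},
    {"text": "Love it so much", "label": "positive"},
]
cleaned = clean(records)
print(len(cleaned))  # 2
```

Downsampling discards data, so on small datasets class weighting or collecting more minority examples may be a better choice than the balancing step shown here.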
Garbage in, garbage out
Example 1:
- Hiring AI trained on historical hires (all male engineers)
- Result: Biased against female candidates
Example 2:
- Medical AI trained only on one demographic
- Result: Poor performance on others
Example 3:
- Chatbot trained on toxic internet forums
- Result: Offensive, harmful outputs
Best practices
- Audit training data thoroughly
- Seek diverse, representative data
- Invest in quality labeling
- Document data sources and limitations
- Monitor for bias
- Update data regularly
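One simple way to monitor for bias is to break a model's accuracy down by group instead of reporting a single overall number. The sketch below assumes you have true labels, predictions, and a group tag for each example; the names and data are illustrative.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group tag."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation set with two groups, A and B.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]
per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group["A"])  # 1.0
```

Overall accuracy in this toy example is about 67%, which hides the fact that the model is right every time for group A and only one time in three for group B. A gap like that often traces back to how each group was represented in the training data.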
What's next
- Bias Detection and Mitigation
- Model Evaluation
- Responsible AI Development
Key Terms Used in This Guide
Training
The process of feeding data to an AI system so it learns patterns and improves its predictions over time.
Training Data
The collection of examples an AI system learns from. The quality, quantity, and diversity of training data directly determine what the AI can and cannot do.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence, like understanding language, recognizing patterns, or making decisions.