Training Data Quality: Garbage In, Garbage Out
AI quality depends on training data quality. Learn what makes good training data, common issues, and how to evaluate it.
TL;DR
AI is only as good as its training data. High-quality data is accurate, representative, diverse, and properly labeled. Poor data leads to biased, inaccurate, or brittle models.
What is training data?
Definition:
The dataset used to teach an AI model patterns and relationships.
For language models:
- Billions to trillions of words from books, websites, and articles
- Quality varies widely
For image models:
- Millions of labeled images
- "Cat," "dog," "car," etc.
For supervised learning:
- Input-output pairs
- Model learns the mapping
Characteristics of good training data
Accurate:
- Labels are correct
- Information is factual
- Minimal errors and noise
Representative:
- Covers the real-world distribution
- Not skewed toward one subset
- Matches deployment conditions
Diverse:
- Multiple perspectives, styles, examples
- Different demographics, contexts
- Avoids narrow patterns
Sufficient volume:
- Enough examples to learn patterns
- More data usually improves performance
- Diminishing returns after a point
Properly labeled:
- Clear, consistent annotations
- Labeled by experts when needed
- Quality over quantity
Common data quality issues
Bias:
- Over-representation of certain groups
- Reflects societal biases
- Results in unfair predictions
Noise:
- Mislabeled examples
- Random errors
- Degrades performance
Duplication:
- Same data appears multiple times
- Model may memorize repeated examples instead of generalizing
- Inflates evaluation metrics when duplicates appear in both training and test sets
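One way duplication inflates metrics is when the same example sits in both the training and test sets. The sketch below is a minimal illustration, not a production deduplicator: it assumes text examples held in plain Python lists, and the helper names `normalize`, `fingerprint`, and `find_leaked` are hypothetical. It only catches exact matches after lowercasing and whitespace normalization, so near-duplicates would need a fuzzier method.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    """Hash the normalized text so large datasets can be compared cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked(train: list[str], test: list[str]) -> list[str]:
    """Return test examples whose normalized text also appears in train."""
    train_hashes = {fingerprint(t) for t in train}
    return [t for t in test if fingerprint(t) in train_hashes]


# Illustrative data only.
train = ["The cat sat.", "Dogs bark loudly.", "the  cat sat."]
test = ["THE CAT SAT.", "Birds sing."]

leaked = find_leaked(train, test)
print(leaked)  # ['THE CAT SAT.']
```

Any example reported here should be removed from one of the two sets before evaluation, otherwise the test score partly measures memorization.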
Imbalance:
- One class dominates (e.g., 99% negative, 1% positive)
- Model may learn to ignore the minority class
- Needs rebalancing techniques (e.g., resampling or class weighting)
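Class weighting is one common rebalancing technique. The sketch below assumes labels are available as a plain list and uses the inverse-frequency heuristic `n_samples / (n_classes * count)`; the function name `balanced_class_weights` is hypothetical. It reuses the 99% negative, 1% positive split from the example above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Rare classes get large weights, so mistakes on them cost more
    during training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative data: 99 negatives, 1 positive.
labels = ["negative"] * 99 + ["positive"] * 1
weights = balanced_class_weights(labels)

print(weights["positive"])            # 50.0
print(round(weights["negative"], 3))  # 0.505
```

These weights would typically be passed to a loss function or a training API that accepts per-class weights. Resampling (oversampling the minority class or undersampling the majority) is an alternative when weights are not supported.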
Outdated data:
- The world changes, but the data doesn't
- Model gives stale answers
- Needs periodic updates
Real-world examples
Good data:
- Wikipedia for general knowledge
- Curated books for language
- Expert-labeled medical images
Bad data:
- Scraped social media (often full of spam and bias)
- Auto-generated labels without review (often wrong)
- Data from a single source (not representative)
How to evaluate training data
Questions to ask:
- Where did it come from?
- How was it labeled?
- Is it representative?
- How old is it?
- What biases might it contain?
Red flags:
- Unknown provenance
- Automatic labeling without review
- Single demographic or source
- Unfiltered internet data
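Some of the questions and red flags above can be checked automatically. The sketch below is a rough starting point under stated assumptions: each record is a dict with hypothetical `label`, `source`, and `year` fields, and the `max_source_share` threshold of 0.75 is an arbitrary illustrative value. It cannot judge bias or labeling quality on its own; it only surfaces things worth a closer look.

```python
from collections import Counter


def audit(records, max_source_share=0.75):
    """Summarize labels, sources, and age, and list simple warnings."""
    if not records:
        raise ValueError("no records to audit")

    total = len(records)
    labels = Counter(r.get("label") for r in records)
    sources = Counter(r.get("source") or "unknown" for r in records)
    years = [r["year"] for r in records if r.get("year") is not None]

    warnings = []
    if sources.get("unknown", 0) > 0:
        warnings.append("some records have unknown provenance")
    top_source, top_count = sources.most_common(1)[0]
    if top_count / total > max_source_share:
        warnings.append(
            f"source '{top_source}' supplies over {max_source_share:.0%} of records"
        )

    return {
        "labels": dict(labels),
        "sources": dict(sources),
        "year_range": (min(years), max(years)) if years else None,
        "warnings": warnings,
    }


# Illustrative records only; field names are assumptions.
records = [
    {"label": "spam", "source": "forum_a", "year": 2019},
    {"label": "ham", "source": "forum_a", "year": 2020},
    {"label": "ham", "source": "forum_a", "year": 2021},
    {"label": "ham", "source": "forum_a", "year": 2021},
    {"label": "ham", "source": None, "year": 2023},
]

report = audit(records)
for warning in report["warnings"]:
    print("WARNING:", warning)
```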
Data augmentation
- Creating variations of existing data
- Examples: Rotate images, paraphrase text
- Increases diversity without new collection
- Helps with small datasets
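As a minimal sketch of image augmentation, the code below treats an image as a nested list of pixel values and produces rotated and flipped variants. The function names are hypothetical, and real pipelines would normally use an image library instead. Whether a transform keeps the label valid depends on the task: rotating a photo of a cat is usually safe, while rotating a handwritten 6 can turn it into a 9.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D grid of pixels left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the original image plus two simple variants."""
    return [image, rotate_90(image), flip_horizontal(image)]


# Illustrative 2x2 "image".
image = [[1, 2],
         [3, 4]]

variants = augment(image)
print(variants[1])  # [[3, 1], [4, 2]]
print(variants[2])  # [[2, 1], [4, 3]]
```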
Data cleaning
Steps:
- Remove duplicates
- Fix mislabeled examples
- Filter low-quality samples
- Balance class distribution
- Standardize formats
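Three of the steps above (standardizing formats, filtering low-quality samples, and removing duplicates) can be sketched in a few lines. This assumes records are `(text, label)` pairs, and the `clean` function and its `min_chars` threshold are hypothetical choices for illustration. Fixing mislabeled examples and balancing classes are left out because they usually need human review or task-specific decisions.

```python
def clean(records, min_chars=5):
    """Standardize, filter, and deduplicate (text, label) pairs."""
    seen = set()
    cleaned = []
    for text, label in records:
        text = " ".join(text.split())   # standardize whitespace
        label = label.strip().lower()   # standardize label format
        if len(text) < min_chars:       # filter very short samples
            continue
        key = (text.lower(), label)
        if key in seen:                 # remove duplicates
            continue
        seen.add(key)
        cleaned.append((text, label))
    return cleaned


# Illustrative data only.
raw = [
    ("Great product,  would buy again", "Positive"),
    ("great product, would buy again", "positive "),
    ("ok", "neutral"),
    ("Arrived broken and late", "negative"),
]

cleaned = clean(raw)
print(cleaned)
# [('Great product, would buy again', 'positive'), ('Arrived broken and late', 'negative')]
```

Because the duplicate key includes the label, the same text with two different labels is kept twice. That is deliberate in this sketch: conflicting labels are a sign of possible mislabeling and are better reviewed than silently dropped.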
Garbage in, garbage out
Example 1:
- Hiring AI trained on historical hires (all male engineers)
- Result: Biased against female candidates
Example 2:
- Medical AI trained only on one demographic
- Result: Poor performance on others
Example 3:
- Chatbot trained on toxic internet forums
- Result: Offensive, harmful outputs
Best practices
- Audit training data thoroughly
- Seek diverse, representative data
- Invest in quality labeling
- Document data sources and limitations
- Monitor for bias
- Update data regularly
What's next
- Bias Detection and Mitigation
- Model Evaluation
- Responsible AI Development