TL;DR

AI is only as good as its training data. High-quality data is accurate, representative, diverse, and properly labeled. Poor data leads to biased, inaccurate, or brittle models.

What is training data?

Definition:
The dataset used to teach an AI model patterns and relationships.

For language models:

  • Hundreds of billions to trillions of words from books, websites, and articles
  • Quality varies widely

For image models:

  • Millions of labeled images
  • "Cat," "dog," "car," etc.

For supervised learning:

  • Input-output pairs
  • Model learns the mapping
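
A minimal sketch of what such input-output pairs can look like for a toy sentiment task; the texts and labels are invented for illustration:

  # Toy supervised-learning dataset: each example pairs an input with a label.
  # Texts and labels are invented for illustration only.
  training_pairs = [
      ("The battery lasts all day", "positive"),
      ("Screen cracked after a week", "negative"),
      ("Fast shipping, works as described", "positive"),
      ("Stopped charging after a month", "negative"),
  ]

  # A model trained on pairs like these learns to map a new input
  # to a label, e.g. "Great camera, easy to use" -> "positive".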

Characteristics of good training data

Accurate:

  • Labels are correct
  • Information is factual
  • Minimal errors or noise

Representative:

  • Covers the real-world distribution
  • Not skewed toward one subset
  • Matches deployment conditions

Diverse:

  • Multiple perspectives, styles, examples
  • Different demographics, contexts
  • Avoids narrow patterns

Sufficient volume:

  • Enough examples to learn patterns
  • More data usually = better performance
  • Diminishing returns after a point

Properly labeled:

  • Clear, consistent annotations
  • Labeled by experts when needed
  • Quality over quantity

Common data quality issues

Bias:

  • Over-representation of certain groups
  • Reflects societal biases
  • Results in unfair predictions

Noise:

  • Mislabeled examples
  • Random errors
  • Degrades performance

Duplication:

  • Same data appears multiple times
  • Model memorizes instead of generalizing
  • Inflates metrics

Imbalance:

  • One class dominates (e.g., 99% negative, 1% positive)
  • Model ignores minority class
  • Needs rebalancing techniques
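
One common rebalancing technique is random oversampling of the minority class; a minimal sketch in Python, assuming the labels live in a plain list:

  import random

  # Toy labels mirroring the imbalance above: 99% negative, 1% positive.
  labels = ["negative"] * 990 + ["positive"] * 10
  examples = list(range(len(labels)))  # stand-ins for the real inputs

  positives = [x for x, y in zip(examples, labels) if y == "positive"]
  negatives = [x for x, y in zip(examples, labels) if y == "negative"]

  # Duplicate minority-class examples until the classes are the same size.
  oversampled = positives * (len(negatives) // len(positives))
  balanced = negatives + oversampled
  random.shuffle(balanced)

  print(len(negatives), len(oversampled))  # 990 990

Undersampling the majority class or weighting the loss per class are alternatives; naive oversampling can encourage memorization of the duplicated examples, so it is often combined with augmentation.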

Outdated data:

  • World changes, data doesn't
  • Model gives stale answers
  • Needs periodic updates

Real-world examples

Good data:

  • Wikipedia for general knowledge
  • Curated books for language
  • Expert-labeled medical images

Bad data:

  • Scraped social media (full of spam, bias)
  • Auto-generated labels (often wrong)
  • Data from single source (not representative)

How to evaluate training data

Questions to ask:

  • Where did it come from?
  • How was it labeled?
  • Is it representative?
  • How old is it?
  • What biases might it contain?

Red flags:

  • Unknown provenance
  • Automatic labeling without review
  • Single demographic or source
  • Unfiltered internet data
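
One lightweight way to force these questions to be answered is to attach a small "data card" to each dataset before training; a minimal sketch, with illustrative fields rather than a standard schema:

  # Minimal "data card": field names and values are illustrative only.
  data_card = {
      "source": "product reviews exported from an internal support system",
      "collection_period": "2022-01 to 2023-06",
      "labeling_method": "two annotators per example, disagreements adjudicated",
      "known_skews": ["English only", "over-represents power users"],
  }

  # Red-flag check: refuse to train on data with unknown provenance or labeling.
  required = ("source", "labeling_method", "collection_period")
  missing = [field for field in required if not data_card.get(field)]
  if missing:
      raise ValueError(f"Dataset fails audit; missing: {missing}")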

Data augmentation

  • Creating variations of existing data
  • Examples: Rotate images, paraphrase text
  • Increases diversity without new collection
  • Helps with small datasets
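
A minimal sketch of both kinds of augmentation, assuming the Pillow package is available for the image case; the synonym table in the text case is invented for illustration:

  import random
  from PIL import Image  # assumes Pillow is installed

  # Image augmentation: small random rotations of an existing labeled image.
  def rotate_variants(path, n=3, max_degrees=15):
      original = Image.open(path)
      return [original.rotate(random.uniform(-max_degrees, max_degrees))
              for _ in range(n)]

  # Text augmentation: crude paraphrasing by synonym substitution.
  # Real pipelines use richer methods (back-translation, LLM rewrites).
  SYNONYMS = {"great": "excellent", "bad": "poor", "fast": "quick"}

  def paraphrase(text):
      return " ".join(SYNONYMS.get(word, word) for word in text.split())

Each variant keeps the original label, so a small labeled set can be stretched without collecting or annotating new examples.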

Data cleaning

Steps:

  1. Remove duplicates
  2. Fix mislabeled examples
  3. Filter low-quality samples
  4. Balance class distribution
  5. Standardize formats
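
A minimal sketch of this pipeline using pandas, assuming the data is a DataFrame with illustrative text and label columns; fixing mislabeled examples (step 2) usually needs manual or model-assisted review, so it is left out here:

  import pandas as pd

  def clean(df: pd.DataFrame) -> pd.DataFrame:
      # 1. Remove exact duplicates.
      df = df.drop_duplicates(subset=["text"])

      # 3. Filter low-quality samples (empty or very short text).
      df = df[df["text"].str.strip().str.len() >= 10]

      # 4. Balance the class distribution by downsampling to the smallest class.
      smallest = df["label"].value_counts().min()
      df = (df.groupby("label", group_keys=False)
              .apply(lambda g: g.sample(smallest, random_state=0)))

      # 5. Standardize formats (here: lowercase and collapse whitespace).
      df["text"] = df["text"].str.lower().str.split().str.join(" ")
      return df.reset_index(drop=True)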

Garbage in, garbage out

Example 1:

  • Hiring AI trained on historical hires (mostly male engineers)
  • Result: Biased against female candidates

Example 2:

  • Medical AI trained only on one demographic
  • Result: Poor performance on others

Example 3:

  • Chatbot trained on toxic internet forums
  • Result: Offensive, harmful outputs

Best practices

  1. Audit training data thoroughly
  2. Seek diverse, representative data
  3. Invest in quality labeling
  4. Document data sources and limitations
  5. Monitor for bias
  6. Update data regularly

What's next