TL;DR

Data preparation is 80% of AI work but often gets 20% of the attention. Clean, well-structured data is essential for AI success. Key steps: understand your data, clean up problems, transform features, and validate quality. Investing here pays dividends in model performance.

Why it matters

AI is only as good as its data. Garbage in, garbage out. AI projects stumble far more often on data issues than on algorithm choice. Proper preparation is the foundation that everything else builds on.

The data preparation pipeline

Overview

Raw data → Understanding → Cleaning → Transformation → Validation → AI-ready data

Time allocation

Phase            Typical time
Understanding    10-15%
Cleaning         30-40%
Transformation   20-30%
Validation       10-15%
Modeling         15-25%

Yes, data prep is most of the work.

Understanding your data

Exploratory analysis

Before changing anything, understand what you have:

Basic statistics:

  • Row and column counts
  • Data types
  • Missing values
  • Unique values

Distributions:

  • Numeric variable ranges
  • Category frequencies
  • Outliers and anomalies

Relationships:

  • Correlations
  • Patterns over time
  • Group differences
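
A minimal exploratory pass, assuming the data loads into a pandas DataFrame (the file name and the plan_type column are hypothetical):

  import pandas as pd

  df = pd.read_csv("customers.csv")        # hypothetical file

  # Basic statistics
  print(df.shape)                          # row and column counts
  print(df.dtypes)                         # data types
  print(df.isna().sum())                   # missing values per column
  print(df.nunique())                      # unique values per column

  # Distributions
  print(df.describe())                     # ranges and quartiles for numeric columns
  print(df["plan_type"].value_counts())    # category frequencies (hypothetical column)

  # Relationships
  print(df.corr(numeric_only=True))        # pairwise correlations between numeric columns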

Key questions

  • What does each field mean?
  • What values are valid?
  • What's the data source?
  • How was it collected?
  • What might be wrong?

Cleaning data

Handling missing values

Options:

Approach                    When to use
Remove rows                 Few missing, random
Remove columns              Most values missing
Impute mean/median          Numeric, missing at random
Impute mode                 Categorical
Create "missing" category   Missing may be meaningful
Model-based imputation      Important to fill accurately
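
A sketch of a few of these options in pandas, continuing with the df from the exploration step (age, notes, income, and segment are hypothetical columns; pick one strategy per column rather than stacking them):

  # Remove rows: only a few values missing, at random
  df = df.dropna(subset=["age"])

  # Remove columns: most values missing
  df = df.drop(columns=["notes"])

  # Impute median for a numeric column
  df["income"] = df["income"].fillna(df["income"].median())

  # Keep missingness as its own category when it may carry meaning
  df["segment"] = df["segment"].fillna("missing")

  # For model-based imputation, see e.g. sklearn.impute.KNNImputer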

Fixing errors

Common issues:

  • Typos and inconsistent spelling
  • Wrong data types
  • Invalid values
  • Duplicate records
  • Formatting inconsistencies

Approaches:

  • Standardize formats
  • Use validation rules
  • Flag and review outliers
  • Deduplicate carefully
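
A few of these fixes in pandas, with hypothetical column names:

  import pandas as pd

  # Standardize text formats (whitespace, case)
  df["country"] = df["country"].str.strip().str.lower()

  # Fix wrong data types; invalid dates become NaT for later review
  df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

  # Apply a validation rule for invalid values
  df = df[df["age"].between(0, 120)]

  # Deduplicate carefully, keeping the first record per key
  df = df.drop_duplicates(subset=["customer_id"], keep="first")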

Handling outliers

Detection:

  • Statistical methods (IQR, z-scores)
  • Visual inspection
  • Domain knowledge

Treatment:

  • Verify if real or error
  • Keep if legitimate
  • Cap or remove if error
  • Consider robust methods
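
For example, the IQR and z-score rules on a hypothetical income column might look like this; capping is shown only for values confirmed to be errors:

  # IQR rule: flag points beyond 1.5 * IQR from the quartiles
  q1, q3 = df["income"].quantile([0.25, 0.75])
  iqr = q3 - q1
  iqr_outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

  # z-score rule: flag points more than 3 standard deviations from the mean
  z = (df["income"] - df["income"].mean()) / df["income"].std()
  z_outliers = df[z.abs() > 3]

  # After review: cap erroneous values rather than silently dropping rows
  df["income"] = df["income"].clip(upper=q3 + 1.5 * iqr)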

Transforming data

Feature engineering

Create inputs the model can learn from more easily.

Common transformations:

  • Date → day of week, month, is_weekend
  • Text → word count, keywords, embeddings
  • Categories → one-hot encoding
  • Numbers → binning, normalization
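
A couple of these transformations in pandas (signup_date and review_text are hypothetical columns):

  import pandas as pd

  # Date → day of week, month, is_weekend
  df["signup_date"] = pd.to_datetime(df["signup_date"])
  df["signup_dow"] = df["signup_date"].dt.dayofweek      # 0 = Monday
  df["signup_month"] = df["signup_date"].dt.month
  df["is_weekend"] = df["signup_date"].dt.dayofweek >= 5

  # Text → word count
  df["review_word_count"] = df["review_text"].str.split().str.len()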

Normalization and scaling

Put features on similar scales:

Method     How it works    When to use
Min-max    Scale to 0-1    Bounded range needed
Standard   Mean=0, std=1   Distance-based methods
Log        Log transform   Skewed distributions
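
A sketch with scikit-learn, assuming the data has already been split into train and test frames (see the train/test split section below); fitting the scaler on the training set only is what keeps the test set honest. The age and income columns are hypothetical:

  import numpy as np
  from sklearn.preprocessing import MinMaxScaler, StandardScaler

  # Min-max: scale to the 0-1 range
  minmax = MinMaxScaler().fit(train[["age"]])
  train["age_scaled"] = minmax.transform(train[["age"]]).ravel()
  test["age_scaled"] = minmax.transform(test[["age"]]).ravel()

  # Standard: mean 0, standard deviation 1
  standard = StandardScaler().fit(train[["income"]])
  train["income_std"] = standard.transform(train[["income"]]).ravel()
  test["income_std"] = standard.transform(test[["income"]]).ravel()

  # Log transform for skewed, non-negative values
  train["income_log"] = np.log1p(train["income"])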

Encoding categories

Convert categories to numbers:

One-hot encoding:

  • Color: red, blue, green
  • → is_red, is_blue, is_green

Label encoding:

  • For ordinal categories
  • low=0, medium=1, high=2
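
Both encodings in pandas, with hypothetical color and priority columns:

  import pandas as pd

  # One-hot encoding: color → is_red, is_blue, is_green
  df = pd.get_dummies(df, columns=["color"], prefix="is")

  # Label encoding for an ordinal column, where the order is meaningful
  priority_order = {"low": 0, "medium": 1, "high": 2}
  df["priority_encoded"] = df["priority"].map(priority_order)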

Train/test split

Separate data for training and evaluation:

Rules:

  • Split before fitting scalers, imputers, or encoders (prevents leakage)
  • Typical: 70-80% train, 20-30% test
  • Add validation set if tuning
  • Time-based split for temporal data
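
A typical random split with scikit-learn; the commented lines sketch a time-based alternative (label and date are hypothetical columns):

  from sklearn.model_selection import train_test_split

  X = df.drop(columns=["label"])
  y = df["label"]

  # 80/20 random split; stratify keeps class proportions similar in both sets
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42, stratify=y
  )

  # Time-based split for temporal data: train on the past, test on the future
  # cutoff = df["date"].quantile(0.8)
  # train, test = df[df["date"] <= cutoff], df[df["date"] > cutoff]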

Validation

Quality checks

After preparation, verify:

  • No data leakage from test to train
  • Missing values handled
  • No unexpected values
  • Distributions as expected
  • Features properly encoded
  • Labels are correct
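
These checks can be cheap assertions run right before modeling; a sketch using the hypothetical names from earlier steps:

  # No overlap between train and test rows (guards against leakage)
  assert X_train.index.intersection(X_test.index).empty, "train/test rows overlap"

  # Missing values handled
  assert X_train.isna().sum().sum() == 0, "unhandled missing values"

  # Scaled feature stays in its expected range
  assert X_train["age_scaled"].between(0, 1).all(), "age_scaled out of range"

  # Labels contain only expected values
  assert set(y_train.unique()) <= {0, 1}, "unexpected label values"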

Documentation

Record what you did:

  • Decisions made and why
  • Transformations applied
  • Data issues found
  • Assumptions made

Common mistakes

Mistake                  Impact            Prevention
Leaking test data        Invalid results   Split first
Ignoring missing data    Model failures    Explicit handling
Over-cleaning            Loss of signal    Preserve real variation
Under-documenting        Can't reproduce   Document everything
Assuming data is clean   Hidden problems   Always validate

What's next

Continue building AI data skills: