TL;DR

Feature engineering transforms raw data into inputs that ML models can use effectively. Good features often matter more than fancy algorithms. The process involves selecting relevant attributes, transforming them appropriately, and creating new features that capture important patterns.

Why it matters

Raw data is rarely in the right format for ML. Dates are just text, categories need encoding, and the most predictive information might be hidden in combinations of fields. Feature engineering bridges the gap between data and model—often the difference between a mediocre model and a great one.

What are features?

Features defined

Features are the input variables your model uses to make predictions:

Example: Predicting house prices

  • Raw data: Address, description, photos, listing date
  • Features: Square footage, bedrooms, bathrooms, age, neighborhood, distance to transit

Features are what you feed to the model. Better features = better predictions.

Feature types

Type         Examples                Encoding needed
-----------  ----------------------  ------------------------
Numerical    Age, price, count       Usually ready to use
Categorical  Color, category, type   Needs encoding
Text         Description, reviews    Needs embedding/encoding
Temporal     Dates, timestamps       Extract components
Boolean      Yes/no, true/false      Convert to 0/1

Core feature engineering techniques

Handling numerical features

Scaling:
Features often have very different ranges. Scaling puts them on comparable scales so no single feature dominates; a short scikit-learn sketch follows the lists below.

  • Min-max scaling: Transform to 0-1 range
  • Standardization: Transform to mean=0, std=1
  • Log transformation: Handle skewed distributions

When to scale:

  • Distance-based algorithms (KNN, clustering)
  • Neural networks
  • Regularized models
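
A minimal scikit-learn sketch of all three transformations (the DataFrame and column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58], "income": [30_000, 72_000, 540_000]})

minmax = MinMaxScaler().fit_transform(df)      # each column mapped to the 0-1 range
standard = StandardScaler().fit_transform(df)  # each column to mean=0, std=1
df["log_income"] = np.log1p(df["income"])      # log transform for the skewed column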

Binning:
Convert continuous values to categories (see the sketch after the list):

  • Age → age groups (18-25, 26-35, etc.)
  • Income → income brackets
  • Useful when the relationship with the target is non-linear
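
One way to bin with pandas (the boundaries and labels are illustrative):

import pandas as pd

ages = pd.Series([19, 24, 31, 47, 63])
age_group = pd.cut(ages,
                   bins=[17, 25, 35, 50, 100],
                   labels=["18-25", "26-35", "36-50", "51+"])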

Handling categorical features

One-hot encoding:
Create binary columns for each category:

Color: [red, blue, green]
→
is_red: [1, 0, 0]
is_blue: [0, 1, 0]
is_green: [0, 0, 1]
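
In pandas this is a one-liner, mirroring the color example above:

import pandas as pd

colors = pd.Series(["red", "blue", "green"])
onehot = pd.get_dummies(colors, prefix="is")   # columns: is_blue, is_green, is_red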

Label encoding:
Assign numbers to categories:

Color: [red, blue, green] → [0, 1, 2]

Use only when the categories have a natural order (e.g., small < medium < large); otherwise the model may read meaning into arbitrary numbers.
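
A sketch using an explicit mapping, which keeps the ordering under your control (the size example is illustrative):

import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])
order = {"small": 0, "medium": 1, "large": 2}   # encode the natural order explicitly
size_code = sizes.map(order)                    # [0, 2, 1, 0]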

Target encoding:
Replace category with average target value:

City → Average house price in that city

Careful: this leaks the target if the averages are computed over the full dataset. Compute them on training data only, as in the sketch below.
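
A leakage-aware sketch: the per-city means are fit on training rows only, then reused for the test set (data values are illustrative):

import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B"],
                      "price": [300_000, 340_000, 150_000, 170_000]})
test = pd.DataFrame({"city": ["A", "B"]})

city_means = train.groupby("city")["price"].mean()    # fit on training data only
train["city_encoded"] = train["city"].map(city_means)
test["city_encoded"] = test["city"].map(city_means)   # reuse the training mapping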

Handling dates and times

Extract meaningful components:

  • Year, month, day, day of week
  • Hour, minute (for timestamps)
  • Is weekend, is holiday
  • Days since event
  • Time until next event

Example transformations:

Date: 2024-03-15
→
year: 2024
month: 3
day_of_week: 5 (Friday)
is_weekend: 0
quarter: 1
days_since_start: 75
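
The same transformations in pandas (note that pandas numbers weekdays from Monday=0, so we add 1 to match the example):

import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-03-15"]))

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek + 1,        # Monday=1 ... Friday=5
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
    "quarter": dates.dt.quarter,
    "days_since_start": dates.dt.dayofyear,       # days since the start of the year
})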

Handling text

Simple approaches:

  • Word counts
  • Character counts
  • Presence of keywords

Advanced approaches:

  • TF-IDF weighting
  • Word embeddings (e.g., word2vec)
  • Embeddings from pretrained language models
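
A sketch of both levels using scikit-learn (the example strings are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["great product, works well", "terrible, broke after a week"]

word_counts = [len(r.split()) for r in reviews]   # simple: word count per review

tfidf = TfidfVectorizer()                         # advanced: TF-IDF weighted vectors
X = tfidf.fit_transform(reviews)                  # sparse matrix, one row per review
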
Creating new features

Interaction features

Combine features to capture relationships:

Products:

bed_bath_product = bedrooms * bathrooms

Ratios:

price_per_sqft = price / square_feet
bedroom_bathroom_ratio = bedrooms / bathrooms

Combinations:

location_size = neighborhood + "_" + size_category
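
The same interactions as pandas columns (the values, and the bed_bath_product name, are illustrative):

import pandas as pd

df = pd.DataFrame({"price": [450_000], "square_feet": [1_500],
                   "bedrooms": [3], "bathrooms": [2],
                   "neighborhood": ["midtown"], "size_category": ["large"]})

df["bed_bath_product"] = df["bedrooms"] * df["bathrooms"]
df["price_per_sqft"] = df["price"] / df["square_feet"]
df["bedroom_bathroom_ratio"] = df["bedrooms"] / df["bathrooms"]
df["location_size"] = df["neighborhood"] + "_" + df["size_category"]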

Aggregation features

Summarize related data (a pandas sketch follows the lists):

For customer prediction:

  • Total purchases last 30 days
  • Average order value
  • Number of returns
  • Days since last purchase

For time series:

  • Rolling averages
  • Rolling min/max
  • Trend indicators
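
A pandas sketch of both flavors (the tables and window size are illustrative):

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "order_value": [40.0, 60.0, 25.0]})

# Per-customer aggregates for customer prediction.
agg = orders.groupby("customer_id")["order_value"].agg(["sum", "mean", "count"])

# Rolling statistics for a time series.
sales = pd.Series([10, 12, 9, 14, 15])
rolling_mean = sales.rolling(window=3).mean()
rolling_max = sales.rolling(window=3).max()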

Domain-specific features

Use domain knowledge to create meaningful features:

Real estate:

  • Price per square foot
  • School district quality
  • Crime rate in area
  • Walk score

E-commerce:

  • Items viewed before purchase
  • Cart abandonment rate
  • Seasonal purchase patterns

Finance:

  • Debt-to-income ratio
  • Payment history score
  • Account age

Feature selection

Not all features help. Some hurt performance or add unnecessary complexity.

Why select features?

  • Remove noise
  • Reduce overfitting
  • Speed up training
  • Improve interpretability
  • Reduce storage/compute costs

Selection approaches

Filter methods:
Score each feature independently of the others (see the sketch after the list):

  • Correlation with target
  • Statistical tests
  • Information gain
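
A filter-method sketch with scikit-learn (synthetic data; k is illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Keep the 5 features with the highest mutual information with the target.
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)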

Wrapper methods:
Evaluate whole feature subsets by training models on them (sketch after the list):

  • Forward selection (add features one at a time)
  • Backward elimination (remove features one at a time)
  • Recursive feature elimination
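
Recursive feature elimination in scikit-learn (synthetic data; the estimator and feature count are illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Repeatedly drop the weakest feature until 5 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
kept_mask = rfe.support_   # boolean mask over the original features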

Embedded methods:
Selection happens as part of model training itself (sketch after the list):

  • Lasso regularization (drives unimportant weights to zero)
  • Tree-based feature importance
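
A Lasso sketch showing coefficients driven to zero (synthetic data; alpha is illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # indices of features with nonzero weight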

Common pitfalls

Data leakage

Using information that wouldn't be available at prediction time:

Problem:

Predicting: Will customer churn?
Feature: customer_churned_flag  # This IS the answer!

Subtle leakage:

  • Using future data for past predictions
  • Aggregates including target period
  • Features derived from target

Prevention (see the sketch after the list):

  • Think carefully about what's available at prediction time
  • Split data before feature engineering
  • Review features for leakage
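
One concrete prevention: split first, then fit any transformer on the training portion only (the data here is synthetic):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # the test set is never seen here
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)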

High cardinality

Too many categories:

Problem:
One-hot encoding of 10,000 categories = 10,000 columns

Solutions (a sketch of the first follows the list):

  • Group rare categories
  • Use embeddings
  • Target encoding
  • Hash encoding
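
A sketch of grouping rare categories (the threshold is illustrative):

import pandas as pd

cities = pd.Series(["nyc", "nyc", "la", "la", "boise", "fargo"])

counts = cities.value_counts()
rare = counts[counts < 2].index                      # categories seen fewer than 2 times
grouped = cities.where(~cities.isin(rare), "other")  # collapse them into one bucket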

Missing values

Handle thoughtfully:

Options (a sketch combining two of them follows the list):

  • Remove rows (if few missing)
  • Impute with mean/median/mode
  • Impute with model predictions
  • Create "is_missing" indicator
  • Use models that handle missing values
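
A sketch pairing an is_missing indicator with median imputation (the column name is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan]})

df["income_missing"] = df["income"].isna().astype(int)    # indicator before imputing
df["income"] = df["income"].fillna(df["income"].median())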

Feature engineering workflow

Process

  1. Explore data: Understand distributions, relationships
  2. Clean data: Handle missing values, outliers
  3. Transform features: Encode, scale, normalize
  4. Create features: Interactions, aggregations, domain features
  5. Select features: Remove unhelpful features
  6. Validate: Ensure no leakage, test performance

Iteration

Feature engineering is iterative:

  • Start with basic features
  • Train model, evaluate
  • Analyze errors
  • Create features to address errors
  • Repeat

Common mistakes

Mistake                    Problem                     Solution
-------------------------  --------------------------  --------------------------------
Data leakage               Overly optimistic results   Strict temporal splits
Over-engineering           Complexity without benefit  Start simple, add as needed
Ignoring domain knowledge  Missing obvious features    Consult domain experts
Not handling missing data  Model errors or bias        Explicit missing data strategy
Scaling before splitting   Data leakage                Fit scaler on training data only

What's next

Continue building ML skills: