TL;DR

Feature engineering transforms raw data into inputs that ML models can use effectively. Good features often matter more than fancy algorithms. The process involves selecting relevant attributes, transforming them appropriately, and creating new features that capture important patterns.

Why it matters

Raw data is rarely in the right format for ML. Dates are just text, categories need encoding, and the most predictive information might be hidden in combinations of fields. Feature engineering bridges the gap between data and model—often the difference between a mediocre model and a great one.

What are features?

Features defined

Features are the input variables your model uses to make predictions:

Example: Predicting house prices

  • Raw data: Address, description, photos, listing date
  • Features: Square footage, bedrooms, bathrooms, age, neighborhood, distance to transit

Features are what you feed to the model. Better features = better predictions.

Feature types

Type         Examples                Encoding needed
-----------  ----------------------  ------------------------
Numerical    Age, price, count       Usually ready to use
Categorical  Color, category, type   Needs encoding
Text         Description, reviews    Needs embedding/encoding
Temporal     Dates, timestamps       Extract components
Boolean      Yes/no, true/false      Convert to 0/1

Core feature engineering techniques

Handling numerical features

Scaling:
Features often have very different ranges. Scaling puts them on comparable scales so no single feature dominates; a short scikit-learn sketch follows the lists below.

  • Min-max scaling: Transform to 0-1 range
  • Standardization: Transform to mean=0, std=1
  • Log transformation: Handle skewed distributions

When to scale:

  • Distance-based algorithms (KNN, clustering)
  • Neural networks
  • Regularized models
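
A minimal scikit-learn sketch of all three transformations (the DataFrame and column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58], "income": [30_000, 72_000, 540_000]})

minmax = MinMaxScaler().fit_transform(df)      # each column mapped to the 0-1 range
standard = StandardScaler().fit_transform(df)  # each column to mean=0, std=1
df["log_income"] = np.log1p(df["income"])      # log transform for the skewed column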

Binning:
Convert continuous values to categories (see the sketch after the list):

  • Age → age groups (18-25, 26-35, etc.)
  • Income → income brackets
  • Useful when the relationship with the target is non-linear
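
One way to bin with pandas (the boundaries and labels are illustrative):

import pandas as pd

ages = pd.Series([19, 24, 31, 47, 63])
age_group = pd.cut(ages,
                   bins=[17, 25, 35, 50, 100],
                   labels=["18-25", "26-35", "36-50", "51+"])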

Handling categorical features

One-hot encoding:
Create binary columns for each category:

Color: [red, blue, green]
→
is_red: [1, 0, 0]
is_blue: [0, 1, 0]
is_green: [0, 0, 1]
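
In pandas this is a one-liner, mirroring the color example above:

import pandas as pd

colors = pd.Series(["red", "blue", "green"])
onehot = pd.get_dummies(colors, prefix="is")   # columns: is_blue, is_green, is_red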

Label encoding:
Assign numbers to categories:

Color: [red, blue, green] → [0, 1, 2]

Use only when the categories have a natural order (e.g., small < medium < large); otherwise the model may read meaning into arbitrary numbers.
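
A sketch using an explicit mapping, which keeps the ordering under your control (the size example is illustrative):

import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])
order = {"small": 0, "medium": 1, "large": 2}   # encode the natural order explicitly
size_code = sizes.map(order)                    # [0, 2, 1, 0]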

Target encoding:
Replace category with average target value:

City → Average house price in that city

Careful: this leaks the target if the averages are computed over the full dataset. Compute them on training data only, as in the sketch below.
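
A leakage-aware sketch: the per-city means are fit on training rows only, then reused for the test set (data values are illustrative):

import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B"],
                      "price": [300_000, 340_000, 150_000, 170_000]})
test = pd.DataFrame({"city": ["A", "B"]})

city_means = train.groupby("city")["price"].mean()    # fit on training data only
train["city_encoded"] = train["city"].map(city_means)
test["city_encoded"] = test["city"].map(city_means)   # reuse the training mapping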

Handling dates and times

Extract meaningful components:

  • Year, month, day, day of week
  • Hour, minute (for timestamps)
  • Is weekend, is holiday
  • Days since event
  • Time until next event

Example transformations:

Date: 2024-03-15
→
year: 2024
month: 3
day_of_week: 5 (Friday)
is_weekend: 0
quarter: 1
days_since_start: 75
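
The same transformations in pandas (note that pandas numbers weekdays from Monday=0, so we add 1 to match the example):

import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-03-15"]))

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek + 1,        # Monday=1 ... Friday=5
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
    "quarter": dates.dt.quarter,
    "days_since_start": dates.dt.dayofyear,       # days since the start of the year
})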

Handling text

Simple approaches:

  • Word counts
  • Character counts
  • Presence of keywords

Advanced approaches:

  • TF-IDF weighting
  • Word embeddings (e.g., word2vec)
  • Embeddings from pretrained language models
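
A sketch of both levels using scikit-learn (the example strings are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["great product, works well", "terrible, broke after a week"]

word_counts = [len(r.split()) for r in reviews]   # simple: word count per review

tfidf = TfidfVectorizer()                         # advanced: TF-IDF weighted vectors
X = tfidf.fit_transform(reviews)                  # sparse matrix, one row per review
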
Creating new features

Interaction features

Combine features to capture relationships:

Products:

bed_bath_product = bedrooms * bathrooms

Ratios:

price_per_sqft = price / square_feet
bedroom_bathroom_ratio = bedrooms / bathrooms

Combinations:

location_size = neighborhood + "_" + size_category
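
The same interactions as pandas columns (the values, and the bed_bath_product name, are illustrative):

import pandas as pd

df = pd.DataFrame({"price": [450_000], "square_feet": [1_500],
                   "bedrooms": [3], "bathrooms": [2],
                   "neighborhood": ["midtown"], "size_category": ["large"]})

df["bed_bath_product"] = df["bedrooms"] * df["bathrooms"]
df["price_per_sqft"] = df["price"] / df["square_feet"]
df["bedroom_bathroom_ratio"] = df["bedrooms"] / df["bathrooms"]
df["location_size"] = df["neighborhood"] + "_" + df["size_category"]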

Aggregation features

Summarize related data (a pandas sketch follows the lists):

For customer prediction:

  • Total purchases last 30 days
  • Average order value
  • Number of returns
  • Days since last purchase

For time series:

  • Rolling averages
  • Rolling min/max
  • Trend indicators
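
A pandas sketch of both flavors (the tables and window size are illustrative):

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "order_value": [40.0, 60.0, 25.0]})

# Per-customer aggregates for customer prediction.
agg = orders.groupby("customer_id")["order_value"].agg(["sum", "mean", "count"])

# Rolling statistics for a time series.
sales = pd.Series([10, 12, 9, 14, 15])
rolling_mean = sales.rolling(window=3).mean()
rolling_max = sales.rolling(window=3).max()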

Domain-specific features

Use domain knowledge to create meaningful features:

Real estate:

  • Price per square foot
  • School district quality
  • Crime rate in area
  • Walk score

E-commerce:

  • Items viewed before purchase
  • Cart abandonment rate
  • Seasonal purchase patterns

Finance:

  • Debt-to-income ratio
  • Payment history score
  • Account age

Feature selection

Not all features help. Some hurt performance or add unnecessary complexity.

Why select features?

  • Remove noise
  • Reduce overfitting
  • Speed up training
  • Improve interpretability
  • Reduce storage/compute costs

Selection approaches

Filter methods:
Score each feature independently of the others (see the sketch after the list):

  • Correlation with target
  • Statistical tests
  • Information gain
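
A filter-method sketch with scikit-learn (synthetic data; k is illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Keep the 5 features with the highest mutual information with the target.
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)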

Wrapper methods:
Evaluate whole feature subsets by training models on them (sketch after the list):

  • Forward selection (add features one at a time)
  • Backward elimination (remove features one at a time)
  • Recursive feature elimination
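
Recursive feature elimination in scikit-learn (synthetic data; the estimator and feature count are illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Repeatedly drop the weakest feature until 5 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
kept_mask = rfe.support_   # boolean mask over the original features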

Embedded methods:
Selection happens as part of model training itself (sketch after the list):

  • Lasso regularization (drives unimportant weights to zero)
  • Tree-based feature importance
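
A Lasso sketch showing coefficients driven to zero (synthetic data; alpha is illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # indices of features with nonzero weight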

Common pitfalls

Data leakage

Using information that wouldn't be available at prediction time:

Problem:

Predicting: Will customer churn?
Feature: customer_churned_flag  # This IS the answer!

Subtle leakage:

  • Using future data for past predictions
  • Aggregates including target period
  • Features derived from target

Prevention (see the sketch after the list):

  • Think carefully about what's available at prediction time
  • Split data before feature engineering
  • Review features for leakage
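
One concrete prevention: split first, then fit any transformer on the training portion only (the data here is synthetic):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # the test set is never seen here
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)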

High cardinality

Too many categories:

Problem:
One-hot encoding of 10,000 categories = 10,000 columns

Solutions (a sketch of the first follows the list):

  • Group rare categories
  • Use embeddings
  • Target encoding
  • Hash encoding
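
A sketch of grouping rare categories (the threshold is illustrative):

import pandas as pd

cities = pd.Series(["nyc", "nyc", "la", "la", "boise", "fargo"])

counts = cities.value_counts()
rare = counts[counts < 2].index                      # categories seen fewer than 2 times
grouped = cities.where(~cities.isin(rare), "other")  # collapse them into one bucket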

Missing values

Handle thoughtfully:

Options (a sketch combining two of them follows the list):

  • Remove rows (if few missing)
  • Impute with mean/median/mode
  • Impute with model predictions
  • Create "is_missing" indicator
  • Use models that handle missing values
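
A sketch pairing an is_missing indicator with median imputation (the column name is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan]})

df["income_missing"] = df["income"].isna().astype(int)    # indicator before imputing
df["income"] = df["income"].fillna(df["income"].median())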

Feature engineering workflow

Process

  1. Explore data: Understand distributions, relationships
  2. Clean data: Handle missing values, outliers
  3. Transform features: Encode, scale, normalize
  4. Create features: Interactions, aggregations, domain features
  5. Select features: Remove unhelpful features
  6. Validate: Ensure no leakage, test performance

Iteration

Feature engineering is iterative:

  • Start with basic features
  • Train model, evaluate
  • Analyze errors
  • Create features to address errors
  • Repeat

Common mistakes

Mistake                    Problem                     Solution
-------------------------  --------------------------  --------------------------------
Data leakage               Overly optimistic results   Strict temporal splits
Over-engineering           Complexity without benefit  Start simple, add as needed
Ignoring domain knowledge  Missing obvious features    Consult domain experts
Not handling missing data  Model errors or bias        Explicit missing data strategy
Scaling before splitting   Data leakage                Fit scaler on training data only

What's next

Continue building ML skills: