Feature Engineering Basics: Preparing Data for Machine Learning
Learn how to transform raw data into useful features for machine learning. Practical techniques for creating better inputs that improve model performance.
By Marcin Piekarski • Founder & Web Developer • builtweb.com.au
AI-Assisted by: Prism AI (the collaborative AI assistance used in creating this content)
Last Updated: 7 December 2025
TL;DR
Feature engineering transforms raw data into inputs that ML models can use effectively. Good features often matter more than fancy algorithms. The process involves selecting relevant attributes, transforming them appropriately, and creating new features that capture important patterns.
Why it matters
Raw data is rarely in the right format for ML. Dates are just text, categories need encoding, and the most predictive information might be hidden in combinations of fields. Feature engineering bridges the gap between data and model, often the difference between a mediocre model and a great one.
What are features?
Features defined
Features are the input variables your model uses to make predictions:
Example: Predicting house prices
- Raw data: Address, description, photos, listing date
- Features: Square footage, bedrooms, bathrooms, age, neighborhood, distance to transit
Features are what you feed to the model. Better features = better predictions.
Feature types
| Type | Examples | Encoding needed |
|---|---|---|
| Numerical | Age, price, count | Usually ready to use |
| Categorical | Color, category, type | Needs encoding |
| Text | Description, reviews | Needs embedding/encoding |
| Temporal | Dates, timestamps | Extract components |
| Boolean | Yes/no, true/false | Convert to 0/1 |
Core feature engineering techniques
Handling numerical features
Scaling:
Different features often span very different ranges (age in tens, income in tens of thousands). Scaling puts them on comparable scales so no single feature dominates; a short code sketch follows the lists below.
- Min-max scaling: Transform to 0-1 range
- Standardization: Transform to mean=0, std=1
- Log transformation: Handle skewed distributions
When to scale:
- Distance-based algorithms (KNN, clustering)
- Neural networks
- Regularized models
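A minimal sketch of the three transforms above, using pandas, NumPy, and scikit-learn (the columns and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58, 41],
                   "income": [30_000, 85_000, 220_000, 64_000]})

# Min-max scaling: squeeze values into the 0-1 range
df["age_minmax"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# Standardization: mean 0, standard deviation 1
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation: compress a right-skewed distribution
df["income_log"] = np.log1p(df["income"])  # log1p also handles zeros
```

In a real project, fit the scaler on the training split only; see the "Common mistakes" table later in this guide.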
Binning:
Convert continuous values to categories:
- Age → age groups (18-25, 26-35, etc.)
- Income → income brackets
- Useful when the relationship with the target is non-linear (see the sketch below)
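A short pandas sketch of fixed-width binning (the bin edges here are arbitrary):

```python
import pandas as pd

ages = pd.Series([19, 24, 31, 47, 63])

# Fixed bin edges with readable labels; intervals are right-inclusive
age_groups = pd.cut(ages,
                    bins=[17, 25, 35, 50, 120],
                    labels=["18-25", "26-35", "36-50", "50+"])
print(age_groups.tolist())  # ['18-25', '18-25', '26-35', '36-50', '50+']
```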
Handling categorical features
One-hot encoding:
Create binary columns for each category:
Color: [red, blue, green]
↓
is_red: [1, 0, 0]
is_blue: [0, 1, 0]
is_green: [0, 0, 1]
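With pandas, one-hot encoding is a one-liner; a minimal sketch reproducing the example above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green"]})

# One binary column per category (cast to int for 0/1 output)
one_hot = pd.get_dummies(df["color"], prefix="is").astype(int)
print(one_hot)
#    is_blue  is_green  is_red
# 0        0         0       1
# 1        1         0       0
# 2        0         1       0
```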
Label encoding:
Assign numbers to categories:
Color: [red, blue, green] → [0, 1, 2]
Use when categories have a natural order (e.g., small < medium < large); otherwise the numbers imply a ranking that doesn't exist.
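A minimal sketch using an explicit mapping (the size ordering is a hypothetical example):

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# An explicit mapping preserves the intended order
size_order = {"small": 0, "medium": 1, "large": 2}
sizes_encoded = sizes.map(size_order)  # 0, 2, 1, 0
```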
Target encoding:
Replace category with average target value:
City → Average house price in that city
Careful: Can cause data leakage if the averages are computed on data that includes your validation or test rows.
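A hedged sketch of target encoding that computes the averages on the training split only (the cities and prices are invented for illustration):

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["Sydney", "Sydney", "Perth", "Perth", "Hobart"],
    "price": [900_000, 1_100_000, 600_000, 650_000, 550_000],
})
test = pd.DataFrame({"city": ["Sydney", "Hobart", "Darwin"]})

# Learn the encoding from the training split only (avoids leakage)
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

train["city_encoded"] = train["city"].map(city_means)
# Categories unseen in training fall back to the global mean
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)
```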
Handling dates and times
Extract meaningful components:
- Year, month, day, day of week
- Hour, minute (for timestamps)
- Is weekend, is holiday
- Days since event
- Time until next event
Example transformations:
Date: 2024-03-15
↓
year: 2024
month: 3
day_of_week: 5 (Friday)
is_weekend: 0
quarter: 1
days_since_start: 75
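A pandas sketch of these extractions (note that pandas numbers weekdays Monday=0 through Sunday=6, unlike the 1-7 numbering above, and counts elapsed days rather than day-of-year):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-03-15", "2024-07-02"]))
start = pd.Timestamp("2024-01-01")

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,            # Monday=0 ... Sunday=6
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
    "quarter": dates.dt.quarter,
    "days_since_start": (dates - start).dt.days,  # elapsed days since start
})
```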
Handling text
Simple approaches:
- Word counts
- Character counts
- Presence of keywords
Advanced approaches:
- TF-IDF vectors
- Word embeddings
- Sentence embeddings
- LLM embeddings
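As a minimal sketch, scikit-learn's TfidfVectorizer covers both word counting and TF-IDF weighting (the review strings are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great product, fast shipping",
    "terrible quality, slow shipping",
    "great quality, great price",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)  # sparse matrix: one row per document
print(X.shape)                         # (3, number of unique terms)
print(vectorizer.get_feature_names_out())
```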
Creating new features
Interaction features
Combine features to capture relationships:
Ratios:
price_per_sqft = price / square_feet
bedroom_bathroom_ratio = bedrooms / bathrooms
Combinations:
location_size = neighborhood + "_" + size_category
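The same interaction features in pandas (all columns and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [500_000, 750_000],
    "square_feet": [1_000, 1_500],
    "bedrooms": [3, 4],
    "bathrooms": [2, 2],
    "neighborhood": ["inner_west", "north_shore"],
    "size_category": ["medium", "large"],
})

# Ratios capture relationships no single column holds
df["price_per_sqft"] = df["price"] / df["square_feet"]
df["bedroom_bathroom_ratio"] = df["bedrooms"] / df["bathrooms"]

# String combination creates a joint categorical feature
df["location_size"] = df["neighborhood"] + "_" + df["size_category"]
```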
Aggregation features
Summarize related data:
For customer prediction:
- Total purchases last 30 days
- Average order value
- Number of returns
- Days since last purchase
For time series:
- Rolling averages
- Rolling min/max
- Trend indicators
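A short pandas sketch of both kinds of aggregation (the order data is invented):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "order_value": [50.0, 80.0, 120.0, 40.0, 60.0],
})

# Per-customer aggregates
customer_features = orders.groupby("customer_id")["order_value"].agg(
    total_spend="sum", avg_order_value="mean", num_orders="count"
)

# Rolling features for a time series
sales = pd.Series([10, 12, 9, 15, 14, 18])
rolling_avg = sales.rolling(window=3).mean()  # NaN for the first two points
```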
Domain-specific features
Use domain knowledge to create meaningful features:
Real estate:
- Price per square foot
- School district quality
- Crime rate in area
- Walk score
E-commerce:
- Items viewed before purchase
- Cart abandonment rate
- Seasonal purchase patterns
Finance:
- Debt-to-income ratio
- Payment history score
- Account age
Feature selection
Not all features help. Some hurt performance or add unnecessary complexity.
Why select features?
- Remove noise
- Reduce overfitting
- Speed up training
- Improve interpretability
- Reduce storage/compute costs
Selection approaches
Filter methods:
Evaluate features independently:
- Correlation with target
- Statistical tests
- Information gain
Wrapper methods:
Evaluate feature subsets:
- Forward selection (add features one at a time)
- Backward elimination (remove features one at a time)
- Recursive feature elimination
Embedded methods:
Feature selection during training:
- Lasso regularization (drives unimportant weights to zero)
- Tree-based feature importance
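A minimal sketch of the two embedded approaches on synthetic data, where only two of five features actually matter:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients for irrelevant features shrink to ~0

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # higher = more useful to the trees
```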
Common pitfalls
Data leakage
Using information that wouldn't be available at prediction time:
Problem:
Predicting: Will customer churn?
Feature: customer_churned_flag # This IS the answer!
Subtle leakage:
- Using future data for past predictions
- Aggregates including target period
- Features derived from target
Prevention:
- Think carefully about what's available at prediction time
- Split data before feature engineering
- Review features for leakage
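A sketch of the "split first" rule, using a scaler as the learned transform (random data stands in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split FIRST, then fit any learned transform on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler().fit(X_train)    # sees training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # no test statistics leak in
```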
High cardinality
Too many categories:
Problem:
One-hot encoding of 10,000 categories = 10,000 columns
Solutions:
- Group rare categories
- Use embeddings
- Target encoding
- Hash encoding
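A minimal sketch of grouping rare categories with pandas (the city names and threshold of 5 are arbitrary):

```python
import pandas as pd

cities = pd.Series(["Sydney"] * 50 + ["Melbourne"] * 30
                   + ["Dubbo"] * 2 + ["Orange"])

# Lump categories below a frequency threshold into "other"
counts = cities.value_counts()
rare = counts[counts < 5].index
cities_grouped = cities.where(~cities.isin(rare), "other")
```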
Missing values
Handle thoughtfully:
Options:
- Remove rows (if few missing)
- Impute with mean/median/mode
- Impute with model predictions
- Create "is_missing" indicator
- Use models that handle missing values
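A sketch combining an "is_missing" indicator with median imputation (the income column is invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [50_000, np.nan, 82_000, np.nan, 61_000]})

# Keep an explicit flag, then fill the gaps with the median
df["income_missing"] = df["income"].isna().astype(int)
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()
```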
Feature engineering workflow
Process
- Explore data: Understand distributions, relationships
- Clean data: Handle missing values, outliers
- Transform features: Encode, scale, normalize
- Create features: Interactions, aggregations, domain features
- Select features: Remove unhelpful features
- Validate: Ensure no leakage, test performance
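One way to wire these steps together is a scikit-learn Pipeline; a minimal sketch, assuming hypothetical numeric and categorical columns:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups for a tabular dataset
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Bundling preprocessing with the model keeps every fit inside the
# training data (or cross-validation fold), guarding against leakage
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
```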
Iteration
Feature engineering is iterative:
- Start with basic features
- Train model, evaluate
- Analyze errors
- Create features to address errors
- Repeat
Common mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Data leakage | Overly optimistic results | Strict temporal splits |
| Over-engineering | Complexity without benefit | Start simple, add as needed |
| Ignoring domain knowledge | Missing obvious features | Consult domain experts |
| Not handling missing data | Model errors or bias | Explicit missing data strategy |
| Scaling before splitting | Data leakage | Fit the scaler on training data only |
What's next
Continue building ML skills:
- Machine Learning Fundamentals – ML basics
- Supervised vs Unsupervised – Learning types
- AI Data Privacy – Data handling best practices
Frequently Asked Questions
How many features should I create?
Quality over quantity. Start with features that make intuitive sense, evaluate their impact, and iterate. Too many features can cause overfitting and slow training. Let performance metrics guide you.
Should I always scale features?
Not always. Tree-based models (Random Forest, XGBoost) don't need scaling. Neural networks and distance-based algorithms do. Linear models benefit from scaling but can work without it.
How do I know if I have data leakage?
Suspiciously high accuracy is a red flag. Check whether any feature uses information from the future or encodes the target variable. Test on truly held-out data; if results look too good to be true, or performance drops sharply on new production data, investigate for leakage.
Is feature engineering still important with deep learning?
Less so for some data types (images, text) where deep learning learns features automatically. Still important for structured/tabular data. Even with deep learning, thoughtful input engineering often helps.
About the Authors
Marcin Piekarski • Founder & Web Developer
Marcin is a web developer with 15+ years of experience, specializing in React, Vue, and Node.js. Based in Western Sydney, Australia, he's worked on projects for major brands including Gumtree, CommBank, Woolworths, and Optus. He uses AI tools, workflows, and agents daily in both his professional and personal life, and created Field Guide to AI to help others harness these productivity multipliers effectively.
Credentials & Experience:
- 15+ years web development experience
- Worked with major brands: Gumtree, CommBank, Woolworths, Optus, Nestlé, M&C Saatchi
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in modern frameworks: React, Vue, Node.js
Prism AI • AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI: a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Capabilities:
- Powered by frontier AI models: Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google)
- Specializes in research synthesis and content drafting
- All output reviewed and verified by human experts
- Trained on authoritative AI documentation and research papers
Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication. AI helps with research and drafting, but human expertise ensures accuracy and quality.
Related Guides
Machine Learning Fundamentals: How Machines Learn from Data
Beginner: Understand the basics of machine learning. From training to inference, a practical introduction to how ML systems work without deep math or coding.
Supervised vs Unsupervised Learning: When to Use Which
Beginner: Understand the difference between supervised and unsupervised learning. Learn when to use each approach with practical examples and decision frameworks.
Data Preparation for AI: Getting Your Data Ready
Intermediate: Learn to prepare data for AI and machine learning. From cleaning to transformation, practical guidance for the often-overlooked work that makes AI possible.