The Feature Engineering Guide

Mikiko Bazeley
May 24, 2023

Table of Contents

Feature Engineering Fundamentals

  • Defining Feature Engineering
  • The Purposes of Feature Engineering
  • Why is Feature Engineering (Still) Important
  • Essential Terminology

Feature Engineering & The Data Science Workflow

  • The Lifecycle of a Machine Learning Project
  • Phase 1: The Problem Definition & Project Scoping Phase
  • Phase 2: Understanding The Model Development Phase
  • Phase 3: The Model Deployment Phase
  • Phase 4: The Model Maintenance Phase
  • The Data Cycle: Acquiring, Wrangling, & Exploring Data
  • The Feature Engineering Cycle
  • The Model Training Cycle
  • The Model Development Loop(s)

Feature Engineering Deep-Dive

  • The Four Steps of Feature Engineering
  • Feature Transformation
  • Feature Extraction
  • Feature Learning
  • Feature Selection

The Lifecycle of a Feature

  • Features Last As Long As They’re Useful
  • Feature Importance & Feature Generalization
  • Importance & Generalizability Drive The Feature Engineering Cycle

Feature Engineering Fundamentals

Defining Feature Engineering

Feature engineering refers to the process of transforming data into useful representations (features) to boost model performance, reduce computational footprints, and improve interpretability.

How can we transform the data we’ve collected into valuable and performant machine learning products? 

The answer: Through feature engineering. 

Data can be generated and used to describe events, objects and concepts. 

For example, data can include purchase transactions, an individual’s social media feed, or the current economic status of a country and its population.  

Data can be gathered manually (e.g. collected during surveys) or automatically (e.g. through sensors, applications, etc) and can take different types and forms. 

Data can loosely be organized in three categories of data structures: unstructured, semi-structured, and structured data.

Data Structures: unstructured, semi-structured, and structured data.

Typically raw data can’t be used as a direct input to a machine learning model unless that raw form has been transformed and structured upstream already. Feature engineering is essential for data scientists and companies to create predictive models from everyday data.

The Purposes of Feature Engineering

1. Improve Model Performance 

The most important purpose of feature engineering is to improve the performance of a machine learning model or pipeline. 

If a model is a meal made from the data used to train it, then features are the recipe that links the model back to the data. Data scientists derive and structure their datasets so that the model can optimally learn the relationships between features and targets. Not all features are created equal, and the goal is to curate and create the subset of features that provides the greatest predictive power for a machine learning model.

Feature Relevance

Feature engineering is particularly helpful in projects where datasets are small (<10K) and as much information as possible needs to be extracted from them.

Effective feature engineering requires a combination of subject matter expertise, problem definition, exploratory data analysis, and iteration through the transformation-selection-evaluation cycle in order to achieve the best results. 

Ultimately your goal is to transform your data into a structure that best represents the underlying problem that your machine learning algorithms are attempting to model. 

A data scientist’s goal is to structure their data to model the underlying problem for their machine learning algorithms.

2. Reduce Computational Costs

Effective feature engineering decreases the computational and storage costs of a model and reduces latency for both training models and serving predictions. 

Lower computational requirements translate directly into lower costs, with ROI increasing alongside application performance and user satisfaction.

Computational effectiveness is improved by feature engineering through:

  • Reducing the number of features, and consequently the amount of data, needed to be processed and stored for training;
  • Reducing the number of features and data in an API call for a live service;
  • Ensuring the data that is used is valuable and provides predictive power for the model, increasing its usefulness to users and value for the business;
  • Write once, serve twice – well written feature definitions that are versioned and tested can be mirrored for both training and serving, if a data scientist chooses to use a feature store. 
  • Snapshotting the exact business logic and definitions used for the model for future users and developers, including caching feature values. 

In other words, reduce computational costs by using only the 20% of the data and features that drive 80% of the predictive power.

3. Boost Model Interpretability

Machine learning models continue to be under scrutiny. 

Model interpretability, defined as “the degree to which a human can consistently predict the model’s result”, is highly valuable and even required in many machine learning use cases. 

Model interpretability is essential for ensuring fairness, privacy, reliability, robustness, causality, and trust.

In other words, interpretability matters in any situation where models can have a significant impact on users, directly or indirectly, including individuals who don't use the model and the larger society. 

Feature engineering can assist with model interpretability, especially for supervised learning models working with structured, tabular data. Model interpretability tools are also important in fine-tuning feature engineering pipelines when combined with model evaluation. 

Why is Feature Engineering (Still) Important

Why are data scientists still performing feature engineering by hand? Won’t manual feature engineering be replaced by deep learning and generative AI? What about AutoML? 

Supervised machine learning still dominates most industry use cases & deep learning can still benefit from feature engineering 

All models require high-quality datasets and manual feature engineering plays an important role in creating such datasets. 

And while deep learning models are becoming increasingly popular for automated feature learning, supervised learning models can still provide better results in many cases as well as being ubiquitous throughout industry.

Additionally, feature engineering and data preparation can be beneficial for deep learning pipelines as well. 

For example, if a data scientist wanted to train a deep learning model from scratch for object detection or language generation, they'd still want to ensure they have the best quality dataset possible, and that might require creating a dataset through manual or automated data collection and hand-labeling (or labeling using weak supervision).  

While data preparation does involve writing code to “fill in holes” (imputation) or transform data types (like a timestamp to a datetime), feature engineering involves non-code tasks like understanding the dataset: how it was collected, the definitions of each of the columns, the complex structure of how tables in a database relate to each other. 
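To make the code half of data preparation concrete, here's a minimal pandas sketch of the two tasks mentioned above, imputation and a timestamp-to-datetime conversion; the column names (signup_ts, age) are hypothetical.

    import pandas as pd

    # Hypothetical raw data with a missing age and timestamps stored as strings.
    df = pd.DataFrame({
        "signup_ts": ["2023-01-05 10:30:00", "2023-02-11 18:02:00", "2023-03-01 09:15:00"],
        "age": [34, None, 29],
    })

    # "Fill in holes" (imputation): replace missing ages with the median age.
    df["age"] = df["age"].fillna(df["age"].median())

    # Transform data types: parse the raw timestamp strings into datetimes.
    df["signup_ts"] = pd.to_datetime(df["signup_ts"])

    print(df.dtypes)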

Whatever relationships a data science model can't automatically learn must be provided by the data scientist.

Even when a data scientist isn't training a deep learning model from scratch, projects utilizing transfer learning will still benefit from high-quality datasets created through direct collection, augmentation (transformations applied to images, video, or text), and cleaning and filtering.

Automated feature engineering won’t replace data engineering or domain knowledge

Domain expertise is one factor as to why manual feature engineering will still be important, with or without help from automated feature engineering and feature learning methods.

A machine learning model is a learned mapping of inputs to a target (in supervised learning). A machine learning model won't flag that the dataset is wrong, that columns are missing, or that the calculations for a particular problem are incorrect. Machine learning models take the dataset and labels as the source of truth, whereas a human might be able to pick out potential sources of data leakage, duplications, bias, or distributions that “feel” wrong. 

An overlooked set of tasks a data scientist is responsible for is interfacing with upstream data teams: requesting additional data integrations, sourcing new data to augment existing datasets, working with business teams to confirm logic and expected values, and flagging systemic issues in the datasets (such as gaps caused by a failed backfilling operation). 

Unless AutoML and deep learning are able to take on the vital task of negotiating with live humans with multiple priorities and different systems, manual data preparation and feature engineering will still be important. 

AutoML methods can be expensive & manual feature engineering can be quicker, cheaper, & less likely to data dredge

A data scientist is working with a database with thousands of columns and hundreds of millions of rows.

Should they use AutoML exclusively to generate features or to replace the feature engineering cycle in their machine learning projects?

Reasons why they shouldn’t include:

  • Time – Unless the data scientist narrows down the amount of data an AutoML solution needs to iterate through, the process might be incredibly time-consuming, delaying downstream processes such as evaluation, deployment, and A/B testing. 
  • Cost – If the data scientist doesn’t have access to sufficient compute and memory, the project will cost additional resources to scale.
  • Efficiency – Sometimes it’s faster if the data scientist performs manual engineering, or uses AutoML to supplement manual feature engineering. If a data scientist has an idea of the types of transformations they can create due to prior experiences or research, they should just start with those and then experiment with how different transforms impact the model performance.

Another important consideration is data dredging: finding spurious correlations within a dataset through over-analysis. While data dredging is still possible with manual feature engineering and can be rampant in academic publishing, a data scientist can pause their analysis and use interpretability tools to understand which features are driving the model's predictions.

Essential Terminology

Before diving into the feature engineering cycle and the data scientist workflow, let’s define key terms used in the guide. 

Feature Engineering Cycles

Datasets versus ML Pipeline versus Data Source

  • Dataset – A structured collection of data, oftentimes in tabular format (although not always). During the data science lifecycle, source datasets are eventually broken into subsets of the original dataset for training models as well as evaluating feature engineering pipelines. 
  • ML Pipeline – A sequence of steps that transform data into predictions or labels. An ML pipeline could encompass the early stage of the ML lifecycle (including data preparation, feature engineering, and model training) or could be decoupled into an offline training and online serving set of ML pipelines. 
  • Data Source – A data source is used to provide data for model training and serving. Data sources can be unstructured and raw or they can be structured and ready for modeling. Data sources can come directly from system logs and they can also include datasets created by other data scientists for prior projects.

Instances Versus Rows Versus Training Example

  • Instance – A single item (oftentimes a row in a dataset) or observation. Typically an instance is an item you’d like to send to your model for prediction. 
  • Training examples — Sometimes instances are called “training examples”, especially if they’re associated with features (and a label in supervised learning) and being used for training. 

Inputs versus Features

  • Input — Describes a single column in your dataset before it’s been processed.
  • Feature — Describes a single column after it has been processed.

Label versus Prediction versus Inference

  • Label – Also called “Ground truth”, in supervised machine learning projects the label will be used to train your model and also to evaluate the efficacy of your feature engineering pipeline.  
  • Prediction – Describes the process of using a machine learning model to guess the target value or label. For example, given a real estate dataset, you’d like to predict the value of a house.
  • Inference – Describes the process of understanding how the target value or label is generated as a function of the corresponding dataset. For example, you'd like to understand how the different features in a real estate dataset (like location, crime rate, average neighborhood, proximity to industrial zones, etc.) impact the price of a house. 

Feature Definitions versus Feature Values vs Feature Engineering

  • Feature – A combination of the feature’s definition and values. 
  • Feature Definition — The logic and code that describes the feature, including the transformation that produced the feature values. 
  • Feature Values — The resulting output when a feature definition is applied to the corresponding input.
Features, Feature Definitions, & Feature Values
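As a minimal illustration of the distinction, here's a short Python sketch; the input column (purchase_amount) and the function name are hypothetical.

    import pandas as pd

    # Inputs: unprocessed columns from the dataset.
    inputs = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "purchase_amount": [20.0, 35.0, 5.0, 15.0, 10.0],
    })

    # Feature definition: the logic and code that describes the feature.
    def avg_purchase_amount(df: pd.DataFrame) -> pd.Series:
        """Average purchase amount per customer."""
        return df.groupby("customer_id")["purchase_amount"].mean()

    # Feature values: the output of applying the definition to the inputs.
    feature_values = avg_purchase_amount(inputs)
    print(feature_values)  # customer 1 -> 27.5, customer 2 -> 10.0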

Feature Engineering & The Data Science Workflow

Feature engineering serves as the bridge between data and models, occupying a critical stage of the machine learning lifecycle. 

The process of feature engineering directly touches data science, business, product, and engineering due to the importance of subject matter expertise and problem definition in codifying business logic and processes as part of the data and feature processing pipelines. 

The Lifecycle of a Machine Learning Project

Let’s first understand where the model development phase sits in the lifecycle of a machine learning project.

Lifecycle of a Machine Learning Project

We’ll then return back to the model development phase and dig deeper into feature engineering, techniques and strategies. 

Phase 1: The Problem Definition & Project Scoping Phase
The Problem Definition & Project Scoping Phase

Although some organizations focus on research and development of new machine learning techniques and algorithms, a majority of data science and machine learning models today are developed for the following use cases: 

  • Cost cutting and optimization
  • Revenue generation.

Cost optimization and revenue generation can be directly or indirectly achieved through:

  • Launching new features or product avenues
  • Process automation
How Data Science Projects Get Started in Two Scenarios: Business Request Down & Dataset Up

Data science projects can be initiated through different means, including: 

  • Dataset-up: Either through exploration of a dataset or collection of datasets, data scientists identify trends or patterns that require deeper exploration and analysis. The findings indicate an opportunity for the company to (directly or indirectly) increase revenue, cut costs, or both. 
  • Business Request-down: Business partners initiate a request for data science resources to be assigned to a problem (which can range from well-defined and scoped to undefined with no established scope).
Data Science Project Matrix: Dataset Driven vs Problem Driven, Non-Urgent Exploration vs Business Critical Needs

Data scientists typically work closely with business and product partners, even if the project is self-initiated, to coordinate engineering resources (data, MLOps, frontend), and to ensure the feature or product is integrated into the company's portfolio.

Once a problem or project has been initially identified, the data scientist starts to work with their business partners (who could be the product owners, the finance team, the customer success team, etc.) to define the business use case, formulate the problem as a machine learning or data science problem, and scope the project requirements. 

The major questions a data scientist will need to answer include:

  • Data Requirements – What are the data requirements and needs for the project?
  • Model Requirements – What is the data science formulation of the problem and what types of models can be used to solve the formulation?
  • Serving Strategy – How will the model inferences be served to the end users? Is there a particular latency requirement that needs to be met? Will model inferences be served via pre-computed batch inferences (like a table), a live service for real-time inference, or as a unique pattern like an embedded model? 
  • Deployment Strategy – How will the model (or model pipeline) be deployed? Will the model go through A/B/n testing, rolled out gradually, or will some other deployment strategy be used?   
  • Maintenance Plan – How will the model be retrained? What happens if we need to rollback the model? How will the model be monitored? How can we collect and further enrich the data should the model need to be retrained? 

The data scientist will need to answer all these questions (and more) to understand, document, and coordinate the project and the relevant resources (including people).

Related activities that the data scientist will undertake during this phase include (but are not limited to): 

  • Creating a shortlist of what kind of data would be needed to train the machine learning  model
  • Discovering what data is available (and what data isn’t)
  • Understanding how the data was gathered, how it's being maintained, and by whom
  • Reviewing prior projects undertaken at the company
  • Searching the literature if the problem is novel or the data scientist doesn’t have ideas readily available
  • Communicating with teammates and colleagues for best practices, advice or sanity checking the project plan. 

In companies or organizations where data sources are disparate, knowledge is tribalized, and turnover is high, this phase of a data science project can take substantially longer than at companies where data is accessible, documented, well-understood, and clearly owned by a team accountable for the quality and knowledge transfer.

By the end of this phase a data scientist should have a project plan with a well-defined engineering architecture and specifications as well as a rough understanding of the project's feasibility.

Phase 3: The Model Deployment Phase
The Model Deployment Phase

A data scientist should exit the model development phase with a trained model, either in the form of a package, a library, a container, or as a job specification that helps pre-compute batch inferences. 

Given the challenges that machine learning models in production present (including consequences such as error handling, model performance, and unintended behavior) it would be wise for models to be deployed with intention and consideration and not always through YOLO (pushing to production as soon as the model is finalized with 100% rollout). 

What should be on the pre-flight checklist for model deployment?

At the beginning of the project, the data scientist should have started gathering requirements for data, models, the serving strategy, the deployment strategy, and maintenance.

Some of these requirements may change, especially for new models that don’t fit existing, supported patterns. Feasibility, changing product initiatives, and even external forces may change the implementation details of the model.

Regardless of the specific deployment pattern being used, the model should have met the following criteria before being deployed:

Model Deployment Checklist
  • Clear and well-defined product and model owners
  • Tests (unit, integration, end-to-end) of each layer of the model
    - Code
    - Data 
    - Models
  • Documentation for both current and future consumers and developers of the model
  • Versioning for repeatability, reproducibility, replicability
    - Model
    - Data
    - Code
  • Sign-off from product & legal teams
  • Model tracking with dataset and feature lineage

Assuming these conditions have been met, the model will then be deployed to either the staging environment or even to the production server. The model may be run offline (as part of testing and experimentation), online in a limited capacity with a percentage of traffic being routed to the model, or fully online. 

After this point the data scientist may still be engaged with the project in an ad-hoc manner.

Phase 4: The Model Maintenance Phase

Although testing in production for traditional software can be seen as a relatively risky proposition and as a sign of immaturity in some organizations (or even as a UX concern), machine learning products buck the trend. Not only do models need to be tested in production but they will inevitably become live tests for a number of reasons. For example, shifts in the data distribution may occur.

The Model Maintenance Phase

Production data may include edge cases that couldn't be anticipated because the data didn't previously exist (for example, a bird-watching application whose model was trained on historical data of native birds, when an introduced invasive species suddenly starts appearing). 

Models may encode biases that are only identified in production because the bias has been exposed to a large number of users in a short period. 

A batch model pipeline designed for a thousand users may suddenly spike because a particularly viral campaign drove 1 million users to the site in a single day; the model may need to be quantized and the pipeline re-architected. 

Even if a model is operating under ideal conditions, change is inevitable, either due to external forces (such as changes in users and usage) as well as internal forces (such as changes in production strategy and business operations). 

Because there can be a number of ways to deliver value to end users, and because they won't be able to explicitly tell the difference between an XGBoost and a Random Forest implementation, the products and services themselves should be decoupled from the model pipeline so that models can be retrained or rolled back from production by the data scientist (and related engineering teams).

Once a model has been deployed and is being actively monitored, the data scientist will then move on to the next project. 

Phase 2: Understanding The Model Development Phase

Now that we’ve covered the surrounding phases of the machine learning lifecycle, we return to the model development phase, where data scientists (should) spend the majority of their time. 

Effective feature engineering is the “art” of the art & science of creating great machine learning products. 

The Model Development Phase: Dataset, Feature, & Model Engineering
The Data Cycle: Acquiring, Wrangling, & Exploring Data

The main goal during the data cycle is to craft the best possible dataset, which the data scientist will use as the basis for the feature engineering cycle.

The Dataset Engineering Cycle: Data Sources are prepared into a Dataset
The steps that comprise dataset engineering (not to be confused with the discipline of data engineering) are as follows:
  • Acquire and import data from data sources. Specifically, 
    1. Identify and locate data sources.
    2. Acquire data of interest from data sources, either manually or programmatically (using SQL, Python, or similar tools) and import into workspace.
  • Prepare data. This includes:
    1. Data cleaning
    2. Data transformation 
    3. Exploratory data analysis

The data cleaning, data transformation, and exploratory data analysis steps will be iterated through as the data scientist:

  • Examines the structure and characteristics of the dataset, attempting to understand the amount of data, data types, number of columns or attributes, and data quality issues;

  • Manually or programmatically applies transformations to the dataset to address issues or create new columns;

  • Analyzes and inspects the transformed data through analytical, visual, or statistical methods, with the goal of understanding the quality of the data, identifying trends or patterns that could be beneficial (or detrimental) to engineering high-quality features, and surfacing any questions about the dataset that need to be answered. 

Techniques a data scientist may leverage during the exploratory data analysis step to identify and describe the dataset include:

  • Identification of outliers using box-plots
  • Describing the spread of values using density plots and histograms
  • Discovering bivariate relationships between candidate features and labels using scatter plots
  • Describing the central tendency and spread of a candidate feature (or features) by calculating the average, median, and standard deviation. 
  • And so on…
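A minimal sketch of a few of these exploratory steps with pandas and matplotlib, assuming a hypothetical prepared dataset with price and sqft columns:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("housing.csv")  # hypothetical prepared dataset

    # Structure, central tendency, and spread of candidate features.
    df.info()
    print(df[["price", "sqft"]].describe())  # mean, std, quartiles

    # Outliers via box plot, spread via histogram, bivariate relationship via scatter plot.
    df["price"].plot.box(); plt.show()
    df["sqft"].plot.hist(bins=30); plt.show()
    df.plot.scatter(x="sqft", y="price"); plt.show()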

Additional questions a data scientist will try to answer during the dataset engineering phase include: 

  • What are the distributions of the potential feature candidates?
  • How many missing values are there and should they be handled? 
  • Are there outliers?
  • Are any input values highly correlated?
  • What features exist in the input data & which features should be engineered?
  • Is there enough data? How can we augment the dataset?
  • Is there bias in the dataset?

Based on the insights and answers to the questions above, there are a number of operations a data scientist can perform on a dataset to increase, decrease, or change the composition of the prepared dataset such as:

  • Labeling
  • Augmentation
  • Sampling 

For example, a data scientist may observe or identify that they don’t have labels (or the labels are incorrect) in the dataset. Labeling and annotation are important tasks that increase the number of examples a machine learning model can be trained on as well as open up additional feature candidates for feature engineering. Correct labeling can also mitigate bias and decrease noise. 

What if a data scientist observes they have too few training examples in their dataset or don’t have the variety of data that is necessary for describing the training example? 

Data scientists can augment their dataset by acquiring additional data, either through locating previously unknown datasets or even by scraping or accessing data banks. 

A data scientist can also change the composition of their dataset through sampling, an important technique especially when working with datasets that suffer from imbalanced classes in a classification problem.

Upsampling describes using techniques (like duplication or synthetic generation) to increase the representation of the minority class. 

Downsampling describes using techniques to decrease the representation of the majority class(es). 

There are a number of techniques a data scientist can use as long as they remember to split their dataset BEFORE using the sampling techniques to avoid leakage issues.
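For example, here's a minimal sketch of upsampling by duplication, with the split performed first so the test set stays untouched; the dataset and label column are hypothetical.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical imbalanced dataset: roughly 1 positive for every 4 negatives.
    df = pd.DataFrame({
        "amount": range(100),
        "label": [1 if i % 5 == 0 else 0 for i in range(100)],
    })

    # Split FIRST to avoid leaking duplicated minority rows into the test set.
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=42
    )

    # Upsample the minority class in the training set only (simple duplication).
    minority = train_df[train_df["label"] == 1]
    majority = train_df[train_df["label"] == 0]
    upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
    train_balanced = pd.concat([majority, upsampled]).sample(frac=1, random_state=42)

    print(train_balanced["label"].value_counts())  # balanced training classes
    print(test_df["label"].value_counts())         # original, untouched distribution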

The dataset engineering phase results in four important outcomes:

  1. The prepared datasets will form the basis of the feature engineering pipelines that eventually get pushed to production. Data processing and feature engineering pipelines for production models will need to take messy, real-world data and apply reproducible design patterns to ensure models are performant and resilient. The data scientist may begin writing a portion of that code during the dataset engineering step. 

  2. Based on how the dataset engineering step proceeds, the data scientist may gain insight into the data that changes the data science formulation of the business problem. 

  3. The data scientist may provide feedback to upstream data producers and teams about quality issues, unclear documentation, or data definitions. 

  4. The data scientist produces a transformed and prepared dataset that can now be used for feature engineering.

At the end of dataset preparation, the data scientist will perform train-test splitting before the feature engineering cycle. Feature engineering will take place on the training dataset, with the results of feature selection and iteration evaluated on the test and holdout sets.

A popular splitting strategy is allocating 80%-10%-10% of the dataset's instances (or rows) to the train, test, and holdout sets; however, other splitting schemes can be used, especially for time-series data.  
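A minimal sketch of an 80%-10%-10% split using two passes of scikit-learn's train_test_split, assuming a hypothetical prepared dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("prepared_dataset.csv")  # hypothetical prepared dataset

    # First carve off 80% for training, then split the remaining 20% in half.
    train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
    test_df, holdout_df = train_test_split(rest_df, test_size=0.5, random_state=42)

    print(len(train_df), len(test_df), len(holdout_df))  # ~80% / 10% / 10%
    # Note: for time-series data, a chronological split should be used instead.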

The Feature Engineering Cycle

After a data scientist has prepared the dataset, the next step is to begin engineering features. 

The goal of the feature engineering cycle is to transform and select the highest signal set of features that will help the model learn the underlying patterns while not overfitting so much that the model is incapable of generalizing to new, unseen instances. 

The Feature Engineering Cycle: Dataset to Feature Engineering to Features & Labels

Feature engineering is a messy process that data scientists iterate through during the model development phase. At times data scientists might need to go back upstream to enrich their existing dataset or fix issues they’ve identified. 


Data scientists will also use the findings from the model training & evaluation stage to try new transformations or different subsets of features for the final model. 

Although a data scientist will use a mix of manual, programmatic, and algorithmic techniques during feature engineering, this stage is ultimately human-driven as most data scientists are using intuition derived through domain or subject matter expertise. Data scientists aren’t just looking for the best performing features, they’re also trying to understand the drivers of predictive power. 

  • Is there a chance that a feature performs poorly on its own but performs powerfully in combination with other features, because it measures an interaction between features?
  • Is a feature performing well because of data leakage, a phenomenon where information about the true label has leaked into the training dataset? 

At the end of the feature engineering stage, the data scientist will have a prepared dataset with engineered features, ready for model training.

The Model Training Cycle

The model training cycle marks the last stage of the model development phase. 

During this stage data is passed to a model (or series of models) for training and evaluation.

While the goal of training is ensuring the model learns the necessary patterns to perform inference, the goal of evaluation is ensuring the model will generalize beyond the training set. 

The Model Training & Evaluation Cycle

The typical process (assuming all goes well) for a supervised learning model is as follows:

  1. Model Training: Train the model (or a set of model algorithms) on the training set.
  2. Model Evaluation: Evaluate the trained model(s) on the test set.
  3. Hyperparameter Tuning: Based on the performance of the model(s), pick the best one(s) and perform hyperparameter tuning.
  4. Model Validation: Take the tuned model and validate the model on the holdout or validation set. 
  5. Stop, Return, or Go: Based on the performance, either return to the feature engineering stage (or the earlier train-test split) to engineer or select new sets of features if performance is still poor, or promote the model(s) to the deployment phase if they meet the offline performance criteria the data scientist established at the beginning of the project.
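Here's a minimal scikit-learn sketch of steps 1 through 4 above; the synthetic dataset and the random forest / F1 choices are illustrative assumptions, not prescriptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic stand-in for the engineered features and labels.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
    X_test, X_holdout, y_test, y_holdout = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

    # 1. Model training on the training set.
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # 2. Model evaluation on the test set.
    print("test F1:", f1_score(y_test, model.predict(X_test)))

    # 3. Hyperparameter tuning of the chosen model.
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
        scoring="f1", cv=3,
    ).fit(X_train, y_train)

    # 4. Model validation on the holdout set.
    best = search.best_estimator_
    print("holdout F1:", f1_score(y_holdout, best.predict(X_holdout)))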

Once a data scientist has finalized a trained model, it is containerized and deployed to production according to the organization's specifications.

At this point the data scientist should have engaged the necessary engineering and product resources for the following phases of the model development lifecycle and have an approved plan that covers:

  • Deployment – Will the model be A/B tested alongside other models? Will the model be fully rolled out immediately or will a different rollout strategy be utilized? 
  • Serving - Are we doing batch inference jobs, or real-time? Are there throughput or latency issues that we anticipate? How will those be handled? What about spiky workloads?
  • Retraining - How is retraining managed? Does it happen on new batches of data or in real time? Is retraining triggered by changes in production model performance? 
  • Data - What data is needed for training and serving? Does the data processing and feature engineering pipeline need to be mirrored?

The Model Development Loop(s) 

The Model Development Phase can be the most challenging aspect of the machine learning lifecycle, on both the timeline of a project as well as on the patience of all involved parties (including data science, product, and engineering). 

While the lifecycle is depicted below as a relatively linear process overall, with dataset engineering, feature engineering, and model training depicted as internal cycles, the data scientist might still be forced to return to a prior step or cycle due to the experimental nature of data science.

Model Development Loops: Dataset Engineering Cycle, Feature Engineering Cycle, Model Training & Evaluation Cycle

For example, a data scientist could roughly prepare their dataset and features only to find that the models perform poorly due to a lack of data or because of issues in the upstream data sources. 

Data scientists might be required to work with a large dataset that's poorly documented, tasked with winnowing thousands of badly labeled columns or potential features down to the most impactful 100 to avoid the curse of dimensionality. They may try to select and condense the various columns using techniques (that we'll further explore in later sections) while quickly training and discarding temporary models based on the techniques being used.

Dataset and feature engineering remain the challenging “art” of data science and in the next section we’ll describe the various tools and techniques data scientists can use to craft effective features.

Feature Engineering Deep-Dive

“At the heart of any machine learning model is a mathematical function that is defined to operate on specific types of data only. At the same time, real-world machine learning models need to operate on data that may not be directly pluggable into the mathematical function.” – Valliappa Lakshmanan, Sara Robinson, Michael Munn (Machine Learning Design Patterns, O’Reilly)

Feature engineering is the vital link between data and models, as well as the data science and business teams within a company. 

Feature engineering is where assumptions about the business logic, the state of the data, and even a company's appetite for machine learning products are tested. 

Once a data scientist has a prepared dataset, what practices and tools do they have at their disposal to engineer the highest quality features possible?

The Four Steps of Feature Engineering

The process of feature engineering comprises four steps. 

Data scientists can jump between, and iterate through, any of these steps as needed.

The Four Steps of Feature Engineering: Feature Transformation, Feature Extraction, Feature Selection, and Feature Learning

Feature Transformation 

Feature transformation is the most recognizable step within feature engineering. 

Feature transformation (also called “feature engineering”) involves creating new features by transforming existing features. 


For example, a dataset could contain all purchases made during the last 12 months for a small e-commerce shop. Rather than using the raw timestamp of each purchase, the data scientist might care more about the date, day of the week, or time of the purchase. The data scientist could choose to use the timestamp in its original form (with some reformatting) or they could choose to create three new features or columns. 


Features can be transformed through simple mathematical calculations (such as subtracting or adding numeric quantities), through statistical procedures (such as calculating the distribution of a column of values), and through techniques such as one-hot encoding.
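A minimal pandas sketch of the purchase-timestamp example above, plus one-hot encoding of the derived day-of-week column (the column names are hypothetical):

    import pandas as pd

    purchases = pd.DataFrame({
        "purchase_ts": pd.to_datetime([
            "2023-01-05 10:30:00", "2023-01-07 22:15:00", "2023-02-11 08:02:00",
        ]),
        "amount": [20.0, 35.0, 12.5],
    })

    # Three new features derived from the original timestamp.
    purchases["purchase_date"] = purchases["purchase_ts"].dt.date
    purchases["day_of_week"] = purchases["purchase_ts"].dt.day_name()
    purchases["hour"] = purchases["purchase_ts"].dt.hour

    # One-hot encode the categorical day-of-week feature.
    purchases = pd.get_dummies(purchases, columns=["day_of_week"], prefix="dow")
    print(purchases.head())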

Feature Extraction

Feature extraction is the process of creating new features when the raw features cannot be used directly.

Feature Extraction Techniques: Appearance-Based Approach, Feature-Based Approach, Template-Based Approach, Part-Based Approach

The boundary between feature transformation and feature extraction can be a little blurry.

For example, some models are unable to handle NaNs or categorical variables, or require all features to be of the same data type, whereas other models might not have those constraints.

Quite often feature transformation and extraction are collectively referred to as “feature engineering”. 

Examples of techniques that are used include imputation (in the case of NaNs), encoding techniques, and other methods described later.

Two challenges can occur as a result of feature extraction and feature transformation: Feature explosion and the “Curse of Dimensionality”.

Feature explosion is when the number of identified features increases disproportionately to the actual desired number of features. This can be because data scientists are crossing or combining multiple columns or are templating features, thereby cheaply creating a large number of features. 

A large number of features can also push a dataset into the realm of high-dimensionality, invoking phenomena such as increased sparsity and decreased search efficiency and discoverability. This collection of phenomena is described as the Curse of Dimensionality.

Examples of techniques that can be used to combat feature explosion include regularization, kernel methods, and feature selection.

Techniques used to combat the curse of dimensionality include reduction techniques such as PCA (principal component analysis). 
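A minimal sketch of PCA on a wide, synthetic dataset; the low-rank structure is simulated so that a handful of components capture most of the variance.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    # Simulate 200 correlated columns driven by ~10 underlying factors.
    latent = rng.normal(size=(500, 10))
    X = latent @ rng.normal(size=(10, 200)) + 0.1 * rng.normal(size=(500, 200))

    # Scale first: PCA is sensitive to feature scale.
    X_scaled = StandardScaler().fit_transform(X)

    # Keep enough components to explain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)
    print(X.shape, "->", X_reduced.shape)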

Feature Learning


Feature learning is the process of automatically constructing features. 

For example, deep learning models can be used to create embeddings from video, images, and text. 

Common examples of feature learning include k-means clustering, independent component analysis, PCA, and multilayer neural networks. 
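As a small illustration, here's a k-means sketch where the learned cluster distances become new features; the synthetic data stands in for a prepared dataset.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 5))  # synthetic stand-in for prepared features

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

    cluster_id = kmeans.predict(X)   # a learned categorical feature
    distances = kmeans.transform(X)  # distance to each cluster center
    X_with_learned = np.hstack([X, distances])
    print(X_with_learned.shape)      # original 5 features + 4 learned features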

 

With the rise of deep learning, some have predicted that feature engineering pipelines would no longer be necessary. 

However, many organizations and data scientists still use feature engineering (extraction and transformation techniques) in conjunction with feature learning, for improved model interpretability and increased computational efficiency in live pipelines.

Feature Selection

Feature selection is the step in feature engineering that is most closely connected to the model training and evaluation cycle. 

Feature Selection Methods: Filters, Embedded Methods, Hybrid Methods, & Wrappers

A data scientist will iterate between: 

  • Creating different variations of datasets and features
  • Selecting different sets of features
  • Training a battery of models on those different variants
  • Using model performance metrics as a basis for determining hyperparameter tuning
  • Using interpretability tools to understand how specific features or sets of features contribute to a model’s prediction
  • And so on.
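As one illustration of the selection step in this loop, here's a minimal sketch of a wrapper-style method (recursive feature elimination) on synthetic data; the estimator and the number of features to keep are arbitrary choices for the example.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

    # Recursively drop the weakest features until only five remain.
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
    print("selected feature indices:",
          [i for i, keep in enumerate(selector.support_) if keep])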

Why are feature selection and the ability to quickly group and ungroup sets of features necessary?

Generally, more data should be a good thing, as adding more features (up to a point) will improve performance. And as a model lives in production, the number of available features will continue to increase alongside data maturity, documentation, and instrumentation. 

As the number of features in a dataset grows, so too do:

  • The chances of data leakage
  • The chances of overfitting
  • The computational requirements for serving
  • Prediction latency

Technical debt also increases as the number of features increases. 

A common data engineering horror story involves an entire pipeline failing because a malformed value entered a dataset as the model was retrained, or because an unanticipated value or data type was sent to the model in production for inference. 

Features can add complexity and cost without a commensurate increase in ROI, so judicious selection of features is essential, especially when adding new features from datasets or sources where maintenance and support are questionable. 

The Lifecycle of a Feature

We’ve discussed how features are a necessary and yet costly component of machine learning pipelines. 

When done well, engineered features enable interpretability, versioning, and experimentation. 

Features can also contribute to technical debt as their definitions change, opportunities for weird exceptions and errors grow, and features eventually become stale. 

What does the lifecycle of a feature look like?

Features Last As Long As They’re Useful

Lifecycle of a Feature: project planning, development, deployment, & maintenance

  1. Data scientist brainstorms a feature wishlist for the project based on the available domain and subject matter expertise, as well as existing insights into matching data availability.
  2. Data acquired from various sources, pieced together at various levels of cardinality (i.e. transactions by date joined to customers, further joined to the corresponding marketing campaign). 
  3. Data is processed and transformed into centralized datasets.
  4. Data explored by data scientists for initial insights and to identify potential feature candidates.
  5. Feature definitions created based on the current understanding of business (product, marketing, finance, etc.) definitions. 
  6. Feature definitions are applied to inputs to calculate feature values. Feature definitions are refined depending on whether feature values fall within data scientist expectations. 
  7. Datasets, features (definitions & values), and labels are documented and versioned to assist with experimentation and reproducibility. Great practices ensure that features (definitions and values) can be used in other projects, by other teams, and swapped without throwing away any work performed by the data scientist. 
  8. Feature values are served for Model Training & Evaluation.
  9. Features promoted or discarded depending on feature importance and generalizability.
  10. Feature definitions are mirrored from development to production, minimizing train-serve skew as much as possible.
  11. Data processing & feature definitions applied to production data & resulting features are served for inference.
  12. (Usually) Feature values are updated as new data comes in and models are retrained. Old feature values can be deleted or appended (then deleted after a period of time) depending on governance policies. 
  13. (Sometimes) Feature definitions are updated as business logic changes. 
  14. (Eventually) Discontinuation of a feature. Inputs and feature definitions are saved in case future data scientists would like to reproduce or use in the future.

Feature Importance & Feature Generalization

What makes a great feature? Why should a feature be promoted for use in training a model or for production?  

Features should meet two criteria: high importance and high generalizability. 

Feature Importance

Feature importance techniques try to capture how much each feature contributes to the model prediction. Conceptually this is accomplished by measuring how much a model’s performance deteriorates if the feature or set of features is removed from the model. The actual calculation and formulation depends on the algorithm and its implementation, with feature importance for a tree-based model calculated differently from a linear model, etc. 

Popular implementations of ML algorithms such as LightGBM, CatBoost, Random Forests, and XGBoost have feature importance built in. 

This is why a common initial method of measuring feature importance is to use an interpretable model, one that won't necessarily become the final trained model, to help inform early feature selection. The data scientist not only gets an initial understanding of the stack rank of their features, they also get a glimpse of how much signal is truly present in their dataset. 

Feature importance scores can be calculated from correlation scores, coefficients (of linear & tree-based models), and permutation scores.  
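A minimal sketch comparing a tree model's built-in importances with permutation importances on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    print("built-in importances:", model.feature_importances_.round(3))

    # Permutation importance: how much does shuffling each feature hurt performance?
    perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
    print("permutation importances:", perm.importances_mean.round(3))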

An example of a calculation that can be used to measure feature importance is “mutual information”. 

Mutual information is used to measure associations between two quantities, such as features and labels. 

While similar to correlation, mutual information is more powerful because it can also detect and quantify nonlinear relationships. Essentially, mutual information tries to answer the question: “How much does having information about a specific feature reduce uncertainty about the label?” 

Although mutual information can’t be used to detect interactions between features, it can still help you identify feature candidates and it’s easy to use as well as interpret. 
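A minimal sketch of scoring candidate features with mutual information using scikit-learn; higher scores indicate features that remove more uncertainty about the label.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=6, n_informative=2, random_state=42)

    mi_scores = mutual_info_classif(X, y, random_state=42)
    for i, score in enumerate(mi_scores):
        print(f"feature_{i}: MI = {score:.3f}")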

For models that don’t have feature interpretation methods built-in, there are other methods and techniques that are model-agnostic (i.e. specific algorithms aren’t required).

Model-agnostic techniques include Partial Dependence Plots, LIME (Local Interpretable Model-agnostic Explanations), and SHAP (SHapley Additive exPlanations). 

The goal of calculating feature importance is to help inform feature selection.

Feature Generalizability

While feature importance speaks to the impact of a feature, feature generalizability focuses on the main goal of most machine learning projects: performing well on future, unseen examples. 

Feature generalizability can be roughly estimated using two components: feature coverage and feature value distribution.

The higher the feature coverage, or the percentage of samples that have a value, the more generalizable the feature is. A feature that appears in a small percentage of samples is not going to generalize well unless there are systemic reasons why those values were missing. 

We also need to understand the distribution of the feature values. If the distribution of a feature's values differs significantly between the train and test datasets, there's a good chance they were drawn from different underlying distributions. A model trained on data from one distribution is going to perform poorly in production on data from another, regardless of the quality of feature engineering.
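A minimal sketch of both checks, feature coverage and a train-versus-test distribution comparison; the tiny DataFrames and the use of a Kolmogorov-Smirnov test (scipy) are illustrative assumptions.

    import pandas as pd
    from scipy.stats import ks_2samp

    # Hypothetical feature values with some missing entries.
    train_df = pd.DataFrame({"income": [50, 60, None, 75, 80, 90, None, 65]})
    test_df = pd.DataFrame({"income": [150, 160, 170, None, 180]})

    # Feature coverage: the percentage of samples that have a value.
    print(f"train coverage: {train_df['income'].notna().mean():.0%}")

    # Compare the train and test value distributions; a low p-value suggests they differ.
    stat, p_value = ks_2samp(train_df["income"].dropna(), test_df["income"].dropna())
    print(f"KS statistic: {stat:.2f}, p-value: {p_value:.3f}")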

Importance & Generalizability Drive The Feature Engineering Cycle

Feature importance and generalizability are the main tools for understanding how a feature performs, and how it interacts with other features.

Data scientists use these tools to determine which features to transform, extract, select, or learn in order to achieve the desired model performance and quantify the relationship between features and predictions.

Feature importance tools can hint at the "how" and "why" of a feature's relationship to the predictions, helping data scientists take the first steps towards explainability.

***************************************************************************************************************************

This is Part 1 of a 3-part guide on feature engineering. Look out for our posts on embeddings and prompts! Interested in Featureform's Feature Store and Orchestrator? Check out our open-source repo!
