Feature Engineering Fundamentals
Feature Engineering & The Data Science Workflow
Feature Engineering Deep-Dive
The Lifecycle of a Feature
How can we transform the data we’ve collected into valuable and performant machine learning products?
The answer: Through feature engineering.
Data can be generated and used to describe events, objects and concepts.
For example, data can include purchase transactions, an individual’s social media feed, or the current economic status of a country and its population.
Data can be gathered manually (e.g. collected during surveys) or automatically (e.g. through sensors, applications, etc) and can take different types and forms.
Data can loosely be organized in three categories of data structures: unstructured, semi-structured, and structured data.
Typically, raw data can’t be used as a direct input to a machine learning model unless it has already been transformed and structured upstream. That’s why feature engineering is essential for data scientists and companies creating predictive models from everyday data.
The most important purpose of feature engineering is to improve the performance of a machine learning model or pipeline.
If a model is a meal, the result of the data used to train it, then features are the recipe that links the model back to the data. Data scientists derive and structure their datasets so that the model can optimally learn the relationships between features and targets. Not all features are created equal, and the goal is to curate and create the subset of features that provides the greatest predictive power for a machine learning model.
Feature engineering is particularly helpful in projects where datasets are small (<10K rows) and as much information as possible needs to be extracted.
Effective feature engineering requires a combination of subject matter expertise, problem definition, exploratory data analysis, and iteration through the transformation-selection-evaluation cycle in order to achieve the best results.
Ultimately, a data scientist’s goal is to transform their data into a structure that best represents the underlying problem their machine learning algorithms are attempting to model.
Effective feature engineering reduces the computational and storage costs of a model and improves latency for both training models and serving predictions.
These savings follow directly from smaller, simpler inputs, with ROI increasing alongside application performance and user satisfaction.
Feature engineering improves computational effectiveness by reducing the volume of data the model must process. In other words, it reduces computational costs by using only the 20% of data and features that drive 80% of the predictive power.
Machine learning models continue to be under scrutiny.
Model interpretability, defined as “the degree to which a human can consistently predict the model’s result”, is highly valuable and even required in many machine learning use cases.
Model interpretability is essential for ensuring fairness, privacy, reliability, robustness, causality, and trust.
In other words, interpretability matters in any situation where models can have a significant impact on users, directly or indirectly. This includes individuals not using the model as well as the larger society.
Feature engineering can assist with model interpretability, especially for supervised learning models working with structured, tabular data. Model interpretability tools are also important in fine-tuning feature engineering pipelines when combined with model evaluation.
Why are data scientists still performing feature engineering by hand? Won’t manual feature engineering be replaced by deep learning and generative AI? What about AutoML?
All models require high-quality datasets and manual feature engineering plays an important role in creating such datasets.
And while deep learning models are becoming increasingly popular for automated feature learning, classical supervised learning models can still provide better results in many cases and remain ubiquitous throughout industry.
Additionally, feature engineering and data preparation can be beneficial for deep learning pipelines as well.
For example, if a data scientist wanted to train a deep learning model from scratch for object detection or language generation, they’d still want to ensure they have the best-quality dataset possible, and that might require creating a dataset through manual or automated data collection and hand-labeling (or labeling using weak supervision).
While data preparation does involve writing code to “fill in holes” (imputation) or transform data types (like a timestamp to a datetime), feature engineering involves non-code tasks like understanding the dataset: how it was collected, the definitions of each of the columns, the complex structure of how tables in a database relate to each other.
Whatever relationships a data science model can't automatically learn must be provided by the data scientist.
Even when a data scientist isn’t training a deep learning model from scratch, projects utilizing transfer learning will still benefit from high-quality datasets created through direct collection, augmentation (transformations applied to images, videos, or text), and cleaning and filtering.
Domain expertise is one factor as to why manual feature engineering will still be important, with or without help from automated feature engineering and feature learning methods.
A machine learning model is a learned mapping of inputs to a target (in supervised learning). A model won’t flag that the dataset is wrong, that columns are missing, or that the calculations for a particular problem are incorrect. Machine learning models take the dataset and labels as the source of truth, whereas a human might pick out potential sources of data leakage, duplications, bias, or distributions that “feel” wrong.
An overlooked set of tasks a data scientist is responsible for is interfacing with upstream data teams: requesting additional data integrations, sourcing new data to augment existing datasets, working with business teams to confirm logic and expected values, and flagging systemic issues in the datasets (such as gaps caused by a failed backfilling operation).
Unless AutoML and deep learning are able to take on the vital task of negotiating with live humans with multiple priorities and different systems, manual data preparation and feature engineering will still be important.
A data scientist is working with a database with thousands of columns and hundreds of millions of rows.
Should they use AutoML exclusively to generate features or to replace the feature engineering cycle in their machine learning projects?
Reasons why they shouldn’t include:
Another important consideration is data dredging: finding spurious correlations within a dataset through over-analysis. While data dredging is still possible with manual feature engineering (and can be rampant in academic publishing), a data scientist can pause their analysis and use interpretability tools to understand which features are driving the model’s predictions.
Before diving into the feature engineering cycle and the data scientist workflow, let’s define key terms used in the guide.
Feature engineering serves as the bridge between data and models, occupying a critical stage of the machine learning lifecycle.
The process of feature engineering directly touches data science, business, product, and engineering due to the importance of subject matter expertise and problem definition in codifying business logic and processes as part of the data and feature processing pipelines.
Let’s first understand where the model development phase sits in the lifecycle of a machine learning project.
We’ll then return back to the model development phase and dig deeper into feature engineering, techniques and strategies.
Although some organizations focus on research and development of new machine learning techniques and algorithms, a majority of data science and machine learning models today are developed for the following use cases:
Cost optimization and revenue generation can be directly or indirectly achieved through:
Data science projects can be initiated through different means, including:
Data scientists typically work closely with business and product partners, even if the project is self-initiated, to coordinate engineering resources (data, MLOps, frontend), and to ensure the feature or product is integrated into the company's portfolio.
Once a problem or project has been initially identified, the data scientist starts to work with their business partners (who could be the product owners, the finance team, the customer success team, etc.) to define the business use case, formulate the problem as a machine learning or data science problem, and scope the project requirements.
The major questions a data scientist will need to answer include:
The data scientist will need to answer all these questions (and more) to understand, document, and coordinate the project and the relevant resources (including people).
Related activities that the data scientist will undertake during this phase include (but are not limited to):
In companies or organizations where data sources are disparate, knowledge is tribalized, and turnover is high, this phase of a data science project can take substantially longer than at companies where data is accessible, documented, well-understood, and clearly owned by a team accountable for the quality and knowledge transfer.
By the end of this phase a data scientist should have a project plan with a well-defined engineering architecture and specifications as well as a rough understanding of the project's feasibility.
A data scientist should exit the model development phase with a trained model, either in the form of a package, a library, a container, or as a job specification that helps pre-compute batch inferences.
Given the challenges that machine learning models present in production (including error handling, model performance degradation, and unintended behavior), models should be deployed with intention and consideration, not always through YOLO (pushing to production with 100% rollout as soon as the model is finalized).
What should be on the pre-flight checklist for model deployment?
At the beginning of the project, the data scientist should have started gathering information about data requirements, model requirements, serving strategy, deployment strategy, and maintenance.
Some of these requirements may change, especially for new models that don’t fit existing, supported patterns. Feasibility, changing product initiatives, and even external forces may change the implementation details of the model.
Regardless of the specific deployment pattern being used, the model should have met the following criteria before being deployed:
Assuming these conditions have been met, the model will then be deployed to either the staging environment or even to the production server. The model may be run offline (as part of testing and experimentation), online in a limited capacity with a percentage of traffic being routed to the model, or fully online.
After this point the data scientist may still be engaged with the project in an ad-hoc manner.
Although testing in production for traditional software can be seen as a relatively risky proposition and as a sign of immaturity in some organizations (or even as a UX concern), machine learning products buck the trend. Not only do models need to be tested in production but they will inevitably become live tests for a number of reasons. For example, shifts in the data distribution may occur.
Production data may include edge cases that couldn’t be anticipated because the data didn’t previously exist (for example, a bird-watching application whose model was trained on historical data of native birds, until an introduced invasive species suddenly starts appearing).
Models may encode biases that are only identified in production because the bias has been exposed to a large number of users in a short period.
A batch model pipeline supporting a popular site that was designed for a thousand users may suddenly spike when a particularly viral campaign drives 1 million users in a single day; the model may need to be quantized and the pipeline re-architected.
Even if a model is operating under ideal conditions, change is inevitable, whether due to external forces (such as changes in users and usage) or internal forces (such as changes in production strategy and business operations).
There can be a number of ways to deliver value to the end user, and users won’t be able to explicitly tell the difference between an XGBoost implementation and a Random Forest model. The products and services themselves should therefore be decoupled from the model pipeline so that models can be retrained or rolled back from production by the data scientist (and related engineering teams).
Once a model has been deployed and is being actively monitored, the data scientist will then move on to the next project.
Now that we’ve covered the surrounding phases of the machine learning lifecycle, we return to the model development phase, where data scientists (should) spend a majority of their time.
Effective feature engineering is the “art” of the art & science of creating great machine learning products.
The main goal during the data cycle is to craft the best possible dataset, which the data scientist will use as the basis for the feature engineering cycle.
The data cleaning, data transformation, and exploratory data analysis steps will be iterated through as the data scientist:
Techniques a data scientist may leverage during the exploratory data analysis step to identify and describe the dataset include:
Additional questions a data scientist will try to answer during the dataset engineering phase include:
Based on the insights and answers to the questions above, there are a number of operations a data scientist can perform on a dataset to increase, decrease, or change the composition of the prepared dataset such as:
For example, a data scientist may observe or identify that they don’t have labels (or the labels are incorrect) in the dataset. Labeling and annotation are important tasks that increase the number of examples a machine learning model can be trained on as well as open up additional feature candidates for feature engineering. Correct labeling can also mitigate bias and decrease noise.
What if a data scientist observes they have too few training examples in their dataset or don’t have the variety of data that is necessary for describing the training example?
Data scientists can augment their dataset by acquiring additional data, either through locating previously unknown datasets or even by scraping or accessing data banks.
A data scientist can also change the composition of their dataset through sampling, an especially important technique when working with datasets that suffer from imbalanced classes in a classification problem.
Upsampling describes using techniques (like duplication or synthetic generation) to increase the representation of the minority class.
Downsampling describes using techniques to decrease the representation of the majority class(es).
There are a number of techniques a data scientist can use as long as they remember to split their dataset BEFORE using the sampling techniques to avoid leakage issues.
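For instance, here is a minimal upsampling sketch using scikit-learn, assuming a hypothetical DataFrame df with a binary label column where 1 is the minority class (all names here are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Split FIRST, then resample only the training set, so duplicated
# minority rows can't leak into the test set.
train, test = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

minority = train[train["label"] == 1]
majority = train[train["label"] == 0]

# Upsample the minority class by duplication until it matches the majority.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)
train_balanced = pd.concat([majority, minority_upsampled])
```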
The dataset engineering phase results in four important outcomes:
At the end of dataset preparation, the data scientist will perform train-test splitting before the feature engineering cycle. Feature engineering will take place on the training dataset, with the results of feature selection and iteration evaluated on the test and holdout sets.
A popular splitting strategy is allocating 80%-10%-10% of the dataset’s instances (or rows) to the train, test, and holdout sets; however, other splitting schemes can be used, especially for time-series data.
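As a rough sketch of the 80/10/10 scheme with scikit-learn (X and y are hypothetical feature and target objects; time-series data would instead use a chronological split):

```python
from sklearn.model_selection import train_test_split

# First carve off 20% of the rows, then split that remainder in half,
# yielding an 80% / 10% / 10% train / test / holdout split.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_holdout, y_test, y_holdout = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```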
After a data scientist has prepared the dataset, the next step is to begin engineering features.
The goal of the feature engineering cycle is to transform and select the highest-signal set of features that will help the model learn the underlying patterns without overfitting to the point that it can’t generalize to new, unseen instances.
Feature engineering is a messy process that data scientists iterate through during the model development phase. At times data scientists might need to go back upstream to enrich their existing dataset or fix issues they’ve identified.
Data scientists will also use the findings from the model training & evaluation stage to try new transformations or different subsets of features for the final model.
Although a data scientist will use a mix of manual, programmatic, and algorithmic techniques during feature engineering, this stage is ultimately human-driven, as most data scientists rely on intuition derived from domain or subject-matter expertise. Data scientists aren’t just looking for the best-performing features; they’re also trying to understand the drivers of predictive power.
The model training cycle marks the last stage of the model development phase.
During this stage data is passed to a model (or series of models) for training and evaluation.
While the goal of training is ensuring the model learns the necessary patterns to perform inference, the goal of evaluation is ensuring the model will generalize beyond the training set.
The typical process (assuming all goes well) for a supervised learning model is as follows:
Once a data scientist has finalized a trained model, it is containerized and deployed to production according to the organization's specifications.
At this point the data scientist should have engaged the necessary engineering and product resources for the following phases of the model development lifecycle and have an approved plan that covers:
The Model Development Phase can be the most challenging aspect of the machine learning lifecycle, on both the timeline of a project as well as on the patience of all involved parties (including data science, product, and engineering).
While the lifecycle is depicted below as a relatively linear process overall, with dataset engineering, feature engineering, and model training depicted as internal cycles, the data scientist might still be forced to return to a prior step or cycle due to the experimental nature of data science.
For example, a data scientist could roughly prepare their dataset and features only to find that the models perform poorly due to a lack of data or because of issues in the upstream data sources.
Data scientists might be required to work with a large, poorly documented dataset, tasked with winnowing thousands of badly labeled columns or potential features down to the most impactful 100 to avoid the curse of dimensionality. They may try to select and condense the various columns using techniques (that we’ll further explore in later sections) while quickly training and discarding temporary models based on the techniques being used.
Dataset and feature engineering remain the challenging “art” of data science and in the next section we’ll describe the various tools and techniques data scientists can use to craft effective features.
“At the heart of any machine learning model is a mathematical function that is defined to operate on specific types of data only. At the same time, real-world machine learning models need to operate on data that may not be directly pluggable into the mathematical function.” – Valliappa Lakshmanan, Sara Robinson, Michael Munn (Machine Learning Design Patterns, O’Reilly)
Feature engineering is the vital link between data and models, as well as the data science and business teams within a company.
Feature engineering is where assumptions about the business logic, the state of the data, and even a company’s appetite for machine learning products is tested.
Once a data scientist has a prepared dataset, what practices and tools do they have at their disposal to engineer the highest quality features possible?
The process of feature engineering comprises four steps.
Data scientists can jump between, and iterate through, any of these steps as needed.
Feature transformation is the most recognizable step within feature engineering.
Feature transformation (also called “feature engineering”) involves creating new features by transforming existing features.
For example a dataset could contain all purchases made during the last 12 months for a small e-commerce shop. Rather than using the timestamp of each purchase, the data scientist might care more about the date, day of the week, or time of the purchase. The data scientist could choose to use the timestamp in its original form (with some reformatting) or they could choose to create 3 new features or columns.
Features can be transformed through simple mathematical calculations (such as subtracting or adding numeric quantities), through statistical procedures (such as calculating the distribution of a column of values), and through techniques such as one-hot encoding.
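A minimal sketch of the purchase-timestamp example in pandas (the DataFrame and column names are illustrative):

```python
import pandas as pd

# Hypothetical purchases table with a raw timestamp column.
purchases = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2023-01-02 09:15", "2023-01-07 21:40"])}
)

# Derive three new features from the raw timestamp.
purchases["purchase_date"] = purchases["timestamp"].dt.date
purchases["day_of_week"] = purchases["timestamp"].dt.day_name()
purchases["hour_of_day"] = purchases["timestamp"].dt.hour

# One-hot encode the categorical day-of-week feature.
purchases = pd.get_dummies(purchases, columns=["day_of_week"])
```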
Feature extraction is the process of creating new features when the raw features can’t be used directly.
The boundary between feature transformation and feature extraction can be a little blurry.
For example, some models are unable to handle NaNs or categorical variables, or require all features to be of the same data type, whereas other models might not have those constraints.
Quite often feature transformation and extraction are collectively referred to as “feature engineering”.
Examples of techniques that are used include imputation (in the case of NaNs), encoding techniques, and other methods described later.
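For example, a sketch of median imputation with scikit-learn (the array here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A numeric column containing NaNs that many models can't handle directly.
X = np.array([[25.0], [np.nan], [40.0], [np.nan]])

# Replace each missing value with the column median.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)  # [[25.], [32.5], [40.], [32.5]]
```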
Two challenges can occur as a result of feature extraction and feature transformation: Feature explosion and the “Curse of Dimensionality”.
Feature explosion is when the number of identified features increases disproportionately to the actual desired number of features. This can be because data scientists are crossing or combining multiple columns or are templating features, thereby cheaply creating a large number of features.
A large number of features can also push a dataset into the realm of high-dimensionality, invoking phenomena such as increased sparsity and decreased search efficiency and discoverability. This collection of phenomena is described as the Curse of Dimensionality.
Examples of techniques that can be used to combat feature explosion include: regularization, kernel methods, and feature selection.
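As one illustration, L1 regularization can act as an embedded guard against feature explosion by zeroing out the coefficients of low-signal features (a sketch assuming a hypothetical, already-standardized X_train DataFrame and target y_train):

```python
from sklearn.linear_model import LassoCV

# L1 regularization drives the coefficients of low-signal features
# to exactly zero, performing feature selection during training.
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)
selected = [col for col, coef in zip(X_train.columns, lasso.coef_) if coef != 0]
print(f"kept {len(selected)} of {X_train.shape[1]} features")
```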
Techniques used to combat the curse of dimensionality include dimensionality-reduction methods such as PCA (principal component analysis).
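A minimal PCA sketch with scikit-learn (X_train is a hypothetical wide feature matrix; features are standardized first because PCA is sensitive to scale):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then keep however many principal components are
# needed to retain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, "components retained")
```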
Feature learning is the process of automatically constructing features.
For example, deep learning models can create embeddings from video, image, and text data.
Common examples of feature learning include k-means clustering, independent component analysis, PCA, and multilayer neural networks.
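As a small illustration of feature learning, the distances to k-means cluster centroids can serve as automatically constructed features (a sketch assuming a hypothetical numeric matrix X_train):

```python
from sklearn.cluster import KMeans

# Fit k-means, then use the distance from each sample to each of the
# k centroids as k learned features.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_train)
cluster_distances = kmeans.transform(X_train)  # shape: (n_samples, 8)
```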
With the rise of deep learning, some have predicted that feature engineering pipelines would no longer be necessary.
However, many organizations and data scientists still use feature engineering (extraction and transformation techniques) in conjunction with feature learning, for improved model interpretability and increased computational efficiency in live pipelines.
Feature selection is the step in feature engineering that is most closely connected to the model training and evaluation cycle.
A data scientist will iterate between:
Why is feature selection and the ability to quickly group and ungroup features in sets necessary?
More data should generally be a good thing, as adding more features (up to a point) will improve performance. And as a model lives in production, the number of available features will continue to increase alongside data maturity, documentation, and instrumentation.
As the number of features in a dataset grows, so too do the chances of:
Technical debt also increases as the number of features increases.
A common data engineering horror story involves an entire pipeline failing because a malformed value entered the dataset as the model was retrained, or because an unanticipated value or data type was sent to the model in production for inference.
Features can add complexity and cost without a commensurate increase in ROI, so judicious selection of features is essential, especially when adding new features from datasets or sources where maintenance and support are questionable.
We’ve discussed how features are a necessary and yet costly component of machine learning pipelines.
When done well, engineered features enable interpretability, versioning, and experimentation.
Features can also contribute to technical debt as their definitions change, opportunities for weird exceptions and errors grow, and features eventually become stale.
What does the lifecycle of a feature look like?
What makes a great feature? Why should a feature be promoted for use in training a model or for production?
Features should meet two criteria: high importance and high generalizability.
Feature Importance
Feature importance techniques try to capture how much each feature contributes to the model prediction. Conceptually this is accomplished by measuring how much a model’s performance deteriorates if the feature or set of features is removed from the model. The actual calculation and formulation depends on the algorithm and its implementation, with feature importance for a tree-based model calculated differently from a linear model, etc.
Popular implementations of ML algorithms, such as LightGBM, CatBoost, Random Forest, and XGBoost, have feature importance built in.
This is why a common method of initially measuring feature importance is to use an interpretable model that won’t necessarily become the final trained model to help inform initial feature selection. The data scientist will not only get an initial understanding of the stack rank of their features, they’ll also glimpse how much signal is truly present in their dataset.
Feature importance scores can be calculated from correlation scores, the coefficients of linear models, the impurity-based importances of tree-based models, and permutation scores.
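A sketch contrasting a built-in importance with permutation scores, using scikit-learn (the train/test variables are hypothetical carry-overs from earlier in the workflow):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Built-in impurity-based importances (specific to tree models).
print(model.feature_importances_)

# Model-agnostic permutation importances: shuffle one feature at a
# time and measure how much the held-out score deteriorates.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)
```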
An example of a calculation that can be used to measure feature importance is “mutual information”.
Mutual information is used to measure associations between two quantities, such as features and labels.
While similar to correlation, mutual information is more powerful because it can also detect and quantify nonlinear relationships, which correlation cannot. Essentially, mutual information tries to answer the question: “How much does having information about a specific feature reduce uncertainty about the label?”
Although mutual information can’t detect interactions between features, it can still help identify feature candidates, and it’s easy to use as well as interpret.
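A minimal mutual-information sketch with scikit-learn (assuming a hypothetical numeric X_train DataFrame and classification labels y_train):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Higher score = knowing the feature reduces uncertainty about the label more.
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
mi_ranking = pd.Series(mi_scores, index=X_train.columns).sort_values(ascending=False)
print(mi_ranking)
```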
For models that don’t have feature interpretation methods built-in, there are other methods and techniques that are model-agnostic (i.e. specific algorithms aren’t required).
Model-agnostic techniques include Partial Dependence Plots, LIME (Local Interpretable Model-agnostic Explanations), and SHAP (SHapley Additive exPlanations).
The goal of calculating feature importance is to help inform feature selection.
Feature Generalizability
While feature importance speaks to the impact of a feature, feature generalizability focuses on the main goal of most machine learning projects: performing well on future, unseen examples.
Feature generalizability can be roughly estimated using two components: feature coverage and feature value distribution.
The higher the feature coverage, or the percentage of samples that have a value, the more generalizable the feature is. A feature that appears in a small percentage of samples is not going to generalize well unless there are systemic reasons why those values were missing.
We also need to understand the distribution of the feature values. If the distribution of a feature’s values differs significantly between the train and test datasets, the two sets likely weren’t drawn from the same population, and a model trained on one distribution will perform poorly in production regardless of the quality of the feature engineering.
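Both components can be roughly estimated in a few lines (a sketch assuming hypothetical train_df and test_df DataFrames with the same columns):

```python
from scipy.stats import ks_2samp

# Feature coverage: the fraction of rows with a non-null value per column.
coverage = train_df.notna().mean()
print(coverage.sort_values())

# Compare each numeric feature's train vs. test distribution with a
# two-sample Kolmogorov-Smirnov test; a small p-value suggests the
# samples were drawn from different distributions.
for col in train_df.select_dtypes("number").columns:
    stat, p_value = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
    if p_value < 0.01:
        print(f"{col}: possible distribution shift (KS p={p_value:.4f})")
```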
Feature importance and generalizability are the main tools for understanding how a feature performs, and how it interacts with other features.
Data scientists use these tools to determine which features to transform, extract, select, or learn in order to achieve the desired model performance and quantify the relationship between features and predictions.
Feature importance tools can hint at the “how” and “why” of a feature’s relationship to the predictions, helping data scientists take the first steps toward explainability.
***
This is Part 1 of a 3-part guide on feature engineering. Look out for our posts on embeddings and prompts! Interested in Featureform's Feature Store and Orchestrator? Check out our open-source repo!
See what a virtual feature store means for your organization.