Feature Store Tutorial: Feature Versioning 101

Shabnam Mokhtarani
July 21, 2023

Table of Contents

The Importance of Feature Versioning In Machine Learning

Why Is Feature Versioning Important?

ML Models Need Feature Versioning, Not Data Versioning

How Featureform Supports Versioning With Variants

Auto-Generated Variants: Enabling Ease-of-Use With Featureform

The Recommended Workflow From Experimentation To Production

The Importance of Feature Versioning In Machine Learning

Much like thrift shopping, machine learning modeling is an inherently iterative process with a lot of false starts and disappointment, made exciting by the occasional promise of a lucky find.

Unlike thrifters, data scientists must continuously iterate on features to improve the model’s performance. They can’t just throw their hands in the air and go “well, no vintage Chanel today!”.

Data scientists must continue pushing forward on their modeling efforts, training new models and tinkering with features to fine-tune the performance of previously deployed models.

Sometimes changes to the features improve a model’s predictions; sometimes they don’t, and the model needs to be retrained (or rolled back to a prior version). Versioning is key to this process and facilitates reproducibility by ensuring a clear trail of changes.

Why Is Feature Versioning Important?

Feature versioning offers significant benefits in the following areas:

Experiment Tracking: Versioning allows you to unlock the power of model experimentation in a systematic, structured way. When you experiment with different features and transformations, you need to be able to reproduce your steps accurately. Versioning the transformation logic allows you to isolate logic changes because you can match outcome to action, i.e., determine whether the business logic or the way the feature is calculated changed between versions.

Repeatability & Reproducibility: Versioning helps ensure repeatability by allowing you to keep track of the various iterations of your experiments as well as the original inputs and transformation logic, and makes it easier to reproduce your work if needed. Feature versioning also enables reproducibility between environments and data scientists, ensuring that collaborators are able to see an exact snapshot of the feature state of the dataset using to train the model. 

Collaboration and Communication: How can multiple team members work on the same project simultaneously and without conflicts? If you're working in a team, versioning becomes even more critical. It allows other data scientists to understand what changes you've made and why. Documentation through comments, descriptions, tags, and other metadata helps preserve situational and contextual knowledge about the project for both current and future collaborators.

Eliminating Copy-Pasta: With versioning, you can see what you have done previously, saving time and effort in re-running similar experiments or re-creating the same features. Additionally, versioning allows you to reuse both the logic and the computed values directly, and even share them with project collaborators, thereby minimizing wheel reinvention. What data scientist hasn’t experienced toiling for days or weeks on a query or transformation pipeline only to find out that another team or teammate already had a working copy?

Governance and Compliance: In heavily regulated industries like banking and insurance, there’s a good chance you’ll need to explain how features are built for regulatory purposes. Feature lineage provides a clean, auditable trail from raw data to features in production, and versioning provides a clear record of every change made to the models, the data, and the features. It ensures transparency and makes the feature lifecycle auditable, helping tick all the compliance boxes by giving a detailed account of how machine learning models are built and run. It also allows the exact replication of models, a real boon during audits or checks.

On top of that, versioning is key for effective model management. It simplifies tracking model evolution, ensuring smooth running and solid performance. And when it comes to explaining their models' decisions, banks and insurance companies can rely on versioning for a clear, transparent narrative.

ML Models Need Feature Versioning, Not Data Versioning 

While tools like Git and Data Version Control (DVC) have improved the reproducibility and trackability of code and data, they weren’t originally designed to manage feature versions in the specific context of a data science workflow.

Versioning Features with Git

  • Poor Scalability: Git was designed to version control source code, not large or unstructured data files. When repositories become too large, it can become slow or even fail. This is problematic for data science projects, which frequently involve large datasets or binary data (like images or model weights).

  • No Data Lineage: Git is not designed to track the lineage of data, i.e., the process by which data is transformed and used in various parts of your workflow. Git doesn’t encode the DAG that creates the features, which results in a loss of metadata and the dependency chain. This is a critical feature in data science projects. Feature versioning requires versioning not just the feature but all of its lineage (the steps from source to create it) so it can be recreated at any time. This is a cleaner way to manage features than treating every intermediary or every feature as its own dataset, which neither Git nor DVC support.

  • Poor Version Juggling: Truthiness with regard to features and training sets resembles the truthiness of the multiverse more than a singular Source-of-Truth. Often we have models using different versions of features, especially in scenarios where multiple models are deployed for experimentation and testing. Do we put those all in the same file? Is each one in a different file? If a version is no longer used, do we delete the file in Git? Do we just maintain a file for every version of the feature that ever existed? The answers are unclear, and if you tried to solve it you'd end up re-inventing a heavily scaled-down version of a feature store.

Versioning Features with DVC

  • Lack of Full Integration: DVC is an add-on to Git and not natively integrated into it. This can sometimes lead to confusion and inconsistencies. For example, DVC does not natively support Git's branching features, and you have to explicitly check out data when you switch branches. Switching between versions is challenging, and so is using multiple versions of the data at the same time.

  • Minimal Metadata Management: DVC does well at version-controlling datasets, but it does not provide a robust mechanism for versioning features and transformations. This is especially true when dealing with features that are created through complex pipelines or workflows.

  • Resource (In)Efficient: Because DVC is unaware that features are often very similar and may even derive from the same dataset, it treats each one independently. It doesn't create a DAG or optimize based on the differences in logic between feature versions. It simply treats every feature and transformation as a brand-new dataset with no tie to the others.

How Featureform Supports Versioning With Variants

Why Say Variants Instead of Versions? 

"Variant" can be a more fitting term than "versioning" in the data science and machine learning arena.

 

Here's why:

Variants depict different forms of something coexisting, aligning more with the reality in machine learning where different dataset transformations or feature combinations aren't sequential versions, but parallel alternatives under simultaneous test and trial.

The term "variant" encapsulates the experimental ethos of machine learning and underscores the diversity between two feature sets or models, which can be vastly different rather than just incremental changes. 

So while "versioning" is a staple in software development, "variant" more accurately conveys the rich diversity and non-linear progression of changes typical in the machine learning world.

Tracking Versions: Setting Variants In Featureform

Using variants, you can easily version your transformations and feature definitions. 

Creating a new transformation with an initial variant

In the example below, we’ll register a SQL-based transformation that takes a customer transaction dataset and computes the user’s average transaction.
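The snippet below is a minimal sketch of that registration. It assumes a Postgres provider and a transactions source table are available; the provider name, table name, and column names are placeholders rather than values from the tutorial.

```python
import featureform as ff

# Placeholder Postgres provider; assumes these connection details exist.
postgres = ff.register_postgres(
    name="postgres-demo",
    host="0.0.0.0",
    port="5432",
    user="postgres",
    password="password",
    database="postgres",
)

# Placeholder source table of raw customer transactions.
transactions = postgres.register_table(
    name="transactions",
    variant="default",
    table="Transactions",
    description="Raw customer transaction records",
)

# Initial variant of the transformation: average transaction amount per user.
@postgres.sql_transformation(variant="all_time")
def average_user_transaction():
    """Average transaction amount per user across the full history."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{transactions.default}} GROUP BY user_id"
    )
```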

Creating additional variants of an existing transformation 

We may want to experiment with the user’s average transaction in different windows. 

In the code below, we can register the average_user_transaction with 30_day, 7_day, and 3_day variants.
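A sketch of those additional variants follows. The Timestamp column and the windowing SQL are assumptions for illustration; the point is that each window becomes its own named variant of the same transformation.

```python
# Register new variants of the same transformation, one per time window.
# The "Timestamp" column and interval filters are illustrative assumptions.
@postgres.sql_transformation(variant="30_day")
def average_user_transaction():
    """Average transaction amount per user over the last 30 days."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{transactions.default}} "
        "WHERE Timestamp > NOW() - INTERVAL '30 days' GROUP BY user_id"
    )

@postgres.sql_transformation(variant="7_day")
def average_user_transaction():
    """Average transaction amount per user over the last 7 days."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{transactions.default}} "
        "WHERE Timestamp > NOW() - INTERVAL '7 days' GROUP BY user_id"
    )

@postgres.sql_transformation(variant="3_day")
def average_user_transaction():
    """Average transaction amount per user over the last 3 days."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{transactions.default}} "
        "WHERE Timestamp > NOW() - INTERVAL '3 days' GROUP BY user_id"
    )
```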

Featureform supports versioning not just for features and transformations but for all resources, including sources, features, labels, and training sets, using the same, simplified syntax shown above.

All resources allow you to set:

  • Name - In case you’d like a clear, descriptive name for your resource.
  • Variant - A version you can refer back to for tracking and lineage.
  • Description - A string that is displayed (along with all the other metadata about the resource) in the dashboard.
  • Tags - Metadata that can be either a list (tags), a set of key-value pairs (properties), or both, used to add additional grouping to resources.
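As a rough illustration, here is how those fields might be set on a single registration call; the specific values are made up, and keyword support can vary slightly by resource type.

```python
# Illustrative only: the same metadata fields are available across resources.
transactions = postgres.register_table(
    name="transactions",                      # clear, descriptive name
    variant="raw_2023_07",                    # version to refer back to for lineage
    table="Transactions",
    description="Raw customer transactions exported from the payments database",
    tags=["raw", "payments"],                 # list-style tags
    properties={"owner": "data-platform"},    # key-value properties
)
```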

Auto-Generated Variants: Enabling Ease-of-Use With Featureform

The Power of Versioning Without The Manual Effort

Have you started a new data science project or role, only to find out that in order for your pipeline to make it into production, you’d have to use a painfully assembled kludge of tools before writing a single line of SQL or Python? 

Have you used a data science development tool that promised all the power of a production-grade library with the ease of a Kaggle notebook, only to feel lied to as you troubleshoot error after error trying to use their API in developing an MVP model? 

Using Featureform’s auto-generated variants capability, data science practitioners can jump straight into developing and evaluating feature logic while experiencing the benefits of versioning. 

What Do We Mean By Auto-Generated Variants?

Featureform’s auto-generated variants are similar to Docker’s and GitHub’s, where users can quickly create a repo or start a container without having to rack their brains for a relevant, descriptive, and concise name.

Specifically, if resources (features, labels, training datasets, etc) don’t have a variant defined, Featureform will provide a randomly generated string as the variant name.

Referencing An Auto-Generated Variant

To reference an auto-generated variant, simply use the source name without specifying a variant.

For example, we can chain transformations and calculate statistics on average_user_transaction without referencing a specific variant within the curly brackets.
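A sketch of that chaining is below; because nothing follows the source name inside the braces, Featureform resolves it to the auto-generated variant.

```python
# Chained transformation: no variant inside the braces, so Featureform uses
# the auto-generated variant of average_user_transaction.
@postgres.sql_transformation()
def transaction_stats():
    """Summary statistics over the per-user average transaction amounts."""
    return (
        "SELECT AVG(avg_transaction_amt) AS mean_avg_transaction, "
        "MAX(avg_transaction_amt) AS max_avg_transaction "
        "FROM {{average_user_transaction}}"
    )
```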

Organizing Feature Development With Named Runs

What if you’d like to group and specifically reference variants from the same run through the script of your project?

Referencing Named Runs                                                        

Use the set_run() method to create named runs. Named runs can also make lineage tracing easier by allowing you to search for the variant name to get everything from that run. 
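A brief sketch of a named run, continuing the placeholder provider from earlier; resources registered after set_run() without an explicit variant pick up the run name.

```python
import featureform as ff

# Name the run once; everything registered afterwards without an explicit
# variant uses "experiment_jun_13", so the whole run is easy to find later.
ff.set_run("experiment_jun_13")

@postgres.sql_transformation()  # variant defaults to the run name
def average_user_transaction():
    """Average transaction amount per user."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{transactions}} GROUP BY user_id"
    )
```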

By using auto-generated variants and setting session run names once, you’re less prone to mistyping or having to juggle names across notebook cells. 

Just set it and forget it and feel confident that none of your work will be lost. 

And by making it easy for data scientists to do the right thing, versioning their features and transformations and collaborating more closely, more time can be diverted to the hard, valuable work of developing new, innovative models and products.

Materializing Features For Serving: Registering & Applying

When you’re finished defining your features and would like to materialize them for serving, either for training or for inference (or both!), you’ll register them with Featureform, apply the definitions for materialization, and then access the values.

Register Features

We’ve named the run “experiment_jun_13”, which will be used when a variant hasn’t been provided (such as with the label), but we’ll register specific feature variants as “30_day”, “7_day”, and “3_day”.
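One possible shape of that registration is sketched below using the entity-class style of definition. The entity and column names, the redis inference store, and the ff.get_source lookups are all assumptions for illustration, not the tutorial’s exact code.

```python
import featureform as ff

ff.set_run("experiment_jun_13")  # fallback variant for anything registered without one

# Assumed handles to the three transformation variants defined earlier.
avg_txn_30 = ff.get_source("average_user_transaction", "30_day")
avg_txn_7 = ff.get_source("average_user_transaction", "7_day")
avg_txn_3 = ff.get_source("average_user_transaction", "3_day")

# "redis" refers to a pre-registered Redis inference store (assumed).
@ff.entity
class User:
    # Features registered with explicit variants matching their source windows.
    avg_transactions_30d = ff.Feature(
        avg_txn_30[["user_id", "avg_transaction_amt"]],
        variant="30_day", type=ff.Float32, inference_store=redis,
    )
    avg_transactions_7d = ff.Feature(
        avg_txn_7[["user_id", "avg_transaction_amt"]],
        variant="7_day", type=ff.Float32, inference_store=redis,
    )
    avg_transactions_3d = ff.Feature(
        avg_txn_3[["user_id", "avg_transaction_amt"]],
        variant="3_day", type=ff.Float32, inference_store=redis,
    )
    # Label registered without a variant: it falls back to "experiment_jun_13".
    fraudulent = ff.Label(
        transactions[["CustomerID", "IsFraud"]], type=ff.Bool,
    )
```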


Register Training Set

We’ll zip together the features and labels in a training set that can be fetched to train a model.
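A sketch of that training set registration, continuing the same assumptions; the names are illustrative.

```python
# Zip the feature variants and the label into a training set. With no variant
# supplied, the training set also picks up the run name "experiment_jun_13".
ff.register_training_set(
    "fraud_training",
    label=User.fraudulent,
    features=[
        User.avg_transactions_30d,
        User.avg_transactions_7d,
        User.avg_transactions_3d,
    ],
)
```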

Apply Then Serve

We’ll use the apply command to materialize the resources that have been defined previously. Then we’ll serve the features for model training.
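A sketch of that apply-and-serve step is below. The host, client options, and training-set lookup assume a recent Featureform client where registration and serving share one Client object; apply can equally be run from the featureform CLI.

```python
import featureform as ff

# Apply the definitions above, materializing them on the registered providers.
# (Roughly equivalent to running `featureform apply <definitions file>` from the CLI.)
client = ff.Client(host="localhost:7878", insecure=True)
client.apply()

# Serve the training set to train a model.
dataset = client.training_set("fraud_training", "experiment_jun_13")
for row in dataset:
    features, label = row.features(), row.label()
    # ... feed each (features, label) pair into your training loop here
```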

The Recommended Workflow From Experimentation To Production

What does the feature engineering and model development life cycle look like for an empowered data scientist using Featureform to version and document their features?

How to Use Featureform To Support Versioning

  • During the development and experimentation stage, use Featureform’s auto-generated variants capability to jump straight into the hard work of analyzing, understanding, and generating features from your data sources. Feel confident that, with minimal manual set-up and configuration, your feature logic and values will be versioned in case you need to put your project down for a bit.

  • If you like a bit more structure and better grouping, create a named run using Featureform’s set_run() method. 

  • Once you’re confident in your feature logic and would like to materialize the features for model training, serving, and experimentation, set the individual variant names for tracking and lineage. 

Conclusion

Feature versioning is crucial to unlocking the value of data for machine learning in a reproducible, collaborative manner that also supports governance and compliance.

Just as habits are the bedrock to success, seemingly innocuous best practices and capabilities like feature versioning can unlock production machine learning and data science that scales.

After this tutorial, you’ve developed a deeper understanding of:

  • Why feature versioning matters;
  • Why data & code versioning tools (like DVC & Git) aren’t well suited for versioning machine learning features;
  • How open-source Featureform supports the data science workflow;
  • How to bridge the dev-prod gap using a feature store like Featureform. 
