Much like thrift shopping, machine learning modeling is an inherently iterative process with a lot of false starts and disappointment, made exciting by the occasional promise of a lucky find.
Unlike thrifters, data scientists must continuously iterate on features to improve the model’s performance. They can’t just throw their hands in the air and go “well, no vintage Chanel today!”.
Data scientists must continue pushing forward on their modeling efforts, training new models and tinkering with features to fine-tune model performance on previously deployed models.
Sometimes changes in the features improve a model’s predictions and sometimes they don’t and the model will need to be retrained (or rolled back to a prior model). Versioning is key to this process and facilitates reproducibility by ensuring a clear trail of changes.
Feature versioning offers significant benefits in the following areas:
Experiment Tracking: Versioning allows you to unlock the power of model experimentation in a systematic, structured way. When you experiment with different features and transformations, you need to be able to reproduce your steps accurately. Feature versioning of the transformation logic allows you to isolate logic changes because you’re able to match outcome to action i.e. did changing the business logic or the way the feature is calculated change between versions.
Repeatability & Reproducibility: Versioning helps ensure repeatability by allowing you to keep track of the various iterations of your experiments as well as the original inputs and transformation logic, and makes it easier to reproduce your work if needed. Feature versioning also enables reproducibility between environments and data scientists, ensuring that collaborators are able to see an exact snapshot of the feature state of the dataset using to train the model.
Collaboration and Communication: How can multiple team members work on the same project simultaneously and without conflicts? If you're working in a team, versioning becomes even more critical. It allows other data scientists to understand what changes you've made and why. Documentation through the use of comments, descriptions, tags and other metadata help preserve situational and contextual knowledge about the project for both current and future collaborators.
Eliminating Copy-Pasta: With versioning, you can see what you have done previously, thereby saving time and effort in re-running similar experiments or re-creating the same features. Additionally versioning allows you to use both logic and values directly, even sharing with project collaborators, thereby minimizing wheel reinvention. What data scientist hasn’t experienced toiling for days or weeks on a query or transformation pipeline only to find out that another team or teammate had a working copy already implemented?
Governance and Compliance: In heavily regulated industries like banking and insurance, there’s a good chance you’ll need to explain how features are built for regulatory purposes. Feature lineage provides a clean, auditable trail from raw data to features in production. Versioning provides a clear and auditable record of every change made to the models, the data, and the features. It ensures transparency and makes the feature lifecycle auditable, helping tick all the compliance boxes by giving a detailed account of how machine learning models are built and run. It also allows the exact replication of models - a real boon during audits or checks.
On top of that, versioning is key for effective model management. It simplifies tracking model evolution, ensuring smooth running and solid performance. And when it comes to explaining their models' decisions, banks and insurance companies can rely on versioning for a clear, transparent narrative.
While tools like Git and Data Version Control (DVC) have improved the reproducibility and trackability of code and data, they weren’t originally designed to manage feature versions in the specific context of a data science workflow.
"Variant" can be a more fitting term than "versioning" in the data science and machine learning arena.
Variants depict different forms of something coexisting, aligning more with the reality in machine learning where different dataset transformations or feature combinations aren't sequential versions, but parallel alternatives under simultaneous test and trial.
The term "variant" encapsulates the experimental ethos of machine learning and underscores the diversity between two feature sets or models, which can be vastly different rather than just incremental changes.
So while "versioning" is a staple in software development, "variant" more accurately conveys the rich diversity and non-linear progression of changes typical in the machine learning world.
Using variants, you can easily version your transformations and feature definitions.
In the example below, we’ll register a SQL-based transformation that takes a customer transaction dataset and computes the user’s average transaction.
We may want to experiment with the user’s average transaction in different windows.
In the code below, we can register the average_user_transaction with 30_day, 7_day, and 3_day variants.
Featureform supports versioning, not just for features and transformations, but all resources including:
using the same, simplified syntax shown above.
All resources allow you to set:
Have you started a new data science project or role, only to find out that in order for your pipeline to make it into production, you’d have to use a painfully assembled kludge of tools before writing a single line of SQL or Python?
Have you used a data science development tool that promised all the power of a production-grade library with the ease of a Kaggle notebook, only to feel lied to as you troubleshoot error after error trying to use their API in developing an MVP model?
Using Featureform’s auto-generated variants capability, data science practitioners can jump straight into developing and evaluating feature logic while experiencing the benefits of versioning.
Featureform’s auto-generated variants are similar to that of Docker and Github, where users can quickly create a repo or start a container without having to wrack their brains for a relevant, descriptive, and concise name.
Specifically, if resources (features, labels, training datasets, etc) don’t have a variant defined, Featureform will provide a randomly generated string as the variant name.
In order to reference an auto-generated variant, you can simply use the source name without referencing a variant.
For example: We can chain transformations and calculate statistics of the average_user_transaction without referencing a specific variant within the curly brackets.
What if you’d like to group and specifically reference variants from the same run through the script of your project?
Use the set_run() method to create named runs. Named runs can also make lineage tracing easier by allowing you to search for the variant name to get everything from that run.
By using auto-generated variants and setting session run names once, you’re less prone to mistyping or having to juggle names across notebook cells.
Just set it and forget it and feel confident that none of your work will be lost.
And by making it easy for data scientists to do the right thing of versioning their features and transformations and increasing collaboration, more time can be diverted to the hard, valuable work of developing new, innovative models and products.
When you’re finished defining your features and would like to materialize them for serving, either for training or for inference (or both!), you’ll register them with Featureform, apply the definitions for materialization, and then access the values.
We’ve named the run “experiment_jun_13”, which will be used when a variant hasn’t been provided (such as with the label) but we’ll register specific feature variants as “30_day”, “7_day”, and “3_day”.
We’ll zip together the features and labels in a training set that can be fetched to train a model.
What does the feature engineering and model development life cycle look like for an empowered data scientist using Featureform to version and document their features?
Feature versioning is crucial to unlocking the value of data for machine learning in a reproducible, collaborative manner that also supports governance and compliance.
Just as habits are the bedrock to success, seemingly innocuous best practices and capabilities like feature versioning can unlock production machine learning and data science that scales.
After this tutorial, you’ve developed a deeper understanding of:
From overviews to niche applications and everything in between, explore current discussion and commentary on feature management.