Scaling a machine learning organization is error-prone, and there are very few successful examples to emulate. The problem lies in a lack of organization and standardization in the machine learning process. A single data scientist often tracks their notebooks, datasets, and production pipelines in an ad hoc way. They may use loosely observed naming conventions, occasionally update documentation in a spreadsheet, or try to keep track of things in their head.
As a single data scientist scales into a larger organization, these ad hoc practices break down.
MLOps tooling promises to solve these problems. It aims to bridge DevOps and the data science workflow in a way that makes all parties happy. Data scientists can work with an intuitive workflow for machine learning; DevOps and DataOps teams can be sure models will run reliably; management knows that their teams can iterate quickly.
Great MLOps tooling must increase data scientist productivity for teams, from a single-person team to a large, globally distributed enterprise. In our experience, however, most MLOps tooling focuses on one end of the spectrum or the other.
Being valuable to both a single data scientist and an enterprise organization may seem like a tall order, but we’ve already seen that it’s possible within DevOps. Git is a prime example. It’s useful for a single software engineer trying to keep track of their changes, and, for a large organization, its abstractions can be built on for CI/CD, release management, and more.
Over the last few years, we’ve observed these different modes of adoption with feature stores. The ways small teams and big enterprises use them vary vastly. But, like Git, there is a unifying thread between all stages of adoption. The ability to build upon the underlying abstractions provided allows feature stores to organically scale to any size.
Machine learning models utilize context, concepts, and events to make inferences. The way you represent these concepts and events, from customer profiles to biological molecules, will define the effectiveness of an ML model. Feature engineering refers to the process of transforming data into useful representations, also known as features, for your model to make inferences with. In short, a feature is an input to a model.
In many real-world use cases, the components of a feature are scattered across different data infrastructure and files. A feature may start at a data source like an event stream in Kafka or a Parquet file in S3. It may go through a series of transformations before being merged into a training dataset and perhaps cached for online inference. Without a feature store, all of the work of keeping these components consistent falls to the data scientist. This process is error-prone, slow, and can get very messy. We break down the components of a feature further in a different article.
A feature store allows a data scientist to stop thinking of features as rows in a database and transformations in Spark, and to start treating them as named, versioned entities. Rather than joining multiple columns to create a table, they can zip together features and labels into a named training set.
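To make that mental model concrete, here is a minimal, purely illustrative sketch: a toy in-memory registry (not any real feature store's API) in which features and labels are named entities that get zipped into a named training set.

```python
# Toy, illustrative registry: features and labels are named entities,
# and a training set is defined by referencing those names.
# This is a hypothetical sketch, not a real feature-store API.

class Registry:
    def __init__(self):
        self.features = {}       # name -> list of values
        self.labels = {}         # name -> list of values
        self.training_sets = {}  # name -> (feature names, label name)

    def register_feature(self, name, values):
        self.features[name] = values

    def register_label(self, name, values):
        self.labels[name] = values

    def register_training_set(self, name, feature_names, label_name):
        self.training_sets[name] = (feature_names, label_name)

    def training_set(self, name):
        """Zip the named features and label into (feature_row, label) pairs."""
        feature_names, label_name = self.training_sets[name]
        columns = [self.features[f] for f in feature_names]
        rows = zip(*columns)
        return [(list(r), y) for r, y in zip(rows, self.labels[label_name])]

reg = Registry()
reg.register_feature("avg_txn_amount", [10.0, 250.0])
reg.register_feature("account_age_days", [30, 900])
reg.register_label("is_fraud", [1, 0])
reg.register_training_set("fraud_v1",
                          ["avg_txn_amount", "account_age_days"], "is_fraud")
print(reg.training_set("fraud_v1"))
# [([10.0, 30], 1), ([250.0, 900], 0)]
```

The point is the indirection: the model consumer asks for `fraud_v1` by name instead of hand-joining columns, so the same definition can be regenerated or shared later.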
There are three feature store architectures: the literal, physical, and virtual feature stores. For the rest of this article, we will only be talking about virtual feature stores. We think it's the only architecture that can be valuable across all stages of feature store adoption.
A feature store allows users to define, manage, and share their features, labels, and training sets. This can be as simple as versioning for a single data scientist, and as complicated as multi-stage governance for large enterprises. Though the value of a feature store is providing a framework to define, manage, re-use, and serve features, the actual workflow varies dramatically across data science teams of different sizes. The rest of this article will outline the five stages of feature store usage that we’ve seen.
The simplest machine learning “organization” is a single data scientist working locally on their laptop. They often work within notebooks, and spend much of their time analyzing and transforming their existing datasets. They train their models locally with the training set they generate, tune their hyperparameters, and eventually output a final file or maybe just print a few numbers.
At this stage, the feature store’s main function is to organize work. The feature store acts as a repository to store and version ML primitives like transformations, features, labels, and training sets. Rather than storing the data, the feature store stores the definition so it’s able to generate the primitives from scratch.
In this process, notebooks and feature stores work together hand in hand. Notebooks are where the majority of data science work gets done. They are also used to document and share a data scientist’s learnings and past explorations. Tools like JupyterHub and Colab are used to keep track of notebooks and make it easier to share and run them. Within a notebook, a data scientist will find useful transformations, features, and training sets. The definitions of these resources can be pushed and centralized into a feature store. By doing this, a data scientist can then reference a feature by name, re-use it later in another project or notebook, share it with other data scientists, or regenerate it at a later point in time.
Store and Organize Transformations, Features, and Training Sets in a Purpose-Built Repository
A feature store allows a data scientist to keep and document all their features, labels, and training sets in one place. Without a feature store, features are split across many different notebooks or files. They may be strewn across a file system with names like “model_final_final_v7.ipynb”. Feature stores centralize these resource definitions, and make them searchable and referenceable.
Keep a History of Changes
Feature engineering is an inherently iterative process. Features are constantly being created, changed, and deprecated during the exploratory phases. With a feature store, all of these updates are logged. This history of changes can be useful when a data scientist wants to roll back a feature, compare features with each other, or view how their training sets evolved over time.
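The change log described above can be sketched with a toy append-only version history (again a hypothetical illustration, not a real API): every update creates a new version, and any earlier definition can be inspected or rolled back to.

```python
# Hypothetical sketch: every feature update is logged as a new version,
# so earlier definitions can be compared or rolled back at any time.

class VersionedFeature:
    def __init__(self, name):
        self.name = name
        self.versions = []  # append-only history of definitions

    def update(self, definition):
        self.versions.append(definition)
        return len(self.versions)  # version number, 1-indexed

    def latest(self):
        return self.versions[-1]

    def at_version(self, v):
        return self.versions[v - 1]

f = VersionedFeature("avg_txn_amount")
f.update({"window": "7d", "agg": "mean"})
f.update({"window": "30d", "agg": "mean"})

assert f.latest()["window"] == "30d"
# "Rolling back" is just referencing an earlier, still-available version:
assert f.at_version(1)["window"] == "7d"
```

Because history is append-only, comparing two iterations of a feature is a lookup rather than an archaeology exercise through old notebooks.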
Be Able to Progress to a Larger Organization
Implementing a feature store early on makes it easier to add more data scientists to a project and simplifies deploying features into production later. It also keeps all your projects at a similar level of organization and discipline.
The second stage of adoption still involves only a single data scientist, but now they deploy their models into production. At this stage, the feature store provides all the organizational value from the first stage, and also provides a standard workflow to serve and monitor their features in production. The monitoring and metrics are critical at this phase. The features are no longer used temporarily by a local model. These features must be continuously updated and ready to be served in production.
Experimentation still closely resembles the first stage. Data is analyzed and features are created and tested in notebooks. When features or transformations prove to be interesting or useful, they are pushed to the feature store for later use in other notebooks or experiments.
When a feature is ready to be deployed to production, it can be pushed to a set of data infrastructure providers. This allows experimental features and production features to exist in tandem, with a unified interface to access them.
A data scientist’s job does not end once a feature is in production; they have to maintain and monitor it. A feature store allows a data scientist to react to infrastructure failures, and to proactively handle feature drift before it causes adverse effects.
A feature store simplifies deploying features into production. A data scientist connects their data infrastructure providers and defines their transformations. The feature store then uses these definitions to get their infrastructure into the desired state. This increases feature reliability in production and decreases iteration time.
Once features are deployed, the feature store monitors them for both data drift and infrastructure degradation. This includes throughput, latency, and reachability on the infrastructure side. On the data side, distributions are monitored and anomalous data is flagged. This allows machine learning teams to proactively catch problems and maintain high model performance.
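One simple form such a distribution check might take is comparing a live statistic against its training-time baseline. The statistic (mean) and threshold below are arbitrary choices for illustration; real monitoring would use richer distribution tests.

```python
# Illustrative drift check: flag a feature if its live mean drifts
# more than a relative tolerance away from the training-time mean.
# The statistic and threshold are arbitrary choices for this sketch.

def mean(xs):
    return sum(xs) / len(xs)

def drifted(training_values, live_values, tolerance=0.25):
    """Return True if the live mean moved more than `tolerance` (relative)."""
    base = mean(training_values)
    if base == 0:
        return mean(live_values) != 0
    return abs(mean(live_values) - base) / abs(base) > tolerance

training = [10.0, 12.0, 11.0, 9.0]   # mean 10.5
steady   = [10.5, 11.0, 10.0, 10.5]  # mean 10.5 -> within tolerance
shifted  = [20.0, 22.0, 21.0, 19.0]  # mean 20.5 -> flagged

assert not drifted(training, steady)
assert drifted(training, shifted)
```

In practice a feature store would run checks like this continuously against serving traffic and alert before model quality degrades.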
Unified Experimental and Production Features
A feature store unifies experimental and production features in a single repository. Rather than jumping between Airflow jobs and notebooks, feature definitions are grouped together logically and can be accessed via the same interface. This reduces the amount of context switching, results in less code rewriting for production, and ultimately increases productivity.
As more data scientists are added, and a team is formed, communication becomes critical. At this stage, all of the value around organization and deployment from the past stage increases with every newly added data scientist. Additionally, the collaboration features begin to shine. Since everything is standardized and centralized, machine learning resources can be reused. The value of machine learning work begins to compound.
In this workflow, the experimentation stage is enriched through search and discovery. All the different transformations, data sources, features, labels, and training sets that other data scientists on the team have created can be reused and expanded upon.
A data scientist can open the feature store dashboard and look through different models and data sources. They can answer questions like: “Which features does our fraud detection model use?” “How have those features changed over time?” “What data sources were those features created from?”. Having this ability makes onboarding data scientists far easier, and removes single points of failure that arise when depending on tribal knowledge.
Enhanced Communication & Collaboration
The feature store provides a universal language for machine learning resources that simplifies collaboration and communication across the team. By having an enforceable standard for defining features, you can guarantee that all features are named, versioned, and documented. Standardized definitions also make re-using another data scientist’s resources just as easy as using your own. Furthermore, you can derive new features from another team member’s existing resources. Because definitions are immutable, you can be sure that upstream changes won’t break your features. This cuts down on code copying, and makes reuse safer.
Discover & Understand Each Other's Work
The feature store allows all the data scientists on a team to look through all the ML resources that an organization owns and the interdependencies between them. They can find the right data sources, features, and transformations they need to maximize their model’s performance. They can do all of this without having to set up endless meetings hoping to serendipitously land on these insights. This also keeps a team from building duplicate data pipelines that create the same features.
Team-Wide Change Log
The feature store acts as a log of all changes made and all resources created on a team. This makes onboarding new data scientists far easier, and makes sure that no work is lost when others offboard. Rather than relying on tribal knowledge, a team can be sure that their process is safe from single points of failure.
Communication cost explodes when we move from a single team to multiple. At this point, a feature store moves from being valuable to being critical. The feature store acts as the data operating system between the different machine learning teams. As teams are formed and changed, machine learning resources will persist. As new data scientists come in, they have access to all the resources they need to onboard. As others leave, an organization can make sure that no work is lost. Resources like embeddings, transformations, and training sets can be exported from one team and safely re-used by others.
At this stage, data infrastructure begins to vary across teams. The feature store maintains a unified interface to define and use resources even across the heterogeneous infrastructure. Transformation code will change according to the underlying compute provider, but all the other metadata remains uniform.
This workflow begins with a machine learning organization registering all of their data infrastructure as providers in the feature store. As data scientists create features, they can specify where the transformations are run, and where the final features and training sets are stored. The feature store meshes together all of the heterogeneous infrastructure under one abstraction.
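The provider abstraction described above can be sketched as follows. Every name and field here is hypothetical, chosen only to illustrate the idea: backends are registered once, and feature definitions reference them by name rather than hard-coding infrastructure.

```python
# Hypothetical sketch of the provider abstraction: heterogeneous backends
# are registered once, and feature definitions only reference them by name.

class FeatureStore:
    def __init__(self):
        self.providers = {}
        self.features = {}

    def register_provider(self, name, kind):
        self.providers[name] = {"kind": kind}

    def register_feature(self, name, compute_on, store_in, transform):
        # The data scientist picks *where* the transformation runs and
        # where the result lives; the store maps that onto infrastructure.
        if compute_on not in self.providers or store_in not in self.providers:
            raise ValueError("unknown provider")
        self.features[name] = {
            "compute": compute_on,
            "storage": store_in,
            "transform": transform,
        }

fs = FeatureStore()
fs.register_provider("spark-cluster", kind="offline-compute")
fs.register_provider("redis-cache", kind="online-store")
fs.register_feature(
    "user_avg_purchase",
    compute_on="spark-cluster",
    store_in="redis-cache",
    transform="SELECT user_id, AVG(amount) FROM purchases GROUP BY user_id",
)

assert fs.features["user_avg_purchase"]["storage"] == "redis-cache"
```

Because only the transformation body is provider-specific, the rest of the metadata stays uniform across teams no matter what infrastructure each one runs.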
When a data scientist creates a feature, they should specify visibility. If all machine learning resources were globally visible, the resource catalogs would be too noisy to be useful. Some organizations have specific teams whose whole function is to create and export features and training sets for the rest of the organization to use. This is exceedingly common with things like embeddings.
Since transformation logic is immutable, data scientists can safely use another team's resources without worrying about their upstream changing. Furthermore, since some models and use-cases require different flavors of similar features, new variants can be created from existing ones. Those variants also have specific visibility, to avoid clogging up the namespace.
Hierarchical Search & Discovery
Some features are relevant only to a specific data scientist, others are relevant only to a single team, and some are relevant globally. Data scientists and teams can specify each resource’s visibility to optimize for discoverability across the organization.
As an organization grows to multiple teams working on different problem spaces, data scientists may require different variants of similar features. For example, at a social media company a “monthly revenue” feature that a finance team uses may differ from the one the ad serving team uses. Variants allow for data scientists to tailor their features for their use cases while maintaining the organizational value of having their metadata grouped together.
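The variant idea can be sketched as a registry keyed on (name, variant), with immutability enforced at registration time. The names and definitions below are illustrative only.

```python
# Sketch of feature variants: one logical feature name carries multiple
# immutable variants, so teams tailor features for their use cases while
# the shared name keeps the metadata grouped together.

class VariantRegistry:
    def __init__(self):
        self._defs = {}  # (name, variant) -> definition

    def register(self, name, variant, definition):
        key = (name, variant)
        if key in self._defs:
            # Definitions are immutable: a change requires a new variant.
            raise ValueError(f"{name}:{variant} already exists")
        self._defs[key] = definition

    def get(self, name, variant):
        return self._defs[(name, variant)]

reg = VariantRegistry()
reg.register("monthly_revenue", "finance", {"source": "ledger"})
reg.register("monthly_revenue", "ads", {"source": "ad_events"})

assert reg.get("monthly_revenue", "ads")["source"] == "ad_events"
try:
    reg.register("monthly_revenue", "ads", {"source": "other"})
except ValueError:
    pass  # immutability enforced: existing variants can't be overwritten
```

Immutability is what makes cross-team reuse safe: a downstream consumer of `monthly_revenue:finance` knows its definition cannot change out from under them.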
Handle Heterogeneous Infrastructure
Different teams in an organization will have different requirements of their infrastructure. Some will optimize for simplicity, some for correctness, and others for speed. A feature store allows teams to use the infrastructure that’s optimized for their use cases, while providing a unified interface for defining and serving the resources across it.
An enterprise machine learning organization typically comes with much stricter regulations and tighter standards. These standards are not uniform across different teams, geographies, and models. In the previous stage, data scientists would set visibility and access control to optimize for organization-wide productivity. In this stage, access control, visibility, and governance are paramount and far more complicated. They cannot be left to “best effort” and require a new role in the organization: the Feature Store Administrator.
Feature store administrators configure the feature store for their organization. They create a set of user roles, geographic rules, and model categories. They can also encode workflows and clearance checks. For example, a bank may require all features related to loan decisions to go through the legal department. Essentially, a feature store administrator encodes all enterprise standards and regulations into the feature store.
Data scientists can now use the feature store like they are used to, without having to memorize and implement all required regulation. The feature store will enforce it automatically. They won’t be able to see or use features they don’t have access to, and the feature store will fail to serve features to models that aren’t cleared for them. Other specialized workflows, like going through legal, will automatically be triggered by the feature store. Other than purposeful red tape, a data scientist can experiment and deploy features as fast as they could in the previous stage once the feature store is properly provisioned with governance rules.
An enterprise will have a variety of different rules around access control of their data and features. The rules may be based on the data scientist’s role, the model’s category, or dynamically based on a user’s input data (i.e. they are in Europe and subject to GDPR). All of these rules can be specified, and the feature store will enforce them.
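A rule engine of that kind might look like the sketch below. The rule kinds, roles, and fields are all hypothetical; the point is that the store evaluates every rule attached to a feature before serving it.

```python
# Illustrative sketch of rule-based feature access: the store checks the
# rules attached to a feature before serving it. All rule kinds, roles,
# and field names here are hypothetical.

def can_serve(feature, user, request):
    for rule in feature.get("rules", []):
        if rule["kind"] == "role" and user["role"] not in rule["allowed"]:
            return False
        if rule["kind"] == "region" and request["region"] in rule["blocked"]:
            return False
    return True

loan_feature = {
    "name": "credit_utilization",
    "rules": [
        {"kind": "role", "allowed": {"risk-analyst"}},
        {"kind": "region", "blocked": {"eu"}},  # e.g. a GDPR restriction
    ],
}

assert can_serve(loan_feature, {"role": "risk-analyst"}, {"region": "us"})
assert not can_serve(loan_feature, {"role": "marketer"}, {"region": "us"})
assert not can_serve(loan_feature, {"role": "risk-analyst"}, {"region": "eu"})
```

Note the third check: the same user is denied based purely on the request's region, which mirrors the dynamic, input-based rules described above.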
Auditing and Compliance
The feature store acts as the bridge that connects data to models. It can log all changes, data access, and other information needed to maintain compliance.
In an enterprise environment, machine learning teams often interface with other teams outside of the data science department, like legal and marketing. The workflows required can be codified into the feature store to streamline the process and avoid any human error.
Feature stores are still a new category in the MLOps stack. No feature store on the market, or built internally at an organization, is fully featured across all the stages of the adoption process. We’ve seen successful deployments and implementations of feature stores from single-person teams to large enterprises. In the future, feature stores may be as ubiquitous among data scientists as Git is among software engineers. There is still a lot to build to get there, and at Featureform, we’re using the framework above to prioritize our roadmap from our open-source project to our enterprise solution.