Platform initiatives, while valuable, can be expensive & complex undertakings that too often end up failing, costing data scientists & engineers time (& companies money).
Platform initiatives also take shape & grow over a long period of time, which can make measuring their impact incredibly challenging.
Ultimately ML platforms are meant to improve the velocity, volume, & value of data science work.
And that which does not get measured doesn’t improve.
The main reason the discussion around measuring the ROI of MLOps hasn’t moved forward is a lack of understanding of how defining your North Star metrics depends on the maturity of your MLOps stack.
In this post we proposed a “single-frame” framework not just for thinking about the North Star metrics your team should be using, but also for understanding how those metrics are deeply tied to the maturity of your MLOps stack and the types of questions your team should be asking.
We also suggested ways to use the framework to diagnose, strategize, and plan your personal roadmap to ML Platform Adoption Readiness.
A well-designed ML platform is meant to make the development and productionisation of machine learning products easier, better, faster, and stronger by abstracting away, automating, or removing steps in the existing process.
By freeing up time, energy, and human resources through user interfaces that template the majority of use cases, your ML platform will drive additional value for both its users and the company.
So how can you tell whether your ML Platform and MLOps initiatives are actually accomplishing these goals?
Although how to measure an engineering team is by no means a settled debate, understanding the impact of ML Platforms and MLOps initiatives seems particularly difficult and opaque.
There are a number of reasons why measuring success is tricky, many of them outside the control of ML Platform teams.
For example, the origin of ML Platforms can be messy.
Platform teams and initiatives can be kicked off through:
With that being said, there are practices that MLOps teams DO have control over and SHOULD be doing more of.
The three areas teams typically fail in are:
Let’s dig in.
A data scientist’s worst nightmare is being expected to develop models AND build infrastructure at the same time.
Developing high-performing ML systems and products, while coordinating key stakeholders across product, legal, and engineering and keeping an eye on the newest developments in data science and machine learning, is an impossible task.
The second worst nightmare is having to navigate the minefield of internal services, tools, teams, and weirdly named wrapper libraries whose names have no actual relationship to the underlying service.
Airflow-as-random-next-gen-pokemon anyone?
Kubernetes-as-a-tree?
Are we just using the random names generated by GitHub whenever you create a new repo at this point?
What does it mean to treat a “platform-as-a-product”?
When we think of “products” (according to Tom Geraghty) we think of the following characteristics:
When we apply the principles of developing products to platforms, this implies that ML Platforms should be:
Products are designed intentionally and deliberately with cohesiveness driving the user experience.
Can we really say that about our platforms?
Everyone’s met a bad Product Manager and everyone’s heard of a great one through a friend of a friend.
What makes these “mythical-unicorns-worth-their-weight-in-gold” so special other than their rarity?
A great PM is not only able to balance multiple sides of an at-times rough conversation about priorities, but also to take requirements and narrow the scope into clear, crisp terms that engineers can deliver on.
Part of the arsenal a PM wields is the ability to frame products in terms of solutions and to provide initial validation through user interviews and analytics.
And in terms of their value to the business, they’re able to take a problem, craft the product vision, and work with engineering to articulate a product strategy that aligns with the company or organization’s North Star metric, supported by product KPIs around reach, engagement, & frequency.
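To make “reach, engagement, & frequency” concrete for an internal platform, here’s a minimal sketch. The usage-event log, column names, and definitions below are our assumptions (one reasonable reading, not a standard), but they show how an ML Platform team could compute PM-style KPIs from its own telemetry:

```python
# A minimal sketch, assuming a hypothetical usage-event log with columns
# user_id, event, timestamp. Nothing here is a standard definition -- it's one
# reasonable way an ML Platform team could compute PM-style KPIs.
import pandas as pd

events = pd.DataFrame({
    "user_id": ["ana", "ana", "bo", "bo", "bo", "cy"],
    "event": ["run_pipeline"] * 6,
    "timestamp": pd.to_datetime([
        "2024-05-01", "2024-05-08", "2024-05-02",
        "2024-05-03", "2024-05-20", "2024-05-15",
    ]),
})

month = events["timestamp"].dt.to_period("M")
week = events["timestamp"].dt.to_period("W")

# Reach: distinct users touching the platform in a given month.
reach = events.groupby(month)["user_id"].nunique()

# Engagement (a stickiness proxy): average weekly actives relative to monthly actives.
stickiness = events.groupby(week)["user_id"].nunique().mean() / reach.mean()

# Frequency: average number of platform actions per active user per month.
frequency = events.groupby([month, "user_id"]).size().groupby(level=0).mean()

print(reach, stickiness, frequency, sep="\n")
```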
MLOps and ML Platform engineers are fully capable of developing similar skills and yet often don’t.
The analytics skills a PM is fluent in (and that engineers can develop) include:
Product managers are adept at asking and getting answers about how deeply entrenched their product or project is becoming.
For example, a product manager responsible for a mobile app might be concerned with:
What happens when ML platforms aren’t adopted?
The obvious answer: if the platform isn’t used to its full potential, and the promise was speedier, easier workflows for the data scientists developing models, then the status quo is maintained.
Life is exactly as it was before.
Some of the less obvious consequences and costs of a platform that doesn’t reach adoption in an organization include:
One of the extreme consequences of an MLOps or ML Platform that doesn’t reach adoption is a form of “Shadow IT”, with data scientists building their own deployment paths that may be sub-optimal, if not downright dangerously insecure.
Think of a product that you use every day, either as a data scientist, an engineer, or just a regular person.
Is this a product you love, hate, can’t quit?
Is this a product that you were introduced to at your job or maybe through word-of-mouth referral from friends?
Regardless of the source, this product or service is one you can’t live without or can’t remember the time before.
There are a number of frameworks that describe the journey of a user as they’re first introduced to a product, create an account, try it out, and eventually choose to continue using it or discard it.
One framework (The AARRR Pirate Framework) delineates the stages as: Acquisition, Activation, Retention, Referral, and Revenue.
Another framework (The Growth Loop) taught by Reforge emphasizes a compounding loop of:
If we extend the Reforge model to internal adoption of an ML Platform, we can think of the typical consumer of an ML Platform (like a data scientist) as going through a similar journey of:
An ML Platform that’s reached adoption is a platform where the component services are utilized regularly to serve the ML workflows or deployment patterns.
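As a rough illustration, here’s how that internal journey could be quantified. The stage definitions below (onboarded, first workflow run, active in the last 30 days, referred a teammate) are assumptions you’d replace with your own telemetry, not a prescribed schema:

```python
# A sketch of an internal adoption funnel for an ML platform. The stage flags
# are hypothetical; they'd be derived from your own platform telemetry.
from dataclasses import dataclass

@dataclass
class PlatformUser:
    user_id: str
    onboarded: bool          # Acquisition: got credentials / completed onboarding
    ran_first_workflow: bool # Activation: completed a first real workflow
    active_last_30d: bool    # Retention: still using the platform
    referred_teammate: bool  # Referral: brought another data scientist on board

users = [
    PlatformUser("ana", True, True, True, True),
    PlatformUser("bo", True, True, False, False),
    PlatformUser("cy", True, False, False, False),
]

def rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

onboarded = sum(u.onboarded for u in users)
activated = sum(u.ran_first_workflow for u in users)
retained  = sum(u.active_last_30d for u in users)
referred  = sum(u.referred_teammate for u in users)

print("activation rate:", rate(activated, onboarded))  # share of onboarded users who activated
print("retention rate:", rate(retained, activated))    # share of activated users still active
print("referral rate:", rate(referred, retained))      # share of retained users who referred someone
```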
Choosing the right type of metrics requires defining two things: (1) the level of maturity your MLOps stack has reached, and (2) the category of questions (the “what”) you’re actually trying to answer.
We’ll show that most of the hard work in choosing North Star metrics for your platform is in these two considerations.
Let’s walk through the different components of the framework.
Our goal in this post was to present a “single-frame” framework to help you define and measure the metrics you should be using, taking into account the maturity of your MLOps initiatives and ML Platform stack.
By “single-frame”, we wanted to make sure our framework was condensed enough that you could print it out on a single, poster-sized sheet and hang it on your wall.
In the prior sections we built up the intuition behind our framework before jumping straight to the metrics themselves.
The most useful metrics capture the desired underlying behavior; without the right context, they are useless or even detrimental.
Additionally we tried to show that the ability to measure scope and depth of impact will be limited by how mature your stack is, from the company’s generalized compute and storage choices to the ML-specific services and workflows.
Using the framework below should help your team land on candidates for your platform’s North Star metrics.
In a future post we’ll outline what we believe to be the generalized ML Platform Stack and the different patterns of that stack (especially in the time of LLMs).
For now, these are the levels of abstraction as we move through the stack:

- Layer 0: Hardware
- Layer 1: Compute Frameworks
- Layer 2: Services
- Layer 3: Workflows
- Layer 4: Platform
When we say the Hardware (Layer 0) and Compute Frameworks (Layer 1) Layers are “generalized”, we mean that they aren’t just serving data scientists; resources are shared with other teams throughout product and engineering.
The Services Layer (Layer 2) is where we start to see the emergence of tooling focused on solving ML specific problems.
A good abstraction layer encapsulates the intricacies of the lower layers without leaking complexity into the day-to-day of the target users.
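As a purely illustrative sketch of what “not leaking complexity” can look like (the names `deploy_model`, `KubernetesBackend`, and the registry/endpoint URLs are made up, not a real API), the platform-facing call hides everything the lower layers care about:

```python
# Illustrative only: a hypothetical facade an ML platform might expose so data
# scientists never touch container images, manifests, or rollout mechanics.
class KubernetesBackend:
    """Stand-in for the lower layers the abstraction encapsulates."""

    def create_deployment(self, image: str, replicas: int) -> None:
        # ...build manifests, apply them, wait for the rollout to finish...
        print(f"rolling out {image} with {replicas} replicas")

def deploy_model(model_name: str, version: str, *, replicas: int = 2) -> str:
    """What the target user sees: model name and version in, serving URL out."""
    image = f"registry.internal/models/{model_name}:{version}"  # hypothetical registry
    KubernetesBackend().create_deployment(image, replicas)
    return f"https://models.internal/{model_name}/{version}"    # hypothetical endpoint

endpoint = deploy_model("churn-classifier", "v3")
print(endpoint)
```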
The Services (Layer 2) and Workflow (Layer 3) Layers are where the stage of the ML lifecycle also begins to matter, partially due to differentiation in ML system patterns.
For example, serving forecasting predictions to be used internally might not have the same requirements as a recommendation system exposed to external customers – and yet, the same platform may need to support both models and pipelines.
As a result, measuring the internal forecasting model latency on the same scale as the RecSys product might be an unreasonable expectation.
And finally the highest layer of abstraction, the Platform Layer (Layer 4), encapsulates all the workflows and layers into a single, unified interface that codifies the development and deployment paths for data science and machine learning projects, pipelines, and systems.
This is the layer that many teams and organizations claim to have reached but have yet to truly ascend to, with most stopping at the Services or Workflow layers.
Understanding the current, highest level of abstraction is important because the Stack layer constrains the kinds of questions your team is able to ask and improve.
Why is the stack layer a constraint on measuring progress?
Let’s look at the types of activities and concerns that each layer of the stack corresponds to; ideally, we’d like to define exactly what our initiatives are making easier, better, faster, and stronger.
Specifically, the categories of problems we’re able to solve fall into the buckets of ML Infrastructure Reliability, ML Product Delivery, and ML Platform Adoption.
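One way to picture the constraint is a simple mapping from stack layer to question category. The grouping below is our reading of the framework, not a hard rule:

```python
# A condensed sketch: the highest stack layer you've genuinely reached bounds
# which categories of questions (and metrics) you can meaningfully act on.
LAYER_CATEGORY = {
    0: "ML Infrastructure Reliability",  # Hardware
    1: "ML Infrastructure Reliability",  # Compute frameworks
    2: "ML Infrastructure Reliability",  # ML-specific services
    3: "ML Product Delivery",            # Workflows
    4: "ML Platform Adoption",           # Unified platform
}

def measurable_categories(highest_layer_reached: int) -> list[str]:
    """Categories unlocked at or below the highest layer you've reached."""
    seen: list[str] = []
    for layer in range(highest_layer_reached + 1):
        if LAYER_CATEGORY[layer] not in seen:
            seen.append(LAYER_CATEGORY[layer])
    return seen

print(measurable_categories(2))  # only infrastructure-reliability questions
print(measurable_categories(4))  # reliability + delivery + adoption
```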
ML Infrastructure Reliability is about how fast, powerful, and reliable the services supporting the ML workflows are; the answers here are either binary (yes/no) or have a clear quantitative value with a hard floor or ceiling.
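To make “hard floor or ceiling” concrete, here’s a minimal sketch of the kind of check this layer invites. The request log, latency budget, and SLO values are assumptions:

```python
# A minimal sketch, assuming a hypothetical per-request log of
# (latency_ms, succeeded) pairs for a model-serving endpoint.
requests = [(12.0, True), (18.5, True), (250.0, False), (22.1, True), (15.3, True)]

latencies = sorted(ms for ms, _ in requests)
p99_latency = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]

availability = sum(ok for _, ok in requests) / len(requests)

LATENCY_BUDGET_MS = 100.0  # assumed latency budget (the "ceiling")
AVAILABILITY_SLO = 0.999   # assumed availability target (the "floor")

print(f"p99 latency: {p99_latency:.1f} ms (within budget: {p99_latency <= LATENCY_BUDGET_MS})")
print(f"availability: {availability:.3f} (meets SLO: {availability >= AVAILABILITY_SLO})")
```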
For example, common questions asked at this layer include:
From a feature engineering perspective, questions that are typically asked at this layer include:
First off, what do we mean by ML Product Delivery?
Earlier in the post we talked about the importance of applying a “Platform-As-A-Product” mindset and understanding adoption.
We also hinted that the major questions asked at the Workflow Layer (Layer 3) could be understood as those of: Risk, Velocity, & Throughput (i.e. better, faster, stronger).
Common questions asked at this level include:
From a feature engineering perspective, questions usually include:
The Workflow Layer of a stack is all about “How well are the data scientists able to perform the jobs-to-be-done at each stage of an ML project’s lifecycle?” and consequently so are the questions we’d ask.
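As a sketch of how these delivery-style questions translate into numbers, here’s a DORA-inspired calculation over a hypothetical record of model deployments. The fields and data are illustrative, not a prescribed schema:

```python
# A sketch of DORA-inspired delivery metrics computed from a hypothetical
# deployment history: (merged_at, deployed_at, caused_incident, time_to_restore).
from datetime import datetime, timedelta
from statistics import mean

deployments = [
    (datetime(2024, 5, 1, 9),  datetime(2024, 5, 2, 15), False, None),
    (datetime(2024, 5, 6, 11), datetime(2024, 5, 6, 18), True,  timedelta(hours=3)),
    (datetime(2024, 5, 13, 8), datetime(2024, 5, 14, 9), False, None),
]

# Lead Time for Changes: how long a change waits between "ready" and "serving".
lead_times = [deployed - merged for merged, deployed, _, _ in deployments]
lead_time_for_changes = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deployments that caused an incident.
change_failure_rate = mean(failed for _, _, failed, _ in deployments)

# Mean Time To Recovery: average time to restore service after a failed change.
restore_times = [t for _, _, _, t in deployments if t is not None]
mttr = sum(restore_times, timedelta()) / len(restore_times)

print("Lead Time for Changes:", lead_time_for_changes)
print("Change Failure Rate:", change_failure_rate)
print("MTTR:", mttr)
```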
However those still aren’t the same as ML Platform Adoption.
As we tried to show earlier, evaluating the adoption of a Platform is about asking questions like:
Platform Adoption for ML Platforms and MLOps initiatives need not be challenging or obscure, especially if teams are willing to learn from the realms of Product Management and Growth.
In short, when evaluating “HOW” to measure platforms, we need to understand “WHAT” we’re measuring.
Now that we’ve outlined the key considerations that must be addressed before defining our metrics of interest, how do we go about it?
Here are the overall steps that we’ve discussed:

- Treat your platform as a product and your data scientists as its users.
- Map the user journey to define what adoption looks like for your platform.
- Identify the highest layer of abstraction your stack has truly reached.
- Use that layer to determine which categories of questions (ML Infrastructure Reliability, ML Product Delivery, ML Platform Adoption) you can meaningfully ask.
- Choose North Star metric candidates that answer those questions.
By this point, the task of choosing the right metrics to track, and of ensuring that platform initiatives are progressing as expected (or at the very least aren’t regressing), should be nearly trivial.
All the hard work in picking the right North Star metrics and using them to communicate the positive impact (and ROI) was in understanding the maturity of your stack, pinning down the category of questions you need answered, and applying existing best practices and frameworks developed in other areas like marketing and product to map the data scientist’s user journey.
With that being said, here are some common metrics used by MLOps teams based on their current platform situation and pain points.
| Category of Metrics | Questions | Example Metrics |
|---|---|---|
| ML Platform Adoption | ✔️ How many teams are currently using the platform? <br> ✔️ How deeply engaged or embedded are the workflows in their projects? <br> ✔️ How much of the Slack support is provided by the owners versus the users (i.e., are users also helping users)? <br> ✔️ Are new features or enhancements being requested? | 👉 Conversion rate <br> 👉 Adoption rate <br> 👉 Feature adoption rate <br> 👉 Time to value (time to adopt) <br> 👉 Activation rate <br> 👉 Usage frequency <br> 👉 Churn rate |
| ML Product Delivery | ✔️ Does the workflow support X, Y, Z processes? <br> ✔️ Are we improving our “Mean Time To Delivery”? How long do we have to wait from the model being ready to the model being served? <br> ✔️ Are data scientists able to quickly & reliably train a model & serve it without having to fiddle with low-level APIs? <br> ✔️ Do we have abstractions over the raw infrastructure that help with reliability, re-use, and velocity? <br> ✔️ How many manual tasks are there to deploy a model, and how “painful” are they? <br> ✔️ How long does it take to fix issues? | 👉 Lead Time For Changes (LT) <br> 👉 Change Failure Rate (CFR) <br> 👉 Mean Time To Recovery (MTTR) <br> 👉 Mean Time To Restore <br> 👉 Time To Deploy <br> 👉 Manual Tasks To Deploy |
| ML Infrastructure Reliability | ✔️ How fast can predictions be served? Can we serve inference under our latency budget? <br> ✔️ Do we need to do distributed training, and are we able to? <br> ✔️ Can we build features / training sets quickly enough not to break flow? <br> ✔️ Are we able to mirror the development environment in the production environment? <br> ✔️ Do we have logging enabled? Are we able to monitor and detect data drift? <br> ✔️ Can we support PyTorch or TensorFlow models? <br> ✔️ Can we roll back models if there’s an outage? | 👉 Throughput <br> 👉 Latency <br> 👉 Availability <br> 👉 Mean Time Between Failures (MTBF) <br> 👉 Traffic (CPU utilization, memory usage, read/write I/O levels, etc.) <br> 👉 Saturation |
Interested in learning more about how Featureform can save you time and money during implementation with our virtual feature store approach?
Book a demo of the Featureform platform here!
And don't forget to check out our open-source repo!
See what a virtual feature store means for your organization.