Measuring Your ML Platform's North Star Metrics

Mikiko Bazeley
August 7, 2023

That Which Can’t Be Measured

Why does the question of how to measure your ML platform matter? 

Platform initiatives, while valuable, can be expensive & complex undertakings that end up failing, costing data scientists & engineers time (& companies money). 

Platform initiatives also take shape & grow over a long period of time, which can make measuring their impact incredibly challenging. 

(Depicted: MLOps Initiatives Being Approved Months After Original Team Is Gone)

Ultimately ML platforms are meant to improve the velocity, volume, & value of data science work. 

And that which does not get measured doesn’t improve. 

The main reason there hasn’t been significant movement in the discussion around how to measure the ROI of MLOps is the lack of understanding of the dependent relationship between defining North Star metrics and the maturity of the MLOps stack.

Goals for this blog post? 

In this post we propose a “single-frame” framework for thinking not just about the North Star metrics your team should be using, but also about how those metrics are deeply tied to the maturity of your MLOps stack and the types of questions your team should be asking.

We also suggest ways to use the framework to diagnose, strategize, and plan your personal roadmap to ML Platform Adoption Readiness. 


Is Your ML Platform Actually Useful? 

A well-designed ML platform is meant to:

  1. Reduce Data Scientist Cognitive Load To Create A Streamlined Path To Model Deployment
  2. Reduce Engineering Operational Burden To Unblock & Accelerate New Model & Product Development
  3. Increase Company Revenue By Optimizing Data Science Product Flow

(Source: Reinterpretation, Design ⭐)

In other words, the entire point of MLOps Platforms and initiatives is to make the development and productionization of machine learning products:

  • Faster (aka take less time);
  • More Productive (aka get more, higher value stuff done using the same amount of time);
  • Less Risky (aka decrease the chances of bad things happening and people making mistakes) 

By:

  • Removing
  • Automating
  • Codifying

Steps in the existing process.

(Source: Us, Design ⭐)

By freeing up time, energy, and human resources through the creation of user interfaces that template the majority of use cases, your ML platform will drive additional value for both users and company.

So how can you tell whether your ML Platform and MLOps initiatives are actually accomplishing these goals?


Why ML Platform Teams Have Trouble Measuring Impact 

Although how to measure an engineering team is by no means a settled debate, understanding the impact of ML Platforms and MLOps initiatives seems to be particularly difficult and opaque.

And there are a number of reasons why measuring success is tricky, many of them outside the control of ML Platform teams. 

(Depicted: An MLOps Engineer Starting Their First Day Of A New Job)

For example, the origin of ML Platforms can be messy. 

ML Platform teams and organizations can be kicked off through:

  • Adoption of an existing tool or system by a different team, followed by scope creep;
  • Being split off from the data science team;
  • Being stood up as a newly mandated, centralized MLOps team; or
  • A distributed set of different teams coalescing around the ML lifecycle due to their interactions with the data science team. 

With that being said, there are practices that MLOps teams DO have control over and SHOULD be doing more of. 

The three areas teams typically fail in are:

  1. Not applying a “platform-as-a-product” mindset when designing and architecting their platforms;
  2. Neglecting fluency in the kinds of analysis and analytics that product managers are famous for; and 
  3. Overlooking the user journey and ignoring the last, major milestone that platform teams should strive for: adoption. 

Let’s dig in. 

Not Applying A “Platform-As-A-Product” Mindset 

A data scientist’s worst nightmare is being expected to develop models AND build infrastructure at the same time. 

Developing high-performing ML systems and products, while coordinating multiple key stakeholders (including product, legal, and engineering) and keeping an eye on the newest developments in the world of data science and machine learning, is already an impossible task. 

The second-worst nightmare is having to navigate the minefield of internal services, tools, teams, and weirdly named wrapper libraries whose names have no actual relationship to the underlying service. 

Airflow-as-random-next-gen-pokemon anyone? 

Kubernetes-as-a-tree? 

Are we just using the random names generated by GitHub whenever you create a new repo at this point? 

(Source: Krazam, "Microservices")

What does it mean to treat a “platform-as-a-product”?

When we think of “products” (according to Tom Geraghty) we think of the following characteristics:

  • Products have users/customers;
  • They do what they need to (no more, no less, at least initially);
  • Are long-lived;
  • Are owned; and
  • Evolve.

(Source: Us, Design & Marketing ⭐)

When we apply the principles of developing products to platforms, this implies that ML Platforms should be: 

  • Composable;
  • Self-service;
  • Quick & easy to start;
  • Up-to-date;
  • Attractive.

Products are designed intentionally and deliberately with cohesiveness driving the user experience. 

Can we really say that about our platforms sometimes? 

Neglecting PM-Level Analytic Fluency

Everyone’s met a bad Product Manager and everyone’s heard of a great one through a friend of a friend. 

What makes these “mythical-unicorns-worth-their-weight-in-gold” so special other than their rarity? 

A great PM is not only able to balance multiple sides of an at-times rough conversation about priorities, but also able to take requirements and narrow the scope down into clear, crisp terms that engineers can deliver on.

Part of the arsenal that a PM wields is being able to frame products in terms of solutions and provide initial validation using user interviews and analytics. 

And in terms of their value to the business, they’re able to take a problem, craft the product vision, and work with engineering to articulate a product strategy that aligns with the company or organization’s North Star metric, supported by product KPIs around reach, engagement, & frequency. 

(Source: Cyanide & Happiness via Know Your Meme)

MLOps and ML Platform engineers are fully capable of developing similar skills and yet often don’t. 

The analytics skills a PM is fluent in (and that engineers can develop) include: 

  • Understanding the different analytics frameworks (like the North Star + One Metric That Matters Framework, the AARRR Pirate Framework, the Lean Analytics Framework, OKRs versus KPIs, etc);
  • Understanding the characteristics of good metrics, like being actionable, understandable, and comparative;
  • Understanding how to structure metrics by selecting a set of metrics, breaking out the inputs from the outputs, and understanding trade-offs.

Product managers are adept at asking and getting answers about how deeply entrenched their product or project is becoming. 

(Source: Work Chronicles)

For example, a product manager responsible for a mobile app might be concerned with:

  1. How many users are downloading and installing the app?
     • Common Metrics: App downloads, installation rates, and app store conversion rates.
  2. Are users actively engaging with the app?
     • Metrics: Daily, weekly, or monthly active users (DAU, WAU, or MAU), session duration, and screen flow analysis to understand user behavior within the app.
  3. Are users satisfied with the app's performance?
     • Metrics: App store ratings and reviews, user surveys, and feedback collected through in-app feedback mechanisms.
  4. Are users completing desired actions or conversions?
     • Metrics: Conversion rates for specific actions such as sign-ups, purchases, or other key performance indicators (KPIs) that align with your app's objectives.
  5. Is the app generating revenue or meeting business goals?
     • Metrics: Revenue generated from in-app purchases, advertising, or other monetization methods, as well as metrics like average revenue per user (ARPU) or customer lifetime value (CLTV).
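As a rough, hypothetical illustration (not tied to any particular product), the sketch below shows how a few of these engagement metrics could be computed from an event log with user_id, event_name, and timestamp columns; the schema and event names are assumptions made for the example.

```python
# Minimal sketch: PM-style engagement metrics from a hypothetical event log.
# Assumes a pandas DataFrame with columns: user_id, event_name, timestamp.
import pandas as pd

def engagement_summary(events: pd.DataFrame, as_of: pd.Timestamp) -> dict:
    """Compute DAU/WAU/MAU and a simple install-to-sign-up conversion rate."""
    def active_users(days: int) -> int:
        window = events[events["timestamp"] > as_of - pd.Timedelta(days=days)]
        return window["user_id"].nunique()

    installs = events.loc[events["event_name"] == "app_install", "user_id"].nunique()
    signups = events.loc[events["event_name"] == "sign_up", "user_id"].nunique()

    return {
        "DAU": active_users(1),
        "WAU": active_users(7),
        "MAU": active_users(30),
        # Share of installers who completed the sign-up action.
        "install_to_signup_rate": signups / installs if installs else 0.0,
    }
```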

Overlooking The User Journey’s Last Mile, Adoption

What happens when ML platforms aren’t adopted?

The obvious answer: if the platform isn’t used to its full potential, then the promised speedier and easier workflows for the data scientists developing models never materialize, and the status quo is maintained. 

Life is exactly as it was before. 

(Source: Monkey User)

Some of the less obvious consequences and costs of a platform that doesn’t reach adoption in an organization include:

  • Wasted resources: The investment of time, money, and developer resources spent on building the platform may go to waste, resulting in a poor return on investment (ROI). This can have financial implications for the organization, as well as a loss of valuable developer hours that could have been utilized elsewhere.
  • Hindered productivity: If the platform fails to gain adoption, developers may continue to use older, less efficient methods, resulting in lower productivity and missed opportunities for innovation. It can hinder cross-team collaboration and impede the sharing of best practices, code libraries, and other resources.
  • Decreased morale: If developers have invested time and effort into building or adopting the platform but see minimal or no uptake, it can lead to frustration and decreased morale among the development team. This can impact overall motivation and employee satisfaction, potentially leading to higher turnover rates or reduced enthusiasm for future initiatives.

One of the extreme consequences of an MLOps or ML Platform that doesn’t reach adoption is a form of “Shadow IT”, with data scientists building their own deployment paths that may be sub-optimal, if not downright dangerously insecure. 


An Introduction To Product Thinking & Adoption

What is “adoption”? 

Think of a product that you use every day, either as a data scientist, an engineer, or a regular person. 

Is this a product you love, hate, can’t quit? 

Is this a product that you were introduced to at your job or maybe through word-of-mouth referral from friends? 

Regardless of the source, this product or service is one that you can’t live without or don’t remember the time before. 

There are a number of frameworks to describe the journey of a user as they’re first introduced to a product, create an account, try it out, and eventually choose to continue using it or discard it.

One framework (The AARRR Pirate Framework) delineates the stages as:

  • Acquisition/Awareness – User finds out about the product or service; 
  • Activation – The user takes an initial step towards using the product;
  • Retention – The users keep using the product;
  • Referral – The users encourage their friends to use it;
  • Revenue – The users receive so much value from the product that they decide to pay for it.

(Source: What is the Pirate Funnel (AARRR framework) and how to apply it in 5 quick steps? By Ward van Gasteren)
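To make the funnel concrete, here is a small, purely illustrative sketch (our own, not part of the AARRR framework itself) that tags each user with the furthest stage they have reached and computes stage-to-stage conversion:

```python
# Illustrative sketch: stage-to-stage conversion through an AARRR-style funnel.
# The mapping of users to their furthest stage is assumed to exist already.
from collections import Counter

STAGES = ["acquisition", "activation", "retention", "referral", "revenue"]

def funnel_conversion(user_stages: dict[str, str]) -> dict[str, float]:
    """For each stage transition, the share of users who progressed."""
    reached = Counter()
    for stage in user_stages.values():
        # A user who reached stage i has implicitly passed through stages 0..i.
        for s in STAGES[: STAGES.index(stage) + 1]:
            reached[s] += 1

    return {
        f"{prev} -> {curr}": (reached[curr] / reached[prev] if reached[prev] else 0.0)
        for prev, curr in zip(STAGES, STAGES[1:])
    }

# Example: three users, each tagged with the furthest stage they reached.
print(funnel_conversion({"u1": "activation", "u2": "retention", "u3": "revenue"}))
```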

Another framework (The Growth Loop) taught by Reforge emphasizes a compounding loop of:

  • Retention & Engagement 
  • Monetization
  • Acquisition

(Source: Reforge)

Where Does Adoption Fit With The Data Science Lifecycle?   

If we extend the Reforge model to internal adoption of an ML Platform, we can think of the typical consumer of an ML Platform (like a data scientist) as going through a similar journey of:

  • Awareness (Acquisition): Discovering the ML Platform through an all-hands, a team office hour, or an announcement through Slack.
  • Activation: Maybe the Platform team has created a new templating tool that quickly sets up a Dockerized development environment for model training and development. The data scientist tries out the new feature or service from the ML Platform team with a toy project or even a past project they worked on.
  • Aha Moment (Engagement): The data scientist experiences the improved efficiencies of the new process, comparing it against their prior experiences (maybe just last week) of trying to get a new machine learning project started. 
  • Habit Moment (Engagement): The data scientist begins to regularly use the templating tool except for a very specific ML architecture that still needs to be manually configured. 

An ML Platform that’s reached adoption is a platform whose component services are utilized regularly to serve ML workflows or deployment patterns. 

(Source: Us, Design & Marketing ⭐)


How ML Platforms Should Be Measured: The ML Platform Adoption Framework (MPAF) Way 

Choosing the right type of metrics requires defining:

  • The current highest level of abstraction represented in your ML platform;
  • The baseline and desired behaviors. 

We’ll show that most of the hard work in choosing North Star metrics for your platform is in these two considerations. 

Let’s walk through the different components of the framework.

(Source: Us, Design & Marketing ⭐)

What is the ML Platform Adoption Framework? 

Our goal in this post is to present a “single-frame” framework to help you define and measure the metrics you should be using, taking into account the maturity of your MLOps initiatives and ML Platform stack.

By “single-frame”, we wanted to make sure our framework was condensed enough that you could print it out on a single, poster-sized sheet and hang it on your wall. 

In the prior sections we built up the intuition behind our framework before jumping straight to the metrics themselves. 

The most useful metrics ultimately capture the desired underlying behavior; they are otherwise useless or even detrimental without the right context.

Additionally we tried to show that the ability to measure scope and depth of impact will be limited by how mature your stack is, from the company’s generalized compute and storage choices to the ML-specific services and workflows. 

Using the framework below should help your team land on candidates for your platform’s North Star metrics. 

(Source: Us, Design & Marketing ⭐)

Consideration 1: The Current, Highest Level of Platform Abstraction

In a future post we’ll outline what we believe to be the generalized ML Platform Stack and the different patterns of that stack (especially in the time of LLMs). 

For now, these are the levels of abstraction as we move through the stack:

  • Layer 4: Platform - Which includes frameworks and orchestrators like ZenML, Kubeflow, Flyte, etc.
  • Layer 3: Workflow - Which includes tools like Featureform, CometML, MLflow, Weights & Biases, WhyLabs, etc. 
  • Layer 2: Services - Which includes services like Redis, Spark, Snowflake, DuckDB.
  • Layer 1: Compute Frameworks - Which includes generalized computing & orchestration frameworks used by the whole company, including Kubernetes, Ray, Banana.dev, Modal, etc.
  • Layer 0: Hardware - Which includes generalized hardware and storage. 

(Source: Us, Design & Marketing ⭐)

When we say the Hardware (Layer 0) and Compute Frameworks (Layer 1) Layers are “generalized”, we mean that they aren’t just serving data scientists; instead, resources are being shared with other teams throughout product and engineering. 

The Services Layer (Layer 2) is where we start to see the emergence of tooling focused on solving ML specific problems. 

A good abstraction layer encapsulates the intricacies of the lower layers without leaking complexity into the day-to-day of the target users. 

(Source: Us, Design & Marketing ⭐)

The Services (Layer 2) and Workflow (Layer 3) Layers are where the stage of the ML lifecycle also begins to matter, partially due to differentiation in ML system patterns. 

For example, serving forecasting predictions to be used internally might not have the same requirements as a recommendation system exposed to external customers – and yet, the same platform may need to support both models and pipelines. 

As a result, measuring the internal forecasting model latency on the same scale as the RecSys product might be an unreasonable expectation. 

And finally the highest layer of abstraction, the Platform Layer (Layer 4), encapsulates all the workflows and layers into a single, unified interface that codifies the development and deployment paths for data science and machine learning projects, pipelines, and systems. 

This is the layer that many teams and organizations claim to have reached but have yet to truly ascend to, with most stopping at the Workflow or Services layers. 

Understanding the current, highest level of abstraction is important because the stack layer constrains the kinds of questions your team is able to ask and the behaviors it is able to improve.

(Source: Us, Design & Marketing ⭐)

Consideration 2: Defining Target Activities

Why is the stack layer a constraint on measuring progress? 

Let’s understand the types of activities and concerns that each layer of the stack corresponds to, since ideally we’d like to define exactly what our initiatives are making easier, better, faster, stronger.

Specifically the categories of problems we’re able to solve fall into the buckets of: 

  • Platform Adoption – Corresponding to the ML Platform Layer (Layer 4);
  • ML Product Delivery – Corresponding to the ML Workflow Layer (Layer 3); 
  • Infrastructure Reliability – Corresponding to the Hardware (Layer 0), Compute Frameworks (Layer 1), and ML Services (Layer 2) Layers.  

(Source: Monkey User)

What do we mean by ML Infrastructure Reliability concerns? 

ML Infrastructure Reliability is about how fast, powerful, and reliable the services supporting the ML workflows are; the answers at this layer are either binary (Yes/No) or have a clear quantitative value with a hard floor or ceiling. 

(Source: Us, Design ⭐)

For example, common questions asked at this layer include: 

  • How fast can predictions be served? Can we serve at inference under our latency budget?  
  • Do we need to do distributed training and are we able to? 
  • Can we build features / training sets in an acceptable amount of time to not break flow? 
  • Are we able to mirror the development environment in the production environment?
  • Do we have logging enabled? Are we able to monitor and detect data drift?
  • Can we support PyTorch or TensorFlow models? 
  • Can we roll back models if there’s an outage? 
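For instance, the latency question lends itself to a simple measurement harness. The sketch below is purely illustrative: the endpoint URL, payload, and the 100 ms p95 budget are assumptions for the example, not values from this post.

```python
# Illustrative sketch: check whether a (hypothetical) prediction endpoint
# stays under an assumed p95 latency budget.
import time
import statistics
import requests

LATENCY_BUDGET_MS = 100                       # assumed SLO for p95 latency
ENDPOINT = "http://localhost:8080/predict"    # hypothetical serving endpoint

def p95_latency_ms(payload: dict, n: int = 200) -> float:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(ENDPOINT, json=payload, timeout=5)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=100)[94]  # 95th percentile

if __name__ == "__main__":
    p95 = p95_latency_ms({"features": [0.1, 0.2, 0.3]})
    status = "OK" if p95 <= LATENCY_BUDGET_MS else "OVER BUDGET"
    print(f"p95 = {p95:.1f} ms, budget = {LATENCY_BUDGET_MS} ms, {status}")
```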

 

From a feature engineering perspective, questions that are typically asked at this layer include:

  • How fresh is this feature?
  • What’s the latency to serve this feature?
  • What’s this feature’s SLA?
  • Is this feature created from a stream?
  • Can we run a nearest neighbor lookup on this embedding?
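The freshness and SLA questions can also be made concrete. Here is a hedged sketch, assuming a hypothetical feature table with feature_name and a timezone-aware (UTC) feature_timestamp column; the one-hour SLA is an assumption for illustration only.

```python
# Illustrative sketch: answer "how fresh is this feature?" against an assumed SLA.
from datetime import datetime, timezone, timedelta
import pandas as pd

FRESHNESS_SLA = timedelta(hours=1)  # assumed SLA: latest value must be < 1h old

def freshness_report(feature_rows: pd.DataFrame) -> pd.DataFrame:
    """Age of the most recent value per feature, compared against the SLA."""
    now = datetime.now(timezone.utc)
    latest = feature_rows.groupby("feature_name")["feature_timestamp"].max()
    age = now - latest  # assumes feature_timestamp is timezone-aware UTC
    return pd.DataFrame({"age": age, "within_sla": age <= FRESHNESS_SLA})
```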

The real challenge in setting the right MLOps North Star metrics is the boundary between ML Product Delivery and Platform Adoption.

First off, what do we mean by ML Product Delivery? 

Earlier in the post we talked about the importance of applying a “Platform-As-A-Product” mindset and understanding adoption.

We also hinted that the major questions asked at the Workflow Layer (Layer 3) could be understood as those of: Risk, Velocity, & Throughput (i.e. better, faster, stronger).

Common questions asked at this level include: 

  • Does the workflow support X, Y, Z processes? 
  • Are we improving our “Mean-Time-To-Delivery”?
  • Are data scientists able to quickly & reliably train a model & serve it without having to fiddle with low-level APIs?
  • Do we have abstractions over the raw infrastructure that help with reliability, re-use, and velocity?
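To ground the “Mean-Time-To-Delivery” question, here is a minimal sketch of computing two DORA-style delivery metrics; the deployment-log schema (model_id, merged_at, deployed_at, caused_incident) is an assumption for the example, not a standard.

```python
# Illustrative sketch: DORA-style delivery metrics from a hypothetical deployment log.
import pandas as pd

def delivery_metrics(deploys: pd.DataFrame) -> dict:
    """Median lead time for changes and change failure rate."""
    lead_time = (deploys["deployed_at"] - deploys["merged_at"]).median()
    change_failure_rate = deploys["caused_incident"].mean()
    return {
        "median_lead_time_for_changes": lead_time,
        "change_failure_rate": float(change_failure_rate),
    }
```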

From a feature engineering perspective, questions usually include:

  • Who owns this feature? How was it defined? Am I allowed to use it?
  • Which models are using which features?
  • Who built and maintains this feature?
  • How do I serve this feature in production?

The Workflow Layer of a stack is all about “How well are the data scientists able to perform the jobs-to-be-done at each stage of an ML project’s lifecycle?” and consequently so are the questions we’d ask.

However, those still aren’t the same as ML Platform Adoption.

Ultimately Platform Adoption is a step above ML Product Delivery. 

As we tried to show earlier, evaluating the adoption of a Platform is about asking questions like: 

  • How many teams are currently using the platform?
  • How deeply engaged or embedded are the workflows in their projects?
  • How much of the Slack support is provided by the owners versus the users? I.e. are users also helping users? 
  • Are new features or enhancements being requested? 
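A hedged sketch of how these adoption questions might be answered from hypothetical platform usage logs and a support-channel export follows; the column names (team_id, workflow_id, ran_at, author_role) are assumptions for illustration.

```python
# Illustrative sketch: a platform-adoption snapshot from assumed usage logs.
import pandas as pd

def adoption_snapshot(runs: pd.DataFrame, support_msgs: pd.DataFrame) -> dict:
    """Active teams, depth of embedding, and peer-support share over the last 30 days."""
    last_30d = runs[runs["ran_at"] > runs["ran_at"].max() - pd.Timedelta(days=30)]
    return {
        "active_teams": last_30d["team_id"].nunique(),
        # Depth of embedding: distinct platform workflows each active team runs.
        "workflows_per_team": last_30d.groupby("team_id")["workflow_id"].nunique().mean(),
        # Share of support answers coming from users rather than platform owners.
        "peer_support_share": float((support_msgs["author_role"] == "user").mean()),
    }
```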

Measuring Platform Adoption for ML Platforms and MLOps initiatives need not be challenging or obscure, especially if teams are willing to learn from the realms of Product Management and Growth. 

In short, when evaluating “HOW” to measure platforms, we need to understand “WHAT” we’re measuring. 

(Source: Us, Design & Marketing ⭐)


How To Land On Your Team’s North Star Metrics 

Now that we’ve outlined the key considerations that must be addressed before defining our metrics of interest, how do we go about it? 

Here are the overall steps that we’ve discussed:

  • Step 1(A): Determine your stack constraints by identifying the highest level of platform abstraction that currently exists.
  • Step 1(B): If the current highest level of abstraction corresponds to Layer 2 (Services) or lower in our ML Stack Map, then it will be incredibly challenging to consistently and reliably measure impacts on end-to-end ML delivery or ML platform adoption and enablement. Instead, measure impact on individual stages. 
  • Step 2: Define Target Activities & Behaviors – What specific activities or processes are you looking to improve? What is the desired outcome? 
  • Step 3: Pick (and Commit to) Key North Star Metrics.

(Source: Work Chronicles)

A Note: Choosing The Right Metrics

By this point, the task of choosing the right metrics to track, and of ensuring that platform initiatives are progressing as expected (or at the very least aren’t regressing), is trivial.  

All the hard work in picking the right North Star metrics and using them to communicate the positive impact (and ROI) was in: 

  • Understanding the relationship between the data scientists and the platform;
  • Understanding the current state of the platform; 
  • Identifying the target behaviors the initiatives are trying to influence; and
  • Applying existing best practices and frameworks developed in other areas, like marketing and product, to map the Data Scientist user journey.

With that being said, here are some common metrics used by MLOps teams based on their current platform situation and pain points. 

Category of Metrics: ML Platform Adoption

Questions:
  ✔️ How many teams are currently using the platform?
  ✔️ How deeply engaged or embedded are the workflows in their projects?
  ✔️ How much of the Slack support is provided by the owners versus the users? I.e. are users also helping users?
  ✔️ Are new features or enhancements being requested?

Example Metrics:
  👉  Conversion rate
  👉  Adoption rate
  👉  Feature adoption rate
  👉  Time to value (time to adopt)
  👉  Activation rate
  👉  Usage frequency
  👉  Churn rate

Category of Metrics: ML Product Delivery

Questions:
  ✔️ Does the workflow support X, Y, Z processes?
  ✔️ Are we improving our “Mean-Time-To-Delivery”? How long do we have to wait from the model being ready to the model being served?
  ✔️ Are data scientists able to quickly & reliably train a model & serve it without having to fiddle with low-level APIs?
  ✔️ Do we have abstractions over the raw infrastructure that help with reliability, re-use, and velocity?
  ✔️ How many manual tasks are there to deploy a model, and how “painful” are they?
  ✔️ How long does it take to fix issues?

Example Metrics:
  👉  Lead Time For Changes (LT)
  👉  Change Failure Rate (CFR)
  👉  Mean Time To Recovery (MTTR)
  👉  Mean Time To Restore
  👉  Time To Deploy
  👉  Manual Tasks To Deploy

Category of Metrics: ML Infrastructure Reliability

Questions:
  ✔️ How fast can predictions be served? Can we serve at inference under our latency budget?
  ✔️ Do we need to do distributed training and are we able to?
  ✔️ Can we build features / training sets in an acceptable amount of time to not break flow?
  ✔️ Are we able to mirror the development environment in the production environment?
  ✔️ Do we have logging enabled? Are we able to monitor and detect data drift?
  ✔️ Can we support PyTorch or TensorFlow models?
  ✔️ Can we roll back models if there’s an outage?

Example Metrics:
  👉  Throughput
  👉  Latency
  👉  Availability
  👉  Mean Time Between Failures (MTBF)
  👉  Traffic (CPU Utilization, Memory Usage, Read/Write I/O Levels, etc.)
  👉  Saturation
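As one way of turning the ML Platform Adoption row into numbers, here is a minimal sketch assuming a hypothetical per-user table with invited_at, first_run_at, and habit_at columns (the column names and the “habit” definition of adoption are ours, for illustration only):

```python
# Illustrative sketch: activation rate, adoption rate, and time to value
# from an assumed per-user table of platform milestones.
import pandas as pd

def adoption_metrics(users: pd.DataFrame) -> dict:
    activated = users["first_run_at"].notna()   # ran anything on the platform
    habitual = users["habit_at"].notna()        # reached regular, repeated usage
    time_to_value = (users["first_run_at"] - users["invited_at"]).median()
    return {
        "activation_rate": float(activated.mean()),
        "adoption_rate": float(habitual.mean()),
        "median_time_to_value": time_to_value,
    }
```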


Conclusion: Ensuring ML Platform Adoption Readiness

(Source: Google)

  • ML Platform teams struggle with measuring, and consequently communicating, the impact of their MLOps initiatives and ML Platforms.
  • One reason is the nonlinear and muddled relationship between an individual ML developer’s productivity and how the ML Platform enables training, productionization, and deployment of ML pipelines. 
  • However, the main reason there hasn’t been significant movement in the discussion around how to measure the ROI of MLOps is the dependent relationship between defining North Star metrics and the maturity of the MLOps stack.
  • The discussion around defining metrics couldn’t mature until the stack matured because, until recently, most teams were stuck trying to solve problems at the infra & service level.
  • These problems were the bottleneck that prevented the conversation around measuring the ROI of MLOps from progressing to the workflow and platform level.
  • Another missing piece has been the importance of treating ML Platforms “as-a-Product” and achieving internal Platform-Data Scientist Fit (analogous to Product-Market Fit) through strategies like user surveys, quantitative analysis of activation-engagement-retention, & user testing to support additional investment.
  • In this post we proposed a “single-frame” framework for not just how to think about the North Star metrics your team should be using but also how these metrics are deeply tied to the maturity of your MLOps stack and the types of questions your team should be asking.
  • We also suggested ways to use the framework to diagnose, strategize, and plan your personal roadmap to ML Platform Adoption Readiness. 


Interested in learning more about how Featureform can save you time and money during implementation with our virtual feature store approach?

Book a demo of the Featureform platform here!

And don't forget to check out our open-source repo!

