Breaking Down "Real-Time" Machine Learning Systems

Simba Khadder
March 22, 2024

The term "real-time" in machine learning has become overloaded and meaningless. When someone describes an ML system as "real-time," they may be referring to online models, low-latency serving, or up-to-date and instant featurization of data. These three buckets are often lumped together into the qualifier, so it's important and necessary to tease out these differences. Otherwise, the systems that are built may not solve the problems at hand or may be over-engineered and complicated. The aim of this article is to dissect the various concepts and considerations integral to 'real-time' ML, paving the way for the development of systems tuned to the problems they aim to solve.

“Real-Time” Models

The three components commonly associated with 'real-time' models are

  1. online inference
  2. low-latency serving
  3. “real-time” features

Sometimes it means all three, and other times it means just one. This lack of clarity isn't just a semantic issue; it leads to practical problems. Without a deeper breakdown, it's impossible to decipher what is meant by "real-time," which hampers our ability to tailor solutions to a specific problem space. For instance, a team might compromise model accuracy chasing sub-30ms response times, only to find later that 500ms would have been fast enough, allowed better predictive quality, and taken less engineering effort.

Low Latency Serving

Low latency serving is often associated with the “real-time” phrase. It's about how fast a model responds to requests. But even then, it’s still unclear. Sometimes, it can mean under 10ms latency and other times, it means under 1s. The “low” in low latency is completely context-dependent. For some use cases, like recommender systems, response speed is paramount. In others, like fraud detection, we can sometimes get away with 100ms or 1s response times.

Online Inference

A separate concept overloading the “real-time” phrase is online inference, which refers to models running on a serving platform and perpetually ready to handle requests. This contrasts with offline models, like lead-scoring systems, which activate, make inferences, and then go dormant. The online approach implies constant readiness, but it's just one concept of what gets attributed to "real-time." Online inference signifies availability, not necessarily low latency or working on up-to-date data.
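
What "online" means in practice is easiest to see in code. Below is a minimal sketch of an always-on inference endpoint using Flask as one possible serving framework; the feature logic and scoring function are stand-ins for a real model, not any particular platform's API.

```python
# Minimal sketch of an online inference service. The process stays up and
# answers requests as they arrive; an offline job would instead load the
# model, score a batch of rows, write the results, and exit.
from flask import Flask, request, jsonify

app = Flask(__name__)

def featurize(payload: dict) -> list[float]:
    # Placeholder feature logic; in practice this might call a feature store.
    return [float(payload.get("age", 0)), float(payload.get("purchases_30d", 0))]

def predict_score(features: list[float]) -> float:
    # Stand-in for a real model's predict() call.
    return sum(features) / (len(features) or 1)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify({"score": predict_score(featurize(payload))})

# Run with `flask run` or a WSGI server such as gunicorn to keep the
# service perpetually ready to handle requests.
```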

Using “Real-Time” Features

The third catch-all attribute associated with "real-time" ML is the use of "real-time" features. Features are inputs to a model, so a "real-time" model may need "real-time" features. This pushes the confusion further down the stack, as "real-time" is just as ambiguous when describing features.

When talking about features, “real-time” usually references either or both:

  1. Freshness
  2. Latency

Freshness refers to how old a feature's value is at inference time, and latency is how long it takes to deliver that value to the model.

The easiest way to lower feature latency is by caching pre-processed features. However, this makes the features less fresh. The importance and impact of freshness and latency are context-dependent, another reason we need to use more descriptive terminology.

Latency

Latency is one of the concepts overloading the “real-time” feature phrase. For some ML systems, like recommender systems and fraud detection systems, latency can play a pivotal role. Latency refers to the time it takes for a model to retrieve the values of the features it needs. Many systems pre-process features into inference stores like Redis or DynamoDB to achieve lower latency. This approach trades data freshness for low-latency serving.
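
As a rough illustration of that retrieval path, the sketch below reads a pre-computed feature hash from Redis at inference time. It assumes redis-py and a hypothetical key scheme ("user:{id}:features") that a separate batch or streaming job keeps populated.

```python
# Low-latency feature retrieval from a pre-processed inference store.
# Assumes redis-py and an illustrative key scheme written by another pipeline.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_user_features(user_id: str) -> dict:
    # A single hash read, typically sub-millisecond on the same network.
    # The values are only as fresh as the job that last wrote them.
    return r.hgetall(f"user:{user_id}:features")
```

Retrieval like this is fast precisely because the expensive work happened earlier, which is why the freshness question shifts to whichever pipeline writes the keys.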

For some ML systems, features that are a day old are fine, and low latency is the more critical requirement. For example, a feature like a user's favorite genre over the last 30 days can be updated by a daily batch job without hurting model performance. When you want a user's favorite song in the last hour, you may take on the complexity of running a streaming pipeline to get fresher features. Finally, when you want to check whether a user's comment is spam, you may send the comment along with the request and generate the features with an on-demand transformation. Saying that a feature needs to be real-time isn't descriptive enough; you need to understand the latency and freshness requirements to tailor the approach to the specific needs and constraints of the application.

Freshness

Freshness is another concept that overloads the phrase "real-time" features. It refers to how long ago the feature's value was last updated, measured at inference time. Freshness matters because the most recent data often provides the most relevant signal, especially for features that change quickly over time. However, prioritizing freshness usually means accepting higher latency, because fresher features often require on-demand processing, which takes more time than retrieving pre-processed values from a fast-access store. Some people assume that using Kafka and Flink for streaming gives them extremely fresh features, but that's not always true: data still takes time to move through a streaming pipeline, even though it will be fresher than with a scheduled batch transformation. Streaming features also carry far more complexity than scheduled batch features. Stream jobs are always running and are often stateful; recovering from errors means rewinding the stream, updating the code, and re-running it, and changing feature logic requires backfills and other involved data operations. The best approach is to understand how quickly the feature changes over time and how much different levels of freshness affect the model, and then choose the pipeline that offers the best trade-off between freshness, latency, and complexity for the use case.

Building “Real-Time” Feature Pipelines: Balancing Complexity, Latency, and Freshness

Even though the two concepts that overload “real-time” with regard to features are freshness and latency, we often need to balance both along with a third characteristic: complexity. Complexity here refers to the effort of building, maintaining, and iterating on the pipeline.

There are three types of feature pipelines:

  1. Pipelines that pre-process features via batch jobs
  2. Pipelines that pre-process features via a streaming job
  3. Pipelines that generate them at the time of a request

This section will break down the three types of feature pipelines and their strengths and drawbacks, along with a decision tree to help you choose the right pipeline for your use case.

Pre-processing via Batch Jobs

Most features in production are pre-processed via batch jobs. These jobs are triggered by new files, new rows of data, or a schedule. Batch preprocessing offers a manageable and predictable approach to handling large data volumes, and the logic closely mirrors what was written during offline analysis, making the transition to production more straightforward. Its primary advantages are operational simplicity, especially compared to streaming, and low latency at inference time, since features are already computed. The trade-off is data freshness: the stored values may not reflect the most current state, which can be a limitation for rapidly changing features. Still, batch processing often provides sufficient freshness for many applications, and updates can be scheduled frequently to keep values reasonably fresh. In situations where features can be pre-processed and extreme freshness isn't paramount, this approach is a practical and efficient solution.
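
A minimal sketch of such a batch job is below, assuming pandas and redis-py; the events file, column names, and key scheme are illustrative. A scheduler (cron, Airflow, or similar) would run it at whatever interval keeps the feature fresh enough.

```python
# Sketch of a scheduled batch job that pre-computes "favorite genre in the
# last 30 days" and writes it to an online store for low-latency lookups.
# Paths, columns, and the Redis key scheme are illustrative assumptions.
import pandas as pd
import redis

def run_batch_job(events_path: str = "events.parquet") -> None:
    # Assumes columns: user_id, genre, played_at (UTC, timezone-aware).
    events = pd.read_parquet(events_path)
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=30)
    recent = events[events["played_at"] >= cutoff]

    # Most frequent genre per user over the window.
    favorites = recent.groupby("user_id")["genre"].agg(
        lambda genres: genres.value_counts().idxmax()
    )

    r = redis.Redis(host="localhost", port=6379)
    with r.pipeline() as pipe:
        for user_id, genre in favorites.items():
            pipe.hset(f"user:{user_id}:features", "favorite_genre_30d", genre)
        pipe.execute()

if __name__ == "__main__":
    run_batch_job()
```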

Pre-processing via Streaming Jobs

Preprocessing features via streaming jobs stands out for providing fresher data and low latency at the cost of operational complexity. In this approach, data is processed continuously as it is generated, significantly reducing the time between data arriving and features being updated for model inference. However, streaming is not without challenges. There is still a lag, albeit shorter than with batch processing, from the moment data enters the stream to when it is fully processed, which makes streaming unsuitable for features built off request-time data. The complexity of setting up and maintaining streaming pipelines is another factor: these systems require rigorous oversight, including backfilling missing data and managing edge cases, which can be more demanding than managing batch jobs. The processing logic also often differs from what was written for offline analysis and may require rewriting feature definitions. Despite these challenges, streaming preprocessing is a valuable option for features that change rapidly and require low-latency serving, with the trade-offs of increased complexity and some residual lag.
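
The toy sketch below shows the shape of a streaming update for the "favorite song in the last hour" example, assuming kafka-python and redis-py with an illustrative topic and key scheme. A production engine such as Flink or Spark Structured Streaming would add proper windowing, state management, checkpointing, and backfills.

```python
# Toy streaming feature update: consume play events and keep a per-user
# "favorite song in the last hour" feature reasonably fresh.
# Topic name, key scheme, and the crude windowing are illustrative.
import json

import redis
from kafka import KafkaConsumer

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
consumer = KafkaConsumer(
    "play-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for event in consumer:
    user_id = event.value["user_id"]
    song_id = event.value["song_id"]

    counts_key = f"user:{user_id}:song_plays"
    r.zincrby(counts_key, 1, song_id)    # bump this song's play count
    r.expire(counts_key, 3600)           # crude 1-hour window: the whole set
                                         # expires an hour after the last play
    top = r.zrevrange(counts_key, 0, 0)  # current most-played song
    if top:
        r.hset(f"user:{user_id}:features", "favorite_song_last_hour", top[0])
```

Even this simplified loop hints at the operational burden described above: it must run continuously, its state lives outside the job, and reprocessing history after a bug means replaying the topic.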

Processing at Request Time with On-Demand Functions

On-demand, request-time processing maximizes feature freshness at the cost of latency. This method processes data at the moment it's required, and it is the only option when the data needed to create the feature is only available at request time. The primary drawback is higher latency: processing data on demand can be both time-consuming and resource-intensive. On-demand processing is most effective when preprocessing is not possible, or when the value of having the most up-to-date information outweighs the latency incurred during processing. It also requires verifying that the serving infrastructure can handle this computation within the application's latency requirements.
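
Sticking with the spam-comment example, here is a rough sketch of an on-demand transformation; the feature logic and scoring function are placeholders rather than any particular platform's API.

```python
# On-demand transformation: the comment only exists at request time, so its
# features are computed inside the inference path rather than pre-processed.
import re

def comment_features(comment: str) -> dict:
    words = comment.split()
    return {
        "length": len(comment),
        "num_links": len(re.findall(r"https?://", comment)),
        "caps_ratio": sum(c.isupper() for c in comment) / max(len(comment), 1),
        "repeat_ratio": 1 - len(set(words)) / max(len(words), 1),
    }

def spam_score(features: dict) -> float:
    # Stand-in for the actual model call so the sketch runs end to end.
    return min(1.0, 0.3 * features["num_links"] + 0.5 * features["repeat_ratio"])

def handle_request(payload: dict) -> float:
    # Feature computation adds latency to every request, but the resulting
    # features are exactly as fresh as the request itself.
    return spam_score(comment_features(payload["comment"]))
```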

Rethinking "Real-Time" in Machine Learning

As it's currently used, the phrase "real-time" in machine learning is a broad-brush label that often muddles understanding rather than clarifying it. This catch-all phrase indiscriminately groups together diverse operational concepts — online inference, low latency serving, and feature freshness — each with its own intricacies and trade-offs. The reality is more nuanced: these concepts represent a range of operational modes of an ML system, each uniquely fitting different scenarios.

Using "real-time" as an adjective risks your system not matching your needs or being over-engineered. Adopting more descriptive and precise terminology will facilitate clearer communication within teams and aid in more informed decision-making around system architecture and design. It's about accurately identifying and articulating the needs of a specific application. Ultimately, ML systems are meant to fit a set of requirements derived from solving a user’s problem. By using more precise language to break down the "real-time" term, we can build ML systems better and faster.
