This episode of our MLOps Weekly Podcast features Simba Khadder, Featureform’s CEO, unraveling the true meaning of “real-time” machine learning. The discussion breaks real-time ML down into three core aspects: latency, online serving, and real-time features. Simba also covers common misconceptions and how to translate “real time” into a concrete plan for your own use case.
Listen on Spotify
[00:00:06.120] - Host
Hey everyone, [inaudible 00:00:07] here, and you're listening to another episode of the MLOps Weekly podcast. Today we're going to be diving into a topic that's, in my opinion, really misunderstood around machine learning, which is the meaning of real time. Throughout this episode, I'll be talking through what people mean when they say real time and some common misconceptions. By the end, if someone were to tell you to build a real-time machine learning system, you'd be able to break that down and understand what it really means for your specific use case.
[00:00:39.570] - Host
So let's begin by talking about real time and how it's used. Real time means a lot of different things in a lot of different situations. In machine learning, real time is typically broken down into three related concepts: one is latency, two is online serving, and three is real-time features. That last one is a bit circular, since "real-time features" uses the term itself, so we'll break that down in this episode as well.
[00:01:12.550] - Host
Let's begin with the first thing I mentioned, which is low-latency serving. When people mention real time, a lot of the time they mean "I can serve quickly." On its own, that doesn't really mean much; it's almost like saying the service is fast, and fast is always relative. A good way to think about this is ChatGPT. You could argue ChatGPT is a real-time system: when you type a message into it, it responds back in real time.
[00:01:48.150] - Host
That said, it's quite slow; for a lot of use cases it's horrifically slow. If a recommendation system took a second to respond and give you recommendations, that would be awful. Even me, as I use ChatGPT, I'll tell it to abbreviate answers so that it gets things out faster.
[00:02:08.980] - Host
So that's roughly a one-second response rate. There are other situations where, for example, you're dealing with a credit card transaction and doing some form of fraud detection on it. There you might actually have a little bit of time, because the transaction itself is going to take a moment and there's an implied wait; users aren't counting the milliseconds for something like that. So it might be a little bit slower, which might be fine. You have more of a latency budget to work with.
[00:02:41.180] - Host
And then finally, there are some situations, recommender systems being a really good one, where you really have to move fast. For example, if you open YouTube, Airbnb, or any site that starts with a personalized feed, you want that feed to show up as fast as possible. In those situations, latency is really important, and you want to make sure it's low.
[00:03:03.020] - Host
So think about your use case, and think about what the end user of the service would expect and what is acceptable. A lot of the time, when someone says, hey, the system has to be real time, I'll ask about latency, and then I'll ask about the latency budget and break it down. Because it's not just the model: there are the features, there's the network, there are a lot of things that come into play. And oftentimes, like in the case of a recommender system, it's feeding into something else. You might have 500 milliseconds to put out a page, but the recommendations have to be done in 200 milliseconds because you need 300 milliseconds to grab the images and do everything else. So that's the first concept of real time, which is latency.
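To make the latency-budget idea concrete, here's a minimal sketch of splitting a page budget into component budgets and flagging steps that overrun. The numbers mirror the hypothetical figures above (500 ms page, 200 ms recommendations, 300 ms images and layout), and `fetch_recommendations` is a placeholder, not a real API.

```python
# A minimal latency-budget sketch. Budgets and the fetch function are illustrative.
import time

PAGE_BUDGET_MS = 500
BUDGETS_MS = {
    "recommendations": 200,   # model + feature lookups + network
    "images_and_layout": 300, # everything else on the page
}

def timed(name, fn, *args, **kwargs):
    """Run a step, measure it, and warn if it blows its budget."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > BUDGETS_MS[name]:
        print(f"WARNING: {name} took {elapsed_ms:.0f} ms, budget is {BUDGETS_MS[name]} ms")
    return result

def fetch_recommendations(user_id):
    return ["item_1", "item_2"]  # stand-in for the real recommender call

# Sanity check that the component budgets fit inside the page budget.
assert sum(BUDGETS_MS.values()) <= PAGE_BUDGET_MS
recs = timed("recommendations", fetch_recommendations, "user_42")
```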
[00:03:49.050] - Host
The other thing that comes up with real time is the idea of an online model. To describe what an online model is, I'll first define an offline model. An offline model, sometimes called a scoring model or a batch model, typically runs on some schedule, for some set period of time, and it works on offline data.
[00:04:09.510] - Host
Scoring is a really good way to think about this. Let's say I'm going through my user base. I have a database full of my users, and I want to decide who I should give a special offer to, a personalized offer like 20% off. Because I'm an e-commerce company, I don't need to do that in an online fashion; I can just decide once a day who gets what coupons.
[00:04:34.390] - Host
And so what I'll do is spin up my model, loop through the database, make a prediction for each row, and then, once I'm done, the model turns off. The benefit here, in situations where you can do this, is that the model is offline for some period of time, literally not doing anything for some period of time. It makes updates easier: if it misbehaves, if it does something weird or wrong, you can always just fix it and run it again. So typically the stakes for something like that are much lower. And that's an offline setting, an offline model.
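Here's a minimal sketch of that offline scoring pattern: the job wakes up on a schedule, loops over the users, writes out its decisions, and then simply exits. `load_users`, the scoring rule, and the coupon logic are all hypothetical placeholders, not anything from a specific system.

```python
# A sketch of an offline (batch/scoring) model run, assuming a daily trigger such as cron.

def load_users():
    # In practice this would read from your warehouse or database.
    return [{"user_id": 1, "days_since_purchase": 40},
            {"user_id": 2, "days_since_purchase": 3}]

def score_offer_propensity(user):
    # Stand-in for a real model; any batch-friendly model works here.
    return 0.9 if user["days_since_purchase"] > 30 else 0.1

def run_daily_coupon_job():
    decisions = []
    for user in load_users():
        score = score_offer_propensity(user)
        decisions.append((user["user_id"], "20_PERCENT_OFF" if score > 0.5 else None))
    # Persist decisions somewhere (warehouse table, email queue, etc.), then the job exits.
    return decisions

if __name__ == "__main__":
    print(run_daily_coupon_job())
```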
[00:05:10.420] - Host
An online model is a model that, as the name implies, is always running and always ready to receive requests from users. For example, a recommendation system: at any point in time, it might be running, ready to receive a request and produce a recommendation. That's what makes it an online model.
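For contrast with the batch job above, here's a minimal sketch of an online model: a long-running process that keeps the model in memory and serves requests as they arrive. FastAPI is just one common choice and is assumed to be installed; the recommendation logic is a placeholder.

```python
# A sketch of an online model server, assuming FastAPI is available.
from fastapi import FastAPI

app = FastAPI()
MODEL = None  # loaded once at startup, stays in memory while the service runs

@app.on_event("startup")
def load_model():
    global MODEL
    MODEL = lambda user_id: ["video_123", "video_456"]  # stand-in for a real model

@app.get("/recommendations/{user_id}")
def recommend(user_id: str):
    # Unlike a batch job, this process never "finishes"; it just keeps serving requests.
    return {"user_id": user_id, "items": MODEL(user_id)}

# Run with, e.g.: uvicorn online_model:app
```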
[00:05:32.680] - Host
It's also associated with real time, but not all real-time models have to be online. Some people just mean real time as really, really fast, really low latency, high throughput, as in "I can go through my whole database in five minutes, so it's real time," even though it's an offline model. Other people very clearly mean an online model when they say real time. So it's worth breaking down that distinction too.
[00:06:00.890] - Host
It actually gets more interesting and complicated, because you can merge online and offline models together. A good example, again, is recommendation systems, where part of the system often works in an offline fashion to generate candidates. There might be a billion YouTube videos, but I'll narrow it down to the 10,000 videos that I would ever recommend to this user for whatever time period.
[00:06:28.190] - Host
Then, when I have to do the real-time recommendation, rather than working across the whole corpus of billions of videos, I just have to work across those 1,000 or 10,000 videos that I think are plenty relevant to the user and have enough diversity. So you can mix the two together, and the reason you would do that is typically to achieve lower latency, so you don't have to process everything at once.
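Here's a minimal sketch of that two-stage pattern: an offline job narrows the catalog down to a per-user candidate set, and the online path only ranks that small set at request time. The catalog, candidate logic, and scoring function are all hypothetical stand-ins.

```python
# A sketch of offline candidate generation feeding an online ranker.
import random

# --- offline job: narrow a huge catalog down to a per-user candidate set ---
def generate_candidates(user_id, catalog, k=10_000):
    # In practice: approximate nearest neighbors, co-visitation, etc.
    return random.sample(catalog, min(k, len(catalog)))

# --- online path: rank only the precomputed candidates, not the whole catalog ---
def score_item(user_id, item):
    return random.random()  # stand-in for a real ranking model

def rank_online(user_id, candidates, n=20):
    scored = [(item, score_item(user_id, item)) for item in candidates]
    return [item for item, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:n]]

catalog = [f"video_{i}" for i in range(100_000)]
candidates = generate_candidates("user_42", catalog)  # run offline, cached per user
top = rank_online("user_42", candidates)              # run online, low latency
```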
[00:06:51.690] - Host
Online models don't have to be fast, though; for some use cases, it doesn't really matter whether they're fast or not. Again, ChatGPT is a great example of this: it's not super fast, but it's online, always ready to receive requests, and it can take a bit. The same is true for a lot of generative services; anything working with images, or rather video, tends to be kind of slow because it's a very computationally heavy problem. In those sorts of situations you'll also see online models that don't have low latency, meaning they're quite slow.
[00:07:25.690] - Host
The final pillar that's often associated with real time is what a lot of people just call real-time features. This becomes a whole other problem, because you have to define real time all over again. Real-time features overlap with real-time models, but they're not the same; the concepts are slightly different.
[00:07:44.710] - Host
For a real-time feature, there are two properties being referenced. If someone just says "real-time feature," it's not useful or descriptive; it doesn't really mean anything on its own. But they're usually talking about two axes. One is latency, which is just: how fast can I get this feature?
[00:08:04.360] - Host
And the other is freshness, which is: how old is this feature? When was it created? How stale is the data? The thing is, in many situations, getting highly fresh data is slower, which means higher latency, and vice versa.
[00:08:21.890] - Host
A really easy way to achieve super low latency is to cache everything. But if you cache things, then by the very design of caching, the data will be a little bit out of date; how out of date depends on the system.
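Here's a minimal sketch of that trade-off with a simple TTL cache: reads are fast because they usually hit the cache, but any value can be up to the TTL stale. The cache class and `compute_feature` are purely illustrative.

```python
# A sketch of the latency/freshness trade-off using a TTL cache.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, written_at)

    def get(self, key, compute):
        value, written_at = self._store.get(key, (None, 0.0))
        if time.time() - written_at > self.ttl:
            value = compute(key)             # slow path: refresh from the source of truth
            self._store[key] = (value, time.time())
        return value                         # fast path most of the time, but possibly stale

def compute_feature(user_id):
    time.sleep(0.2)  # pretend this is an expensive query
    return {"purchases_last_30d": 7}

cache = TTLCache(ttl_seconds=300)  # accept up to 5 minutes of staleness
print(cache.get("user_42", compute_feature))
```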
[00:08:36.860] - Host
So if someone were to tell me we need a real-time feature here, I would want to understand why. First, what's the goal? Why does this feature have to be real time? What's the point? And second, are we talking about real time from the serving perspective or from the freshness perspective? How much latency do we have to serve this feature?
[00:08:58.300] - Host
And then, "how old is the data?" isn't really the right question. The better question is: how often does this data tend to change, and how dramatically does it change? For example, if I have a feature that's a user's favorite song over the last year, the last 365 days, that's not going to change very often. It will change, for sure, especially when the user is new.
[00:09:27.650] - Host
But after you've been using a platform like Spotify for a year, your top song is not going to change every moment, or every minute, or every five minutes. It probably won't even change every day; it might not change at all for the whole year. So choosing the right time window to process makes it easier to achieve the latency you need, and it also cuts compute cost.
[00:09:49.990] - Host
There are three types of features, and they have different characteristics around latency and freshness. The first we call a batch feature. A batch feature, as the name implies, typically runs across the full batch of data all at once, usually on some time frame, so it's triggered on a schedule.
[00:10:11.880] - Host
They're really easy and nice to build, because they mimic what data scientists are doing in their notebooks. When data scientists work in a notebook, they'll take either all the data or a sample of it and process things in batch.
[00:10:25.240] - Host
On the other side of it, batch features tend to have really low latency, because you often batch-compute all your features and then cache them in something like Redis or Dynamo. So those are the two pieces of it.
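Here's a minimal sketch of a batch feature: compute one aggregation over the full table on a schedule, then write the results into a low-latency store so serving is a single key lookup. It assumes the `pandas` and `redis` packages and a running Redis instance; the table and the feature itself are hypothetical.

```python
# A sketch of a batch feature job: aggregate everything, then cache into Redis.
import pandas as pd
import redis

def run_batch_feature_job(transactions: pd.DataFrame):
    # This mirrors what a data scientist might do in a notebook:
    # one aggregation over all the data at once.
    spend_per_user = transactions.groupby("user_id")["amount"].sum()

    # Cache into Redis so online serving is just a key lookup.
    r = redis.Redis()
    for user_id, total in spend_per_user.items():
        r.set(f"total_spend:{user_id}", float(total))

# Triggered by a scheduler (cron, Airflow, etc.) rather than per request.
```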
[00:10:36.610] - Host
Now, the next type of feature is a streaming feature. A lot of people associate streaming features with real time, and it makes sense, because streaming, unlike batch, has some of those characteristics. A streaming feature is a feature you're processing off of a stream: as new events come in, you're updating the feature.
[00:10:58.870] - Host
This means the latency is going to be low, because you're constantly updating a cache, so you get the same sort of latency you'd expect from a batch feature. But unlike a batch feature, which tends to run on some time period, a streaming feature is continuously running and continuously updating features.
[00:11:15.830] - Host
Because it's continuously running and updating the cache, it keeps the data pretty fresh. The freshness lag is just as long as it takes for an event to get to the stream, go through it, be processed by the stream processor, and get updated in the cache. That can be quite fast.
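Here's a minimal sketch of a streaming feature: a long-running consumer updates the feature value incrementally as each event arrives. `event_stream` is a stand-in for a real message broker (Kafka, Kinesis, etc.), and the in-memory dict stands in for a cache like Redis.

```python
# A sketch of a streaming feature job with a simulated event stream.
import itertools
import random
import time

def event_stream():
    # Stand-in for consuming from a real stream.
    for _ in itertools.count():
        yield {"user_id": f"user_{random.randint(1, 3)}", "amount": random.uniform(1, 100)}
        time.sleep(0.1)

feature_cache = {}  # user_id -> running event count (would be Redis/Dynamo in practice)

def run_streaming_feature_job(max_events=50):
    for event in itertools.islice(event_stream(), max_events):
        uid = event["user_id"]
        feature_cache[uid] = feature_cache.get(uid, 0) + 1  # incremental update per event
        # Freshness lag = time for the event to reach the stream, be processed, and land in the cache.

run_streaming_feature_job()
print(feature_cache)
```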
[00:11:33.040] - Host
However, it's not instantaneous. A good way to imagine this is, let's say a user is typing a comment, and you want to see if that comment contains hate speech. So you have a model that does that.
[00:11:49.390] - Host
Well, when the comment comes in, you're likely going to featurize it in some way; you're going to break it down into some features. If you're doing that with a batch feature, you would have to literally wait for the next batch run to finish, which might be a day away. So it doesn't really make sense for an online use case.
[00:12:07.280] - Host
With the streaming approach, a lot of people think, well, let's just do it via streaming. The problem is that if you throw the comment onto the front of a stream, it's still going to take a bit to get all the way through, get processed, and get updated in the cache, and you essentially have to keep hitting the cache, waiting for the value to change, waiting for the feature to show up there.
[00:12:29.400] - Host
One, it could be slow. And two, it's just needlessly complicated. It feels needlessly complicated to take a request, throw it into a stream, wait for the stream to finish, and then return the value. You'd rather just run the function right there, right? If I have the featurization function, say it's a Python function, why do all this streaming stuff when I could just run the function at the time of the request?
[00:12:55.710] - Host
That is the final type of feature; we call them on-demand features. They're features that run at request time. Streaming and batch features are pre-processed: they run asynchronously from the request and update a cache that gets hit per request.
[00:13:10.690] - Host
On-demand features, on the other hand, run on each request. The nice thing about them is freshness: you're processing the features as the requests come in, so it's about as fresh as you can get.
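Here's a minimal sketch of an on-demand feature, tied to the comment example above: nothing is precomputed, and the feature values are derived from the request payload itself at request time. The specific text features are deliberately trivial and only illustrative.

```python
# A sketch of an on-demand feature: computed directly from the incoming request.
def featurize_comment(request_payload: dict) -> dict:
    text = request_payload["comment_text"]
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "exclamation_ratio": text.count("!") / max(len(text), 1),
    }

features = featurize_comment({"comment_text": "This is outrageous!!!"})
# `features` is as fresh as possible, but its compute cost is paid inside the request.
```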
[00:13:29.490] - Host
Latency, though, can be an issue, because you're no longer caching or processing asynchronously; you're processing in line with the request. However long that function takes to run is added directly to the latency. For certain operations it might be negligible, but for heavy operations, say your on-demand feature runs a SQL query because you need the most up-to-date values to run your model on, it's going to take a while to actually run that query, get the data, and return it to the model.
[00:14:04.130] - Host
So on-demand features are the third type of feature. A lot of people are under the impression, especially around streaming and batch, that you have to pick one or the other. In practice, in my opinion, you can use all three, and I have many examples where people do. Some features, like we talked about, don't require really, really fresh data. They don't have to be up to the second; they can be 24 hours old, and it might not be a huge deal.
[00:14:31.040] - Host
In those situations, when the data is processed on whatever timeframe that is, you get the super low latency you might need, and the other really nice thing is the simplicity. Batch features are really, really simple, and they mimic what data scientists are doing in their notebooks anyway. So there's a huge benefit in that simplicity.
[00:14:53.480] - Host
Streaming, though, has the same latency as batch and is fresher; the data is fresher. You might assume, well, why don't we just always do that? It seems strictly better. Well, the cost is complexity, because the way you write a streaming feature and the problems you encounter when dealing with a stream of data are completely different from the problems you deal with when transforming data in batch; there are just many more things that can go wrong.
[00:15:21.960] - Host
Also, batch tends to run on some time period, so if you do a batch run and the features are wrong, or your model doesn't do what you thought it was going to do, you can always just update the transformation and run it again.
[00:15:37.420] - Host
With streaming, it's not as easy to just change a feature; streams aren't really meant or made for that, and that adds complexity. And then finally, on-demand features. On-demand features are simply a necessity in a lot of situations: sometimes the data you need to create your features only arrives with the request. In those situations, you need an on-demand feature.
[00:15:59.300] - Host
You can also, and many times will, use an on-demand feature along with a pre-processed feature. For example, say I'm receiving transactions because I'm doing fraud detection, and my feature is: what percent of the user's balance does this transaction make up? If the user has $1,000 in their bank account and they're spending $100, maybe it's more likely to be fraud.
[00:16:25.910] - Host
So that percentage is a feature. The transaction amount comes with the request, but the balance I can pre-process. The balance feature might be pre-processed, maybe even processed in batch, or it could be processed in streaming. The transaction amount arrives with the request, and the on-demand feature joins the two together, does the division, and returns the answer.
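Here's a minimal sketch of that combination: the account balance comes from a cache filled by a batch or streaming job, the transaction amount arrives with the request, and the on-demand feature does the join and the division. The dict stands in for a store like Redis or Dynamo, and the fallback for an unknown balance is an illustrative choice, not a recommendation.

```python
# A sketch of an on-demand feature that joins request data with a precomputed value.
precomputed_balance = {"user_42": 1_000.00}  # written by a batch or streaming job

def pct_of_balance(request: dict) -> float:
    """On-demand feature: what percent of the user's balance is this transaction?"""
    balance = precomputed_balance.get(request["user_id"], 0.0)
    if balance <= 0:
        return 1.0  # illustrative fallback: treat unknown/zero balances as maximally suspicious
    return request["amount"] / balance

print(pct_of_balance({"user_id": "user_42", "amount": 100.00}))  # -> 0.1
```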
[00:16:48.530] - Host
So those are the three types of features, and they all have different characteristics when it comes to freshness and latency. And they all fit together: features feed into models, so all of these components connect. What's the latency of the model? Is the model online or offline? For each feature, what latency do we need to serve it, and how fresh does that data need to be? If you answer all those questions for all the features and the model, you can really define what real time means for your use case, because what real time means for your use case is going to be completely different from what it means for another one.
[00:17:28.750] - Host
And to do it correctly, you have to think about everything in context, even across features: if one feature has super low latency but the one next to it has super high latency, it doesn't really matter, because the model needs all of those features before it can start anyway. Freshness can vary per feature, and it really depends on what kind of feature it is.
[00:17:45.250] - Host
The main takeaway I want people to get is: when you're asked to build a real-time ML system, here's how you break it down and come up with an actual game plan that makes sense for your specific use case, instead of just doing streaming for the sake of it because someone said real time. So I hope this was insightful. Let us know what you think of this new format where it's just me talking; we'll definitely be throwing in interviews along with this as well. Thanks for listening.