The Future of Data Infrastructure in MLOps with Sam Partee

Episode 4

MLOps Weekly Podcast

Principal Applied AI Engineer, Redis
Simba Khadder



Simba: Hey everyone, I'm Simba Khadder and you are listening to the MLOps Weekly Podcast. Today, I'm chatting with Sam Partee, Principal Applied AI Engineer at Redis. Prior to Redis, he worked at HP Enterprise, where he worked on some of the foundations of MLOps tooling for high performance computing. Sam, I'm super excited to have you on the show today.

Sam: Yeah, I'm excited to be here. Thanks for having me.

Simba: So it'd be great to give the audience some context on how you got into MLOps.

Sam: Oh, that's a great question, because machine learning wasn't even the field I started in. It was high performance computing, and originally I started working on a parallel programming language called Chapel. I actually built a package manager for that language, which is kind of how my love of tools began. And I was lucky enough to be able to join a newly forming AI team right after working on that language. On that team we did distributed training and inference, really the prior stages of the ML life cycle. So we worked on things like feature selection, hyperparameter optimization, and getting tools like Dask and Ray to run on supercomputers. And in HPC tooling, everybody's sort of their own sysadmin too. You know, there's not a lot of support for that type of cluster and supercomputer system in the MLOps space, with a few exceptions; some tools, like Rendered AI, have tried to corner that.

But I'm also a bit of a hobbyist; I've built my own systems, I've deployed Slurm and Kubernetes and MLflow on my own systems, and run my own Jupyter workflows. Right after that AI team formed, I had the opportunity to start CrayLabs, which is an open source organization within HPE focused on the applications of deep learning to numerical simulations. The main products were SmartSim and SmartRedis, which is how I was exposed to Redis. Essentially the product was built to enable distributed simulations in C, C++, and Fortran to use Redis as an online feature store, and to use RedisAI as an inference serving engine. The whole point was that it took these really large distributed MPI simulations in Fortran, C, and other languages that traditionally don't have great access to the type of MLOps tooling we have in Python, and let them call [inaudible] and on with a simple API. That culminated in a Journal of Computational Science paper with the National Center for Atmospheric Research, which is a really invigorating read if you get the chance.

But shortly after, because of that exposure to Redis, I was able to join the Determined AI team and work on distributed training and experimentation for ML, which once again was that front end of the MLOps life cycle, which is really the focus at HPE. And then because of that exposure to Redis, I ended up working for Redis and started getting exposed to the other side of the MLOps cycle. So, a kind of long-winded wrap-up to that explanation: I started in HPC, which exposed me to AI; using AI on those platforms made me realize the need for MLOps because of the lack of support in HPC; and then I fell in love with Redis and its ecosystem and ended up working for them.

Simba: That's awesome. So you can officially say that you are one of the very few people in the world who have actually built MLOps tooling for Fortran.

Sam: Yes, I am. And I can say that proudly.

Simba: Well, going into that, going from Fortran and C, that world at HPE, to now at Redis: Redis is used all over, from very tech-forward companies to kind of literally everywhere. How has your perspective on MLOps changed?

Sam: Well, it's really interesting. At HPE it was more, like I said, the front end of the life cycle. So experimentation with platforms like Determined AI, or distributed training, where you can think of frameworks like Horovod. Or we were doing distributed HPO and trying to come up with new algorithms to do HPO more efficiently across, let's say, a hundred nodes where each node has a 48-core CPU and four V100s. So job scheduling and working with workload managers like Slurm and Kubernetes, doing all of that side of things, is much less about the production aspect of machine learning. It's much more about the experimentation phase, the data loading, the distributed training. And on that side, you don't necessarily have to utilize machine learning in production to be able to benefit. The other side, I would say, is more the deployment and data ETL side of things. That's really what I've found at Redis, and also partly in the SmartSim work that I did, where we used Redis as an online feature store. I was exposed to that there, and so from HPE to Redis, I've been able to transition from that experimentation phase and distributed deep learning over to really using machine learning in production and finding ways to use it more efficiently.

Simba: That's really interesting. It's almost like a lot of the focus, even in tooling, has moved from training towards production, just as people have.

Sam: Yeah, totally agree.

Simba: Taking a step back, how would you…Because MLOps means so many different things. We talked about experimentation, we talked about production and I'm sure many, many other kind of slices, but all things MLOps. How would you define MLOps?

Sam: Yeah, so this one's interesting. Kind of like I was saying, it goes in stages. You could say there's data operations, which I would consider to be processing, cleaning, analysis, feature extraction, engineering, et cetera; then experimentation, so training, model selection, testing; then deployment, so inference, feature storage, monitoring; and then data ETL, which kind of glues them all together: the APIs for ingest, storage, and retrieval at any of those stages. And even though I said stages, it's more like nodes in a graph, because at any stage or node you could need to go back to a prior stage, or advance to a new stage, and do any of those stages multiple times in parallel with different data sets. All of that needs to be systematically monitored, scheduled, and tracked. And that process regularly has to be examined with a fine-tooth comb.

And here's my definition: what I would consider MLOps to be is the study of maintaining many such graphs to better achieve some type of business value from the utilization of machine learning, usually within some type of team or organization, over time. So, breaking that down: there's not a team putting machine learning into production that's putting just one model in production. That's why I say many of those graphs. You need the tooling to be able to deploy many models at a time, test many variants of those models across different groupings of your users, track those systematically, and have the tooling and the systematic process to actually examine the data that's coming out of that process. And then again, with a team or organization: when I was working with Determined AI, I got exposed to that team mentality. It's much harder to have business value come out of machine learning if your platform doesn't encompass the team. If everybody's siloed in their own area, doing their own thing, it's much harder to drive business value from actually deploying machine learning, because you're going to lose a lot of your learnings in the process. And lastly, over time is another thing, because MLOps really only rears its head when you do something again and again and again. Like I said, that process needs to be iterated on and improved, and if you don't have the tools to track it, it becomes a wildly unmaintainable endeavor.
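
One way to picture "testing many variants of those models across different groupings of your users" is deterministic bucketing: hash each user into a stable bucket and map bucket ranges to model variants, so the same user always sees the same variant and results can be tracked per group. The sketch below is only an illustration under assumptions; the variant names, the weights, and the experiment label are hypothetical and do not come from the conversation.

```python
import hashlib
from collections import Counter

# Hypothetical variants and traffic weights (percent); not from the episode.
VARIANTS = [("model_a", 50), ("model_b", 30), ("model_c", 20)]


def assign_variant(user_id: str, experiment: str = "ranker-test") -> str:
    """Deterministically map a user to a model variant by hashing."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    threshold = 0
    for name, weight in VARIANTS:
        threshold += weight
        if bucket < threshold:
            return name
    raise ValueError("variant weights must sum to 100")


# Tally how a synthetic user population is split across the variants.
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
```

Because the assignment depends only on the experiment label and the user ID, it can be recomputed anywhere in the stack without storing a lookup table.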

Simba: Yeah. I have kind of a similar view. I always say that MLOps is about workflows. A lot of people think of MLOps as infrastructure, like, oh, I need this thing, this is MLOps. But to me, sure, it requires infrastructure as a foundation, but it's more of a workflow problem. You mentioned the DAG idea, that connected graph, but the DAG is just one of many workflows. Every company has different workflows, and so does every model. Take a computer vision model: how you work with that is going to look completely different, beyond the actual algorithms you use. Even just how you label, the problems that can occur in production, and how you monitor everything look different. And if you're a huge company like HPE versus maybe a dozen data scientists at some fintech company, it's completely different; governance and all these other things come in. It's all workflows, and your MLOps stack is almost a way to solidify what your workflow is.

Sam: Absolutely. The infrastructure is the nodes of those graphs, and then MLOps, like I say, is the abstraction on top of all of those edges and nodes, and maintaining many of them. Just like you said, you're going to have more than one workflow.
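
The "stages as nodes in a graph" picture can be sketched in a few lines. The stage names below follow the ones listed in the conversation; the edges between them are an illustrative assumption, not a canonical MLOps life cycle, and `is_valid_path` is a hypothetical helper.

```python
# Stages from the conversation; the allowed transitions are assumed for illustration.
STAGE_GRAPH = {
    "data_operations": {"experimentation"},
    "experimentation": {"data_operations", "deployment"},
    "deployment": {"experimentation", "data_operations"},
}


def is_valid_path(path):
    """Return True if every consecutive pair of stages is an edge in the graph."""
    return all(nxt in STAGE_GRAPH.get(cur, set()) for cur, nxt in zip(path, path[1:]))


# A workflow that loops back to data work after a deployment, then retrains.
workflow = [
    "data_operations", "experimentation", "deployment",
    "data_operations", "experimentation", "deployment",
]
```

Maintaining "many of those graphs" then amounts to keeping one such structure, plus its run history, per model or per team.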

Simba: You talked a bit about how at HPE you were working on training, kind of the front end of the process, and now at Redis you've moved towards the other side. You also just mentioned how MLOps is an abstraction over the nodes, the infrastructure. How do you think about experimentation versus production in MLOps tooling? What does a perfect MLOps workflow look like? How does it all come together?

Sam: Well, I can't say for certain what a perfect MLOps workflow looks like, because I think the field in general is still figuring it out. I feel like every year the MLOps space changes quite a bit, but there is a bifurcation between point solutions and these more encompassing platforms. And I think it depends on the size of your team. Just like you were saying, you could have a team of data scientists at, say, HPE, where there are so many employees, so many teams, and so many groups doing so many different applications that it's really hard to get everybody on the same platform. So you might end up using more point solutions, or a specific platform for a specific problem, because you can afford to build solutions around it with your team. You can afford to say, oh, we can take on this component and build our own version and maintain that over time, because we have the engineers to do so. Whereas for smaller teams or smaller companies, thinking about the work on Determined AI for the experimentation side, it's very different: you need more of those features baked in, because you're not going to be able to build everything yourself. And I do actually think one of the biggest mistakes I see in the MLOps space is, at the outset, teams of fewer than 10 or 20 engineers trying to go out and say, oh, we're going to build this entire stack ourselves. You might have needed to do that five years ago, but today there are already so many solutions, especially in open source, that you can readily build on, benefit from, or learn from to make your own product, or you can just use their product.

I think that's probably one of the biggest mistakes I see today. But bringing that back to the experimentation side of things: it's really something I see smaller teams benefiting from more when that process is encapsulated in a single platform, as opposed to building around a number of point solutions, which I think is really only possible if your company or your team is quite large.

Simba: Yeah, that's super interesting. I'm now thinking about Redis; you mentioned point solutions versus platforms. How does Redis fit into that MLOps stack? I know Redis has been releasing a lot of new functionality around ML use cases. I'm really curious to hear, one, what you've built and what's coming out soon, and two, what the long-term vision is for Redis in the ML stack.

Sam: Yeah, absolutely. So one of the things I love about Redis is that it's used everywhere: job scheduling with Celery and Airflow, task brokering, training as a parameter store, and of course inference for [inaudible] AIs or model storage. But mostly today in the MLOps life cycle, Redis is being used as an online feature store. For those of you who are unfamiliar with online versus offline storage: online feature storage is for when you need low-latency retrieval of data for a production model that's been deployed for some service or application, whereas offline is the data that may be sitting in an S3 bucket, or even on file or Lustre, what have you, which you're using for some training or experimentation process, or feature selection, or EDA. And the reason Redis is really equipped to be an online feature store is because of what's on paper. Performance: with sub-millisecond latencies, you can run latency-critical workloads that just aren't possible with a lot of the other managed database solutions offered by cloud vendors. Scale: with Redis Enterprise, you also get Redis on Flash, and when you can use flash in addition to an in-memory database, you don't lose a lot of the latency and you gain massive scale. Personally, I've stored terabytes up to petabytes of simulation data in a single Redis cluster deployment. So it's really scalable when it comes to data size, and also total cost of ownership. And what's also really beneficial about being open source and being Redis is that if you don't have the engineers to maintain your own open source Redis cluster, you just go sign up on Tech Times and they'll manage one for you, or use any number of the open source or managed feature services out there that are building on top of Redis because of what's on paper.
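
To make the online versus offline distinction concrete, here is a minimal in-memory sketch. It assumes the offline store is a list of historical feature rows (standing in for files in an S3 bucket) and the online store is a plain dictionary keyed by entity (standing in for something like one Redis hash per user). The feature name, the key format, and both helper functions are hypothetical.

```python
# Hypothetical historical rows; in practice these might live in S3 or on Lustre.
offline_rows = [
    {"user_id": "u1", "ts": 1, "purchases_7d": 2},
    {"user_id": "u1", "ts": 5, "purchases_7d": 4},
    {"user_id": "u2", "ts": 3, "purchases_7d": 0},
]


def materialize(rows):
    """Build the online view: only the latest feature values per entity."""
    online = {}
    for row in sorted(rows, key=lambda r: r["ts"]):
        features = {k: v for k, v in row.items() if k not in ("user_id", "ts")}
        online[f"user:{row['user_id']}"] = features  # later rows overwrite earlier ones
    return online


online_store = materialize(offline_rows)


def get_online_features(user_id):
    """Single key lookup, the access pattern a deployed model needs."""
    return online_store.get(f"user:{user_id}", {})
```

The offline rows keep full history for training and analysis, while the online view holds one small record per entity so that a serving path only ever does a key lookup.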

What's interesting to me though, is what's not on paper. And this is the reason I personally, as a developer, really like working for Redis: the ecosystem. The Featureforms, the Feasts, the client libraries in 30-plus different languages. It's probably already in your stack as a cache, so you're not adding it as a dependency. There's the number of developers available to hire who are familiar with Redis from some other job they've already done. And there's no cloud lock-in, no hardware lock-in; it works just about anywhere. All of those things are building towards this new vision that we have for what we're rolling out in the MLOps space, particularly with the addition of Redis Stack. If you haven't heard of Redis Stack, it essentially lets you try out some of the features of Redis Enterprise in the module system without really even using it. You can just download the Docker container; I actually just made a demo with it. It allows you to start integrating those third-party solutions with the module ecosystem, which is just really beneficial.

You're talking about RedisBloom for Bloom filters, RedisTimeSeries for time series data, and RedisAI for model storage and inference with PyTorch, TensorFlow, and ONNX, which is the one I used back at HPE. RedisJSON and RediSearch are also two incredibly popular ones. RedisJSON essentially turns Redis into a document database, and with the OM clients for object mapping, you can use really popular frameworks, like FastAPI and Pydantic, to store or retrieve your data from Redis in that kind of nice SQLAlchemy-like way, and a lot of developers really like that experience. That's what we're hammering home over the next few releases: that vision for the module ecosystem.

But what I mentioned last is the one I'm most excited about, which is RediSearch. The vector search capability that's coming out in RediSearch is actually in public preview right now, and essentially it's an online storage format for vectors. If you're not familiar with vector embeddings, they allow you to take some piece of unstructured data and turn it into basically just a list of numbers at the end of the day, say a 512-byte vector, that you create with models like Hugging Face Transformers or, if you're using text data, Sentence Transformers. What that allows you to do is store and compare those vectors in an indexed database, traditionally called a vector database. But with RediSearch, Redis becomes that vector database, and it requires no extra third-party solution and no extra dependencies in your workflow. It comes with the whole open source gamut of support that Redis does. And really all we did is add those indexing storage types to Redis.

And we have two different index methods, brute force and HNSW, but it's a massive benefit for the utilization of unstructured data. In general, that's one of my favorite areas that's really becoming popular in machine learning, because there's so much unstructured data out there that we haven't necessarily always benefited from, because it's not in a tabular data format. It doesn't fit into Excel, and data analysts who aren't engineers can't benefit from it. Well, unstructured data brings us this opportunity to have a data format that we can present interfaces on top of, interfaces that really allow non-technical people to benefit from massive amounts of unstructured data. And I think that's a really interesting emerging area of MLOps.
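
The brute-force index method mentioned above can be illustrated without any database at all: score the query vector against every stored vector and keep the best matches. This is a plain-Python sketch of the idea, not how Redis implements it; the document keys and the tiny three-dimensional "embeddings" are made up for illustration. HNSW, the other method, trades exactness for speed by searching a graph of neighbors instead of scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, index, k=2):
    """Score the query against every stored vector and keep the top k keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones come from a model and are much larger.
index = {
    "doc:cats": [0.9, 0.1, 0.0],
    "doc:dogs": [0.8, 0.2, 0.1],
    "doc:taxes": [0.0, 0.1, 0.9],
}
nearest = brute_force_knn([1.0, 0.0, 0.0], index, k=2)
```

The scan is exact but costs one comparison per stored vector, which is why approximate indexes become attractive as the collection grows.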

Simba: Yeah, it's so fascinating to watch all these different spaces and use cases. I mean, a vector database is something that I've probably had to build or hack together five times in my career already. When we open sourced a little thing in this space, it was funny: even with the first release, which was very hacky, people were like, oh, thank God, I've built this a hundred times. So I'm really excited to see that Redis has built something like that, built into Redis, I guess.

And it gets to the next thing I want to talk about. You already touched on platform versus best in class, and Redis is interesting because when you talk about online stores and features, Redis kind of looks like a best-in-class point solution, while all of the things you've mentioned adding kind of turn it into a data platform. I'm curious to learn more about how you think the space will evolve in this platform versus best-in-class way, and how you even think about one MLOps platform versus connecting maybe a dozen best-in-class MLOps solutions. When does one make sense? When does the other make sense? What will it look like in the future? Will platforms all be best-in-class point solutions? Will they coexist? Just curious to hear how you think about that.

Sam: So basically what's changed over the last couple of years in this bifurcation of MLOps solutions and platforms is that people actually started using machine learning in production and driving business value from it. When that happens, there's a margin that MLOps companies can make from optimizing the ways in which other companies utilize and benefit from machine learning. Hence the creation of MLOps companies; if that business wasn't there, MLOps companies wouldn't exist. And what's happening is that these MLOps companies are also targeting smaller companies that can't do everything themselves. That's why you see both point solutions and platforms succeeding in the market: there's such a breadth of companies benefiting from ML, with so many different needs because of the different company sizes out there.

And I don't think this is slowing down. I also don't think that one is going to win out over the other. I think there are always going to be smaller companies and there are always going to be larger companies. And the truth is that both point solutions and platforms have a place in that market, especially if it keeps expanding, and we all know the trend for this kind of market is going up. I do think there will be winners and losers, I'm not denying that, but I think there will always be a place for both types of solutions. Speaking about Redis, because obviously I know it well: Redis on one side can be your point solution. If you want to build your ecosystem around the bare bones of an open source key-value store, you can do so. If you want more of a platform, you can plug in nine different modules and make it do exactly what you want, and then it's your data platform; it can be your durable database if you want.

And that is how I see a lot of these point solutions actually evolving: they will grow to outpace the market they've been successful in. Some of them will adapt and grow their feature set to the point where they can become the platform. But I do think some of them will stay that point solution and most likely succeed for a short amount of time, until some better point solution becomes best in class. I think both are around to stay for quite some time.

Simba: More specifically, what does the MLOps space look like in your mind? I mean, no one really knows, but I'm curious to hear your perspective. How will it look in, let's say, three or four years? How will it have changed? What do you think will be different?

Sam: I do think that there will be more options that are affordable. That's the first thing. There are a lot of things today where, if you want the enterprise version, take Determined or any of these hosted services, you really need a reason or a system that is quite out of reach for some smaller companies. And it's interesting: there are a few different players in this space that I think are set to grow a lot more over the next three to four years, and I can see roadmaps ahead of them that are very clear. For instance, Hugging Face: I think they're making a really, really smart choice to basically become the GitHub of models. It's never been easier to pull down a pretrained model and benefit from it. That is going to continue; I do know that, or I do believe that, I should say. I do believe that the open source push in MLOps will continue. More and more people are finding ways to benefit from their product being open source and having integrations readily available, because those APIs are available to other users.

And an example like Hugging Face really shows you where that succeeds: being able to use your Hugging Face model and instantly benefit from it. Somewhere down the road, you're going to end up giving Hugging Face some money if you keep using it, because either that or their business value isn't there, or their [inaudible 23:15], I should say. But something I do believe is that those companies that are building communities around their product, like the model hub or repo, or providing spaces for people to have discussions, have that stickiness factor that keeps companies around. And I believe those companies are doing the right things. The ones with really large open source ecosystems with lots of developers, they just don't go away.

So I can tell you that I do believe those companies will stick around for the next three years. From a broader perspective though, backing out: right now in the MLOps space there are lots of different tools trying to do a lot of different things. Some of them will become irrelevant because of the advent of new technologies. Say, for example, that an exceptional model comes out that never drifts. I know this is kind of wild to say, but say some researcher comes up with a better version of XGBoost, and no matter what, it doesn't need active learning, it never drifts, it's almost perfect. Well, every model monitoring company takes a hit at that point. And even though it's kind of a silly example, every one of these companies still has exposure to that kind of research coming out.

And I do think that machine learning moves at that pace. I mean the deluge of papers that I get into my newsfeed every morning that I'm trying to keep up with is just astounding. So I think that's the other side of things that I'd like to point out from a kind of higher level is that this research moves really fast and that some companies will find it hard to live in that world after that research comes out.

Simba: Yeah, it's really interesting. One thing we've seen is, we released support for embeddings and vector search, so beyond just doing the actual storage, giving you the API in the feature store so that you can actually do a nearest neighbor lookup. And that's been a huge thing, because embeddings are just becoming so common. We used to do embeddings at my last company for recommendations, and back then it was very, very unusual to see people actually putting embeddings into production, especially outside of NLP. Nowadays it's everywhere, and there are even arguments that in the future most architectures will look that way, especially at big companies that have the expertise and the scale of data to pull that off. And that changes the whole ecosystem around it, right? Vector databases probably weren't a huge problem, or something that could be a big thing, say five years ago, when it was like, yeah, there were no embeddings. But now it's critical. I mean, back then, having all those libraries was plenty, and now it's like, they're just indexes; we need a whole database around it.

Sam: And it is a really interesting space; it's because of that ease of use that people are starting to use it now. I mean, you can throw any text blob at a Hugging Face model and be like, oh, I have an embedding now that I can compare to the other thousands of embeddings in my data set, and provide you some similar piece of text for document retrieval or information retrieval. And every year at NeurIPS, there are 30 more use cases that come out for that kind of unstructured data. So it's really interesting to see the need for those vector databases grow.

Simba: Yeah, and it's been interesting to watch. I mean, obviously Redis has been a critical open source project for a long time now. And watching the ecosystem move: the early MLOps companies were almost all proprietary. I'm having trouble even thinking of any that weren't, and nowadays there are just so many open source projects coming out, which is great for the ecosystem, I think. And yeah, I can definitely see how in the future... I mean, people like open source because it's almost freemium for engineers.

Sam: Yeah, that's a good way of putting it.

Simba: And it's like, I own it. I own the data. Sure, I have to do more work, but I know I can keep using it for free forever. And if I have budget, and this product is critical to us, and there's functionality on top of the open core that we need, we'll pay for it. So yeah, it'll be very interesting to watch all this play out. What are you most excited for? I mean, you mentioned quite a few things that you're excited about, but if you had to narrow it down to, hey, this is the thing I'm most excited about right now in MLOps, what would it be?

Sam: It's funny. I was looking at the Featureform website, and there's a line on it that says, “The days of Untitled128.ipynb are over.” And as someone who got into this field five or six years ago from working on HPC, I'm really excited that that's over. Honestly, when I put together this vector similarity demo (it's actually on GitHub; it basically uses some pieces of Redis to do visual and semantic embedding search, or similarity, I should say), it was so much easier than I thought it would be. Just the amount of products that can be produced, the amount of models that can be deployed, the process is so much easier than it used to be. You no longer have to save 128 .ipynb notebooks; they can be versioned and managed for you. I'm really excited about that in a general sense: the easier this gets, the better everyone's life becomes, not just engineers', because of the better products we can produce.

And I am really excited about that just from a high level; that is one of my favorite things going on right now. But like you said, it's also the open source component, and I think that has some different components too. People are really buying into the open source ecosystem for machine learning. It gives you that kind of trial API sense that we were talking about, but it also leads to greater productivity, because you can check out how it works. You can look into it, you can modify it if you need to and suggest a PR, and you can make something work with your system, because you can integrate it before you even decide to spend a penny on it. That is something I think is really beneficial, and I'm really glad that a lot of companies are embodying that open source push. And then the last thing is something we've already talked about, which is vector embeddings. There's never been more unstructured data, or better methods to utilize that unstructured data, than right now. And every second that we keep going, it gets better and better. I'm just really excited to see what the space looks like in a year or two years' time.

Simba: I'm also super interested to see some of these demos you've talked about. I'll get some of the links from you so we can link them in the description; I'm sure a lot of listeners would love to play around with some of what you're talking about. The untitled notebook is definitely something. I mean, I've even been in funnier situations where it's like, hey, there's this Google Doc where you just copy and paste SQL snippets from. I've seen a lot of different things. These are big companies doing this, companies that we would…

Sam: Oh, yeah.

Simba: …See as like, oh, these companies are so technical, there's this…So good at all this, and it's like, yeah, but they use Excel to like keep track of things. They may use Google docs to like keep for sequel snippets in different teams. So there's a lot of work to be done. There's a lot of work to be done, and I think people who are really keeping them MLOps space, I think sometimes like, oh, there's just so many tools. Everyone knows about MLOps. I went to a Fortune 500, I talked to a principal ML engineer and he had just kind of learned about the term MLOps not that long ago; it was something that was still…I mean, that's how early we were in the space where the biggest companies, they have ML engineers, but they're working on the proprietary thing that they built  a decade ago because they've been doing ML for so long and we're just kind of opening up to the rest of our ecosystem of like, oh, all these ideas, like a feature store, like we called it a data platform, but that makes sense, it's actually a feature store. So it'll be really interesting to watch all that play out. 

Last thing, what's the tweet length takeaway, almost like a TLDR that someone should take away from this podcast?

Sam: Interesting. A tweet length take away, can I do two? 

Simba: Yes. That's serious.

Sam: Can I do three?
 
Simba: Yeah. We can [inaudible]

Sam: Okay, yeah. Tweet thread. So I'd say for those in the application space who are using MLOps tools and benefiting from them, or just benefiting from ML: chances are someone's tried what you're doing before. Find them, talk to them, and learn from them. You don't have to do everything yourself. That would be for the application space. And then for MLOps services, or people creating platforms or point solutions: user empathy goes a long way. If you haven't done or tried to do what users are trying to do with your service or platform, acquiring new customers will be really difficult. So iterate, prototype, and learn as much as you can.

Simba: I love that both of those are about, I guess, how to work in the space and less about very specific things. I think that's where we should be, or how a lot of people in the space should be thinking. We're still in learning mode. We're still figuring out even just the basics.

Sam: Every day. 

Simba: Yeah, exactly. Sam, it's been so amazing having you on. Thanks so much for answering my questions and chatting with me. I'll include some of the links that you talked about, and there'll be a link to your LinkedIn and anything else you share in the description. Thank you again so much.

Sam: Well, thanks so much for having me. This was fun.
