Transcript:
[00:00:03.000] - Simba Khadder
Hey, everyone. I'm Simba Khadder, and you're listening to the MLOps Weekly Podcast. Today, I'm speaking with Doris Xin. She's the CEO and founder of Linea, the company behind LineaPy, which makes it easier to move your data science prototypes into production. She started her career at LinkedIn, where she was a software engineer. Throughout her career, she's worked on MLOps and ML before it was cool, across Databricks, Google, Microsoft, and more. Doris, so great to have you on the show today.
[00:00:32.620] - Doris Xin
Thanks so much for having me, Simba. Super excited to talk to you about MLOps.
[00:00:36.420] - Simba Khadder
I gave a little intro of you, but I'd love to get it from your side. What's your story? How did you get into MLOps?
[00:00:42.700] - Doris Xin
Great question. My journey actually started about a decade ago. I did computer science in undergrad, pretty much took all the AI, ML classes I could possibly get myself into, got really excited about the field. My first job out of college, I was what we now call a machine learning engineer. Back then, it wasn't even a title yet.
[00:01:03.600] - Doris Xin
The team had over 30 ML PhDs on it. It's really funny that your last two guests were on the same team at the same time as me. But that's where I really got my first exposure to industry machine learning, and also all the challenges around how you put models into production. Once you've done the experimentation, that's really just the first step.
[00:01:24.850] - Doris Xin
There are so many other things that you need to take on, including creating pipelines and creating dashboards, and that's really the bulk of the work that actually allows us to generate value from machine learning. This is where I really got into not only thinking about how we productionise machine learning, but how we build developer tools to support people in being a lot more productive and a lot more impactful with their data science and machine learning efforts.
[00:01:52.800] - Doris Xin
I ended up doing my PhD in machine learning systems with a pretty heavy focus on developer experience. My mission was to be able to help people iterate on models faster, be able to bring things into production faster. This is a mission that's been following me around for a decade, and it continues to follow me into what we're doing at Linea today.
[00:02:12.750] - Simba Khadder
For us listeners who don't know, what are you doing at Linea today?
[00:02:15.860] - Doris Xin
Thanks for asking that question. I am founder and CEO of Linea. We are an MLOps startup, just like Featureform. What we do is we're helping data scientists take raw development code, be it in Jupyter notebooks or raw Python scripts, clean it up, refactor it, and turn it into data pipelines, so we can really shorten the cycle from development to production.
[00:02:38.780] - Simba Khadder
I really want to jump into notebooks, just because notebooks are such an interesting and polarising topic in MLOps. Some people will go as far as to be [inaudible 00:02:49] like, "They should never be used. They should be as far away from production as possible," to the other camp of, "Notebooks are the atomic element of data science. That's what we should be working off of." How do you view notebooks in the data science process? How should they fit into production machine learning?
[00:03:08.270] - Doris Xin
Like many others, I also have very strong opinions on this front, but perhaps in a slightly different way. Notebooks are extremely well suited for data science development. Data scientists are constantly looking for really fast insights, and they need to iterate very quickly on their data. Notebooks, with their incremental views and incremental execution, are really well suited for that type of work.
[00:03:31.320] - Doris Xin
However, when we're talking about production environment, you have a very different set of concerns. We care about scalability, we care about stability, we care about all sorts of other good engineering practices. While you can repurpose your notebooks to be able to fit what you need in production, it really feels like we're misusing a very powerful tool that's meant for a different purpose. Because of that, I think, again, notebooks are great for development. Putting them into production simply feels like it's the wrong form factor for what needs to go into production.
[00:04:06.130] - Simba Khadder
I'm in the same camp. I think notebooks… The name "notebook" references the scientific tool: using a notebook to journal, keep track, think, experiment, and document everything you're working on. But productionising is a whole different game. I guess, how would you frame the workflow then? If I'm a data scientist, I'm working in my notebook, I'm coming up with insights, ideas, features, models, everything. I want to start getting things into production. How do you think of that workflow? How should it work?
[00:04:37.160] - Doris Xin
I think what we really need to go from development to production is being able to reproduce what's happened in development in a high-fidelity way. The issue with notebooks there is that because you're able to execute cells out of order, modify cells, and overwrite your work, a lot of times the end result for a notebook is not something that actually allows you to reproduce the exact model that you are hoping to productionise.
[00:05:03.340] - Doris Xin
This is something that Linea has a pretty unique solution to: while you're working, there is an execution history that fully captures how you got to the final model. If we're able to capture intermediate states, if we're able to eagerly capture everything during development, there is a path forward for revisiting the entire session and figuring out the necessary subset of operations that led to the final, what we call, artifact. An artifact can be a model, a chart, a dataset, what have you, whatever a data scientist is working towards.
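To make that concrete, here is a minimal sketch of the artifact workflow Doris describes, using LineaPy's `lineapy.save` API. The data path, model, and artifact name are illustrative, and it assumes the session was launched with LineaPy tracing active (for example, via `%load_ext lineapy` in Jupyter):

```python
import lineapy  # assumes LineaPy tracing is active in this session

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Messy, exploratory work: cells may have run out of order, variables
# may have been overwritten, dead ends abandoned along the way.
df = pd.read_csv("train.csv")  # illustrative path
X, y = df.drop(columns=["label"]), df["label"]
model = LogisticRegression(max_iter=1000).fit(X, y)

# Declare the end product as an artifact. Because LineaPy has been
# eagerly capturing code and runtime state, it can slice the session
# down to only the operations that produced `model`.
artifact = lineapy.save(model, "churn_model")

# The cleaned-up, minimal code that reproduces the artifact:
print(artifact.get_code())
```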
[00:05:39.150] - Doris Xin
The key difference here is that we embrace the chaotic nature of development. We understand there's a lot happening, and we really don't force data scientists to be constantly thinking about software engineering. "I need to clean up my code. I need to make sure that I record everything." That's just not what data scientists need to focus on when they're developing data insights.
[00:06:02.270] - Doris Xin
We take an approach of, "Let us worry about all of that. Let us eagerly capture everything." We understand that most of the stuff is going to be throwaway, so we have a very intelligent way of saying, "We can figure out automatically what is the important subset of operations that will eventually make it into production."
[00:06:20.280] - Simba Khadder
You've touched on it a little bit. But just so it's clear, I guess, what's a unique take you have on MLOps? In designing Linea, what's the opinion you hold strongly that you think maybe isn't currently obvious to MLOps practitioners?
[00:06:37.430] - Doris Xin
We see a pretty big divide between the development tool stack and the production tool stack, and there is a constant need to be able to translate from one to the other. The key insight that we have is these tool stacks, even though they look very different, they're describing the exact same logical workflow. If there is a way for us to be able to capture that and be able to distill that out of the development side, there is a path forward where we no longer need people to essentially translate themselves into a different tool stack.
[00:07:08.880] - Doris Xin
That's the key insight that allows Linea to bridge the gap between development and production by automating a lot of the engineering concerns around cleaning up notebooks, cleaning up scripts, and then translating them into pipelines that can be deployed into production.
[00:07:25.860] - Simba Khadder
Maybe jumping into a slightly different topic. I really like what you're saying because I do think one thing I've seen, at least, with a lot of MLOps practitioners, is they don't really empathise with the data scientist, who's really the end user. A lot of what we focus on—and clearly, what you guys focus on, based on what you're saying—is trying to meet data scientists where they are and not trying to force them to, like, "You're doing everything wrong. You should be writing this in Java."
[00:07:52.000] - Simba Khadder
One thing that has been really interesting for me to see has been almost an explosion of MLOps in the past few years. You've been doing MLOps, like you said, for a while now, and you've seen how it's changed over time. You've seen the term come into existence. Why do you think that is? Why did MLOps blow up in the last couple of years? Why didn't it start earlier? What's the "why now" of MLOps, in your opinion?
[00:08:18.040] - Doris Xin
That's a really great question, and a very big question. A couple of thoughts here. The first one is really about, a decade ago, it was only the really big tech companies that were able to build a team and have enough data to be able to create machine learning products that are actually generating value in production.
[00:08:36.860] - Doris Xin
What we're seeing now is that that barrier to entry is much lower. You're seeing an exponentially growing number of companies who are able to start using machine learning and data science to impact their business metrics in a very meaningful way. With that come the challenges of... A lot of these teams are fairly small. You have a handful of data scientists. But as you and I know, productionising machine learning takes a whole village.
[00:09:07.320] - Doris Xin
You really need a lot of infrastructure around engineering, around continuous development and continuous integration, that allows your machine learning products to continuously deliver value in production in a healthy and safe way, really. This is where we're starting to see the realm of MLOps really take off, especially around developer tools, because we are, in a way, democratising data science and machine learning to a much wider audience.
[00:09:40.860] - Doris Xin
Because of that, we are in dire need of standardisation of the entire field, and that's where I believe MLOps came to be. We're starting to have a common vernacular around how we think about the different components that go into a machine learning application, and how we think about the different tooling that needs to be created, standardised, and made available.
[00:10:06.010] - Doris Xin
I think we're still fairly early in the MLOps journey here. Standardisation is still very much a big question mark. I think there's a lot of desire to move in that direction. In terms of our progress, I think we're still learning the exact set of steps, the exact set of tools that need to be built in order to support the end-to-end lifecycle.
[00:10:27.410] - Simba Khadder
One thing you said was really interesting. I totally agree with the premise that we're super early in MLOps. I think if you look at the number of tools and categories, people might feel like, "There's so much. It's so noisy. It's saturated." I actually have the exact opposite feeling. I think there's so much because there's such a big problem to be solved, and no one knows exactly how it's going to look.
[00:10:50.580] - Simba Khadder
In a lot of other spaces, we can look at people and be like, "That's how it should be done." Like with DevOps, you look at Google, they're doing amazing. That's like the gold standard. With ML and MLOps—at least in my experience, I've talked to a lot of companies, and a lot of companies do things great—but I haven't met one company yet where I'm like, "They've got it. This company has it nailed. End to end, this is how everyone should do machine learning."
[00:11:13.340] - Simba Khadder
I think there's just a lot of growth to be had still in the whole space. I think we're just at the point now where we're starting to give these concepts names. Take feature stores as an example. I built them before, but we used to just call them data platforms, machine learning data platforms. The term feature store didn't really exist until recently. We're just getting to the point where we're naming everything, but we're still figuring out what the stack really looks like and what a real example, a real use case, looks like.
[00:11:41.110] - Simba Khadder
One thing I want to zoom in on that you talked about is that it takes a village, in your words. I totally agree. I think a lot of the problems to be solved are actually organisational scaling problems. Organisational problems tend to be the biggest headaches, at least in ML teams. How do you think about that with what you're doing at Linea, and just in general? How should a data science organisation using MLOps function? How do you scale machine learning across many teams?
[00:12:07.090] - Doris Xin
When we think about the teams, I think a couple of different personas come to mind. You have the data scientists, who are mostly responsible for development, and you have the data engineers or ML engineers, who are responsible for taking the output of the data scientists and doing all the packaging, all the translation, and then productionisation afterwards.
[00:12:27.760] - Doris Xin
I think this is going to change in the next few years for a couple of reasons. One is that not every team has access to data engineers and ML engineers. I've heard that it can take upwards of 18 months to fill one of these headcounts. That means until you have that talent gap filled, a lot of your data science investment is simply investment; there is no return on it.
[00:12:52.190] - Doris Xin
Even if you do have a team of data engineers, you still have an extremely high handoff overhead between the two personas. A lot of times, making sense of what happens in production requires a lot of data science domain knowledge. With development, sometimes you also need to keep an eye on productionisation concerns. Really, we're starting to see more and more of what I call the full stack data scientist.
[00:13:18.740] - Doris Xin
You're starting to see one person scaling more horizontally across the entire lifecycle, and then the data engineers and the ML engineers become more infra people, who can scale themselves much better by creating tools that are commonly shared across many data scientists, rather than building pointwise solutions to support one or two data scientists in a very ad hoc fashion.
[00:13:41.900] - Simba Khadder
Do you think that unification continues to happen? Do you think data scientists' ownership will continue to grow as the tooling gets better? Or do you think the data scientist's role will instead fragment into other specialisations?
[00:13:56.200] - Doris Xin
I see unification as the trend of the future for a couple of reasons. The first one is that data scientists care about impact. I don't think they're happy to settle on, "I did my development. This is the end of my job. I'm going to throw it over the wall. If it makes it into production, great. If it doesn't, I've done my job." I don't think data scientists operate in that way.
[00:14:19.300] - Doris Xin
That means we're going to see data scientists wanting to take on more of the lifecycle, wanting to unblock themselves, and wanting to have more visibility into impact and into production. This is, I think, going to be a key driving force towards having more full stack data scientists.
[00:14:39.250] - Doris Xin
For the longest time, we have been seeing the unbundling, the segmentation, of not only the data science teams, but also the infrastructure itself. We're starting to see more and more pointwise, best-of-breed solutions. But I think this fragmentation is not going to be sustainable. I think we're looking at pressing needs, actually, for this reunification to happen soon.
[00:15:03.190] - Simba Khadder
Do you think the future is more MLOps platforms? Or do you see it as a set of best-in-class tools? You said, obviously, that you think the data scientist role will unify. Do you think the tooling will follow that same trajectory, and over time we'll begin to consolidate into MLOps platforms?
[00:15:21.020] - Doris Xin
I think we'll definitely start to see some form of consolidation. The caveat here is that we're not going to see the end of best-in-class pointwise solutions. What we're going to see is that we're going to be a lot more thoughtful around how to build integrations that allow us to take advantage of these tools in a way that doesn't just build up a ton of management overhead on the part of the data scientists and data infrastructure teams.
[00:15:46.780] - Doris Xin
In that sense, we're moving towards platforms, but not in the traditional sense of, "This platform is going to come in and replace all of your data science infra and be able to serve all your needs to the best of its ability." It's really going to be about creating a platform more like a common substrate for us to be able to plug in really good pointwise solutions and be able to swap them out as the team grows, as the use cases evolve. But there is a common platform in the sense that you no longer have this chaotic bag of tools that people are constantly juggling.
[00:16:22.440] - Simba Khadder
It's really interesting how you're putting it. I have a very similar belief. I have traditionally put it in a slightly different way, which is that there will be almost like a Kubernetes, this operating system for data science or machine learning. I'm not sure what it will be, but something will become the base operating system and abstraction that everything else builds off of.
[00:16:44.510] - Simba Khadder
I think maybe part of the problem now is... Like feature stores, as an example. There are a handful of feature stores in the market. I think they all look very different. I actually think they all have very different definitions of what a feature store even is. It's not even really clear. I think the same could be said of monitoring. The same could be said of a lot of these categories that have multiple players, where each player looks quite different. They call themselves different things; they focus on different parts of the problem.
[00:17:13.100] - Simba Khadder
It almost sounds like, and let me know if I'm missing it, what you're saying is that, over time, the categories will clarify. It will be like, "Yes, this is what a feature store is, and this is how it fits into the bigger stack," as opposed to nowadays, where it feels like there's all this tooling that looks like weird puzzle pieces, and it's your job as a team to take all these jagged puzzle pieces and smooth them out and fit them together.
[00:17:40.090] - Doris Xin
I think I largely agree with what you said, but I want to tweak the position just a little bit. The reason that things are a little chaotic right now with ML... Or not a little, a lot chaotic in the MLOps space, is that we don't seem to actually agree on the set of problems that we're solving. Each of the tools is trying to pick up a subset of the problems that we may or may not agree on whether we need to solve.
[00:18:01.260] - Doris Xin
For example, some teams don't worry about real-time serving. For some other teams, their model or data never drifts. For other teams, scalability isn't an issue. I think we're starting to see a bunch of tools that are trying to package multiple problems into the same solution, and we're having a hard time understanding whether there is a clear one-to-one mapping between the set of problems that my team cares about and the set of tools out there.
[00:18:31.360] - Doris Xin
I think the clarification of categories probably will come in the flavour of problems first, and then we'll start to see standardisation around, "These are the sets of tools that will solve these problems." Then we'll start to see how integration happens, because the problems are naturally interconnected and therefore the solutions will have to be really smart about figuring out integrations.
[00:18:53.980] - Simba Khadder
I see. I think that's spot on. When I talked to James Alcorn about this, who's actually an investor in both of our companies, one of his takes that I thought was interesting was that there will be these best-in-class point solutions, but there will also be platforms, MLOps platforms, that focus on specific use cases. Computer vision looks very different from tabular data doing fraud detection or something. Across all these use cases, the problems that are sharpest are just very different, and so the tools they want are different. That's the only place a full-on, opinionated MLOps platform makes sense.
[00:19:28.420] - Simba Khadder
I think on the other end, like what you're saying, I think the problem to be solved in most cases… I'm very curious to get your take on this. When I talked to David Stein about this, he was like, "The big thing is naming." Naming in the sense of creating abstractions: making a feature—take a feature store as an example—not just rows of data, but something with a name and a version. Once you have an abstraction, you can build on it. It feels like a lot of the problems to be solved are abstraction problems, like building the right abstractions that people can build on top of. Do you buy that? Is that fair?
[00:20:03.520] - Doris Xin
Yeah, absolutely, 100%. I think this is what motivates what we do at Linea, and LineaPy, the open source project, as well. It's really about building smarter abstractions at a level that sits above all the tools. There is, like you said, a fundamental set of jobs to be done, a fundamental set of tasks that these tools need to fulfill, so if we can describe the problems and the needs at these higher levels of abstraction, then it becomes much more productive for us to start talking about solutions.
[00:20:35.670] - Simba Khadder
I think so. I think we made up the term virtual feature store to describe what we do. But my maybe hot take is that every MLOps tool is a virtual something. What's needed is not better compute infrastructure or whatever. What's needed is better abstractions on top of the infrastructure that already exists. We don't need better notebooks; we need better abstractions over notebooks and the other things that fit into what will eventually become the standard ML lifecycle.
[00:21:03.860] - Doris Xin
Yeah, I agree with that 100%. Speaking of meeting the data scientists where they are, there is a mental model there that's at a much higher abstraction level than, "Oh, I'm using notebooks. I'm using this very specific orchestration engine." It's all about what I want to get done, and being able to capture that at a more appropriate abstraction level is still key.
[00:21:28.350] - Simba Khadder
I want to bring it back to Linea. Obviously, since there's an open source project, we'll have a link to it. People should definitely go check it out. It's really cool. I know you mentioned that the proprietary solution is coming out soon. I'd love it if you could maybe give a sneak peek of what people should expect and be looking for over the next few months.
[00:21:46.860] - Doris Xin
For me to answer that question, maybe I'll give a quick overview of what we're doing with the open source solution, and then it'll make more sense what we're trying to do with the platform side of it. Very quickly, LineaPy, the open source solution, is a bridge between data science development and production.
[00:22:03.810] - Doris Xin
It's a very easy-to-use Python library that sits in your development runtime and automatically captures all the code, as well as runtime state, while you're working. This allows us to build the implicit DAG that the data scientist is building as they work. We call this the Linea graph.
[00:22:23.750] - Doris Xin
Using this graph, we can do a few very powerful things. The first is that we can analyse the graph to figure out the subset of operations that happened, which is probably less than 10% of what actually happened during development, that led to the final product. The other piece is that, because we are modeling everything as a DAG, we can have a pretty straightforward translation from what happened during development into the DAGs that are actually supported by workflow orchestration engines.
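As a sketch of that translation step: LineaPy exposes a `to_pipeline` call that turns saved artifacts into DAG files for a target orchestrator. The parameter values below are illustrative, and the exact signature is worth checking against the project's docs:

```python
import lineapy

# Walk the captured Linea graph, slice out the code behind each saved
# artifact, and emit a runnable DAG for the chosen orchestrator.
lineapy.to_pipeline(
    artifacts=["churn_model"],            # names passed to lineapy.save
    framework="AIRFLOW",                  # target workflow engine
    pipeline_name="churn_model_pipeline", # illustrative name
    output_dir="./airflow_dags",          # where the generated files land
)
```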
[00:22:56.680] - Doris Xin
This goes back to your point about building the right level of abstractions. The abstraction isn't an Airflow DAG or a Ray DAG. The abstraction is my data pipeline, as I expect it to be spec'd out at the logical level. That's, at a very high level, what the open source solution does. A couple of really powerful things it also does: we're able to extract common components across multiple artifacts based on our understanding of the workflow.
[00:23:25.570] - Doris Xin
Now, all of a sudden, you get for free a lot of these reusable components that can either be imported into future development, as a macro or shorthand for something that's already been tested in previous development, or pulled directly into future data pipelines. That's what LineaPy does.
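A small sketch of that reuse path, again against LineaPy's artifact store (the artifact name is illustrative and carried over from the earlier example):

```python
import lineapy

# In a later notebook or pipeline, retrieve a previously captured artifact.
artifact = lineapy.get("churn_model")  # optionally pin a specific version

# Reuse it either as a value or as the vetted code that produced it.
model = artifact.get_value()   # the trained model object itself
print(artifact.get_code())     # the sliced code, reusable as a component
```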
[00:23:47.880] - Doris Xin
You can probably guess, then, where the platform side comes in. Right now, we're assuming with LineaPy that you already have a pretty robust in-house production environment. But that's a tall order for a lot of teams out there. With the Linea platform product, we want to bring to the many teams that currently don't have the data engineering resources an out-of-the-box, push-button production platform that LineaPy, the open source product, integrates with very seamlessly.
[00:24:17.400] - Simba Khadder
I'm super excited to see it as it comes out. I think it will be interesting. The open source project was really interesting to me beyond how cool it is as a product. I think it was just a new take. I feel like things were starting to get a little bit... The categories are starting to form. I felt like the categories that have formed are a local minimum and not actually where we'll end up. I always love to see almost new categories get created. If you had to give a name for the category that Linea falls in, what would you call it?
[00:24:46.690] - Doris Xin
That's a really tough one. I guess we throw around the word transpilation a lot, like translate and compile. We think of LineaPy, the open source project, as this transpilation layer between data science development and production. But with the platform solution, I think it's really back to the original theme that you mentioned, of unification.
[00:25:09.830] - Doris Xin
How do we unify without forcing people to adopt a very heavy-handed platform solution? It's all about being able to create a common substrate, so maybe a data science substrate, or data science platform substrate, that allows us to support seamless integration across multiple solutions. Featureform, for example, could be a solution that gets plugged into the Linea platform, which can then also plug in other things, like data monitoring and a suite of other things that data science teams require.
[00:25:44.580] - Simba Khadder
It's like a virtual compiler. It's cool. Like an MLOps compiler. It is really interesting, automatically being able to take things to that next level of abstraction, just implicitly. You can just look at it and be like, "Cool, I get it. This is what you're trying to do," and take it to that level of abstraction and give the right names to things, like the components you need to make this. Not just Python, or lines of Python, but actual machine learning artifacts.
[00:26:11.540] - Simba Khadder
It's really cool. I'm really excited to watch as the product continues to grow. I feel like I could keep talking to you forever about Linea and all the stuff that you're working on. This idea of the right abstractions, and what those are, is something I think about all the time, and I know it's something we've talked about in the past. But for people listening who want to go back and tell their team about this great podcast and what they learned, what's a tweet-length takeaway they should share about this?
[00:26:39.320] - Doris Xin
I think the key here is really about building bridges and breaking down walls. We want to bridge the gap between development and production. We don't want to throw things from the data scientist side over the wall to the data engineering side. I think what's productive is for us to figure out how to come together and work together with the right abstraction as the common language that we both speak.
[00:27:05.300] - Simba Khadder
I love that. I think there's a wall to be broken down and a new abstraction to be made of what a data scientist is and does, which you've talked about. I really like how you put it: breaking down the walls and building the bridges. It's almost like building the right abstraction of what a data scientist is, what a data engineer is, and how they all fit together. Doris, this has been such a great conversation. Thanks so much for hopping on and chatting with me.
[00:27:26.660] - Doris Xin
No, thanks so much for having me, Simba. I had a lot of fun answering your questions. Hopefully, folks have something interesting to take away from this.
[00:27:34.460] - Simba Khadder
I know I did. I'm sure some of the people will go and take a look at Linea. I think what you're doing is really awesome. I'm really excited to see the project continue to grow.
[00:27:42.860] - Doris Xin
Absolutely. Please go check out lineapy.org. We're on GitHub. Star us to be able to follow our updates. We also have our weekly updates in our Slack channel. You'll be able to find the link to the Slack on our GitHub page as well. We just have a lot of great things coming up, so keep an eye out for Linea and LineaPy.
[00:28:01.730] - Simba Khadder
Awesome. We'll have those links at the bottom. Thanks again, Doris.
[00:28:05.010] - Doris Xin
Thanks, Simba. Take care.