For Episode 23 of the MLOps Weekly Podcast, Simba chats with Maxim Lukichev, Co-founder and CTO at Telmai. They discuss the importance of a proactive approach to data quality, improving collaboration on data teams, and the critical role of DataOps in the era of large language models.
Listen on Spotify!
Hey, everyone. Simba Khadder here, and you are listening to the MLOps Weekly Podcast. Today, I'm speaking with Max Lukichev, who is the CTO and Co-founder of Telmai. He has a PhD in computer science, database theory specifically, and spent over a decade building SaaS products with a focus on big data analytics and ML in various startups at various stages, ranging all the way from inception to acquisition to IPO. Max, so good to have you on today.
Thanks for inviting me. Very glad to be here.
I'd love to start by learning about the inspiration behind Telmai. Why did you decide to build a company around this space?
Well, it's coming from my own wounds. I've been in this space for many years, leading engineering teams, building data solutions, databases, distributed systems, as well as higher-level things like entity matching and MDM, master data management. Data quality was always a big problem. You can make the system work and function and be fast, but the outcomes are often corrupted by bad data, which was a big deal back when I was at Reltio. We were building a master data management solution where an injection of bad data, even a small amount, could ruin the whole process of resolving different entities to the same thing.
A good example, a very common one, is phone numbers that got templated. Someone put 1111 for a bunch of profiles, and it results in a huge mismatch of things internally which requires very, very expensive resolution, like going into the database and fixing these things. That's where the idea of Telmai came in. Why do we fix after the fact? We need to prevent that bad data from coming in in the first place, before the expensive and sensitive systems get corrupted. That's how it all started.
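As a very rough illustration of the kind of profiling check that catches templated values like that before they hit entity resolution, here is a minimal sketch. It is assumed logic, not Telmai's implementation; the sample column and threshold are hypothetical.

```python
from collections import Counter

def flag_templated_values(values, max_share=0.01):
    """Flag values that cover a suspiciously large share of records.

    A legitimate phone number should be near-unique across profiles, so any
    single value covering more than `max_share` of rows (e.g. "1111") is a
    likely placeholder that would corrupt entity matching downstream.
    """
    counts = Counter(v for v in values if v)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {v: n for v, n in counts.items() if n / total > max_share}

# Hypothetical usage on a column of phone numbers:
phones = ["555-0100", "1111", "1111", "555-0199", "1111"]
print(flag_templated_values(phones, max_share=0.2))  # {'1111': 3}
```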
Why is it such a hard problem to solve? Data quality has been a thing we've talked about for a long time. I feel like we make all these strides in our ability to store huge amounts of data, but we continuously get stuck on data quality. Why do you think that is?
You're right, it's a difficult problem. To add to this, we've seen a big shift in infrastructure monitoring, which happened around 2011, 2013, when we all went to cloud. Data came in later, even though we've been working with databases for much longer. But part of the problem is really the variety, the complexity of the data itself. There are so many variants. There is such a big velocity. Data is constantly streaming in. There's such a vast amount of this data. Quite honestly, there aren't that many ways to tell you this is good data.
Companies for a long time were defaulting to either handwritten rules or manual checks of the data, which is not a solution in the modern world where everything is data. It's coming from everywhere in such a vast volume. Why is it hard? It's not a one-team effort, because it spans from the people building the pipeline, the system they're using, what kind of reliability tooling they have there, down to who the owners of the data are, the data product owners. Then sometimes it's data stewards, sometimes it's some other checks and data quality assurance and stuff like that. It spans across multiple areas of the organisation.
On top of the technical challenge of the sheer volume of the data and the metrics you need to collect from it, there are the organisational constraints: how teams communicate, how the processes and organisations are defined, and so on. It's a multidimensional problem. It's not just a technical problem.
Would you argue it's a people problem as well? Just how do we get everyone in all these different teams to work together cohesively?
That's pretty much where I'm heading, yes. It is as much a people problem as a technical problem.
Who are the players, not so much in your space, not the users at Telmai, but, call it, in an organisation? Who are the stakeholders and the people who are affected and have to solve this problem?
It will still be the familiar groups: data analysts, data engineers. Quite often we're starting to see new roles coming up: data product owners. These are product managers who own the data product, data as a product. Those are pretty much the groups interested in these solutions. The interesting part is how to make them efficient and effective in using these solutions and helping them in their daily life, [inaudible 00:04:29] something that's providing better insights, timely insights, and stuff like that.
When you think of, let's call it, the core group: obviously, you should care about data quality from the early days, but at what point do you typically see people start to bring in observability tools? As early as their first data person, or when they have a team? When should you really start thinking about a solution in the observability space?
What we've seen is that it really depends on the organisation. Another challenge comes with: should you build it or should you buy? We see a lot of organisations tend to build solutions themselves. This is fine. This is the right approach if you have the right resources and you have a lot of customisations, but very soon you realise that, again because of the variety of data, because of the complexity of the architecture, so many integrations and stuff, building that is probably not core to your business. You don't want to be doing this.
At this point, people start bringing in solutions like Telmai and others in the market, trying to see what fits into their ecosystem. But we have also seen moments when a new data person comes in, someone in charge of the company's data, basically. They're named differently: sometimes VP of data, sometimes data product owner, and stuff like that. In these cases, they come with the concept, "I need data observability because I know what I'm doing, and I'm trying to build it from the ground up and put proper architecture around that." Even without a big team, data observability becomes an essential part of this journey. There are two sides there; it's not only one or the other.
What are the variables you should consider? Because there are so many problems to be solved. There are so many products in the DataOps space and the MLOps space. What are the symptoms of when a data observability tool is the right solution for you?
I think to make it a little bit more concrete, we need to mention that data observability itself is a very wide topic. There are so many different flavours of data observability, and it's not one-size-fits-all. Some companies care more about pure data pipeline observability, making sure that the pipeline delivers the data. If they know that the data itself might be machine-generated data or such, they may not have a lot of focus on data quality itself.
Other companies, for example data providers, would care a lot about what they are shipping to their customers, what kind of quality it is, whether the values are accurate and correct, and stuff like that, which is very different from data pipeline observability. These are different solutions, and it really depends on what is core to the company.
Oftentimes, we see that tools like Telmai resonate the most with true data products, where the data is the product being served by the company to their customer. It can be internal or external, but still, the data is the essence and the accuracy of the data matters a lot.
Got it. In general, it's almost like the point where you are thinking about your data as a product is the point where something like Telmai is going to become core. It's almost a requirement, because if it's your product, you want to understand it. It's almost like quality assurance: if you're selling this thing now, it had better be pretty standard, pretty good, pretty reliable.
Same thing as if you're developing your own product: you don't like it when your users tell you you have a bug in the product. You need to know about this problem beforehand and fix it. The same with data. You don't want to be in a situation where the buyers of your data say your data is crap, because if they say so, they don't really have a good way to measure how bad it is, and you don't have a really good way to measure how bad it is. It really impacts the reputation of the company, of the product itself. It's a very big impact.
You mentioned there are these new roles around product managers specifically for data, treating data as a product. This is a relatively new concept, I think. Data and databases and data theory and systems have been around forever, but truly treating data as a product is a relatively new idea. What are your thoughts about it? Is it correct? Should we all be doing that? Is it right for some companies? How do you think about that role and where it should exist?
I think we're still in the phase where it's settling down. There is still sometimes no clear separation of the roles and responsibilities between the data engineering team and data product management. That's why I would suggest thinking of them as really just different roles acting together, building a better product. Just separating them would not be the right approach at this moment. It's so new and so fresh, there are still too many unknowns. We try to treat them as part of the same group. They have slightly different skills, but the goal of what we are building is to bring them together so they can communicate and build a better product faster.
Got it. Your take: would you argue that we have overfragmented? You mentioned so many different types of roles and titles that are all data people. Do you think it's a bit overfragmented? Do you think there should be more consolidation, or do you think that we'll continue to get more and more specialised roles?
I would say yes, I agree with this. It's totally fragmented. Quite often, talking to customers, to data engineers, I hear: "What do you do when you detect inaccurate data?" "No problem. It's the product manager's problem." This is bad for an organisation. I know where it's coming from. It's coming from the deficiency of the tools and processes, but any company should act like, we care about the end result, the product, the end satisfaction of our customers, rather than, "Data delivered? Good. Not delivered? Okay, we have a problem."
When you think of a gold standard, when you think of how a data team should work, or you're architecting a data team, how would you go about it? What would be your sense of the best practice? I know it's a hard question, but I'd love to get your take on it.
I would try to apply my experience. I'm coming from application and system development, so I'd really try to follow a similar approach to how we build engineering teams. It's not just developers who are needed to build the product. It's developers plus QA plus product manager plus DevOps or SRE. It's really a combo.
It's very similar for a data product. You cannot just push it all on data engineers alone. There have to be data engineers, and there have to be people who are focusing on assuring the quality of the data using the right tools. It doesn't mean they're doing it manually, but building the stuff around that. There should be people who are driving what this data product looks like, what we are delivering, what the SLAs are, the customer experience with this data product, and many other things.
Got it. It's almost like there's a GTM arm figuring out all the concepts, almost like a product arm: what are we selling, how are we selling it, and what do we need to sell it? There are engineers who are, at its core, just building, and there's almost a QA function just because data has, by design, so much entropy that you have to constantly be fighting it. Even more so than software engineering, where a good set of tests and high test coverage will get you a lot of the way there; there isn't really an equivalent measure for a data system.
Yeah, and then there is this whole concept of DataOps somewhere spanning across these layers. Sometimes these are the same people; it's really just the process of the company. People are acting as DataOps; they're operating their data product, shipping the data. That's where we really understand that tooling is a very crucial part of this. Without the right tooling, it's impossible.
Yeah, I definitely think that most ops companies are just… Not just, but they're process companies, they're workflow companies, and a lot of the problems to be solved are actually people problems, which we talked about at the beginning. There's a lot of focus on, let's call it, tools and systems and practice. It's more like, "Hey, we need to work like a well-oiled machine," and throwing around a ton of spreadsheets and docs is not going to get us there. What can we do that's well integrated into how we work and also allows us to work in a more seamless way?
I guess in that vein, you hinted at this a bit, but we talked about observability in terms of: you find the error in the data, someone flags it, the QA team in this world would flag it, and then the data engineering team would maybe investigate, patch, and fix.
What's the general workflow? What does the feedback loop look like? If I'm using Telmai, how do I set it up? Is it automatic? Is there more hands-on setup? Then what does the workflow look like from… Obviously, there will be alerts, but where do I go from there? What's the recommendation you have?
This is a very complex question, and it actually has multiple parts. Let's maybe start with the feedback loop. On that one, we've been living for a while in a world where even if we detect the problem, fixing it and doing something about it is a manual action. Someone, some engineer, has to jump in and take action: stop the pipeline, make the fix, deploy, rerun the pipeline, analyse the data, remove the bad data, keep the good data and keep it flowing.
Now, given the velocity of these pipelines, how much data is being pumped, and the variety of this data, it's becoming a really big problem. It's a constant, non-stop action. Alert, alert, alert, go and fix, non-stop. It's very draining. What we believe is that a lot can be done to automate this. Not everything, but a lot can be done.
Seventy percent of these actions are rather simple: react to the bad data, stop the pipeline, restart the step, or split the good data from the bad data and let the good data flow while the bad data is parked for further investigation, stuff like that, which increases the velocity of the pipeline dramatically because you don't have this hard stop on every single thing. It's like slamming the brakes when you see a little duck walking near the road. Just keep driving in your lane unless it starts jumping under the wheels. That's the feedback loop.
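A bare-bones sketch of that split-and-park pattern might look like the following. It is purely illustrative, not Telmai's actual API; the checks and field names are made up.

```python
# Minimal sketch of "let good data flow, park bad data": records that fail
# lightweight checks are quarantined for investigation instead of halting
# the whole pipeline.
def split_records(records, checks):
    good, quarantined = [], []
    for rec in records:
        failures = [name for name, check in checks.items() if not check(rec)]
        if failures:
            quarantined.append({"record": rec, "failed_checks": failures})
        else:
            good.append(rec)
    return good, quarantined

# Hypothetical checks for a customer-profile feed:
checks = {
    "has_id": lambda r: bool(r.get("id")),
    "phone_not_placeholder": lambda r: r.get("phone") not in {"1111", "0000"},
}

batch = [{"id": "a1", "phone": "555-0100"}, {"id": "", "phone": "1111"}]
good, parked = split_records(batch, checks)
print(len(good), len(parked))  # 1 1 -- good records keep flowing, bad ones are parked
```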
The other part, when the problems are detected, that's where this is a people process. What is the channel for notification? What are the processes in the company? How do you react to these things? Most importantly, tools can help find the root cause quickly. Because with data, unlike infrastructure, where you can get a lot of sense from reading the code of the service that went down, a lot of the time you don't have permission to look at the data. There are a lot of organisational limitations and barriers to investigating the root cause, which involves more people, more approvals, more process, and completely slows everything down.
Having the right tools, providing the right evidence, the right metrics, correlating the metrics and what's going on, and helping figure out where the problem belongs, what the root cause was and in which table. Maybe there are 100 tables in the pipeline and this problem started right at the very beginning. Go to the right person and start working with them. The root cause analysis and the whole process is… Again, we are coming back: this is as much a people process and organisational process as technology. It cannot be solved just by technology, but technology can greatly help in speeding things up and embedding into this process.
It sounds like there are two components. The two major components are: one, the data is always going to have issues. That's just its nature. It's constantly going to have false alarms. Things sometimes drift for a good reason. I joke about how much data drift Amazon must have seen at the beginning of the pandemic. Pretty much every company probably had a dashboard of 4,000 red alarms because everything just changed overnight.
I guess one part is triangulation. What matters? What really matters? What's a P0? What's a P3? And being able to work that way so it's not treating everything as reactive. It sounds like what you're getting at is having a more proactive organisation and having a tool set that is also more proactive in the sense of resolving things. Root cause analysis is clearly something that's being invested in heavily so that it's not a linear operation.
Is it possible to get a compounding effect, and can it happen? Or is it just a linear process where, given the amount of data you have, you need to have n number of data people? Or do you think that things scale as you get better?
I actually think things can scale. With the right tooling, things can scale very easily, because the less effort these people have to put in, the more the tool itself does by splitting: "Okay, this is not an issue. This can be automated." Automatic restarts or stops of the pipeline, splitting the data and stuff; you don't even have a human in there for the most part. Then for the higher-severity issues where root cause analysis is needed, people get involved. But even then, they don't have to go and run and write different queries, because, again, data infrastructure is so complex that there are probably not many engineers who are experienced enough to know all of this stuff.
You're getting back to that little bottleneck of people who are constantly just firefighting and trying to figure out this problem, people who understand the system well enough. The more the tool can do by providing all this evidence and helping you get to the root cause, the better. You can really scale with a smaller team by eliminating a lot of the routine stuff that really doesn't need human intervention and letting people focus on the real big problems and real complex problems. I don't believe it's a linear function.
One thing you touched on briefly there was governance and provenance and access control, all the stuff related to that aspect of data. I think when people think of those teams and those processes, they tend to be processes that, for good reason, still keep people from doing their job, in the sense that they get in the way; you have to go through all this process. Do you think that has to be true? Do you think that is true by nature? How should governance, access control, and provenance be layered on top of everything we've been talking about? What dimension does that add to the problem space, if any?
It does. However, I believe these are complementary to data observability and data quality. Governance is still a very essential part of the data organisation, but I think it should benefit greatly from the quality information and the operational side of the data. Talking about, let's say, data catalogues and stuff: getting crucial information not only about what is there, how it is accessed, by whom, security aspects and cataloguing, but also about how good the data is. How that quality has degraded and so on is also very important for the organisation, because ultimately, I think most organisations are moving towards self-service data products. Without this component, that self-service will not be achievable.
It sounds like what you're getting at is that it's a core function. It is something that can, with proper tooling, become minimally invasive. That's the future of it. Is that fair? Is that a good way to think about it?
Yeah.
We mentioned data catalogues and, obviously, we've been talking a lot about observability. As you think about the core data stack, in the early days of DataOps I felt like there were a billion different categories of products. There's still a lot, but it feels like maybe it's starting to consolidate into some core categories. If you were running, let's call it, a mid-sized data team, and you were going to pick categories of tooling to think about and care about, which ones would you pick and why?
That's a good question. I would still go by what your organisation is building. Definitely, the data warehouses, Delta Lakes, and stuff are going to be one category there. The pipelines and the data catalogue are, I would say, proving to be an essential part of this whole thing, of this whole journey, as we can see with Databricks and their efforts with Unity Catalog and stuff.
On top of that, you would have a variety of different tools, which, again, as I said, is data observability, but there are so many different nuanced ways of doing data observability. There's going to be a large number of these tools there, for slightly different reasons. At some point, there will be a consolidation, I guess. A lot of this function will go to data cataloguing. A lot of data observability might actually come as built-in functions of data solutions like Delta Lake and data warehouses, as baseline functionality. I would say these three categories, but again, some consolidation will happen.
That makes sense as the core, and it tracks with what I'm seeing too, and I'm sure you're seeing: those have become the core tenets of DataOps.
The last topic I really want to dive into with you is this: everything we've been talking about so far is about data generically. Obviously, data plays a core function in analytics. It plays a core function in ML. I guess as of recent, it plays a core function in, let's call it, LLM/AI, this new paradigm around LLMs.
Do you think of it as a pyramid, like you need analytics and then you need ML and then you need AI? Or do you think of it as you do all of them together? Are the problem spaces around the data… How different are they? I guess I'm curious to understand how you think about it, how maybe the ML and LLM aspects change the data space, if at all.
I think they will, significantly. What we have seen so far in organisations across the world, after the early success of OpenAI, ChatGPT, and stuff, is that everyone jumped into this LLM journey, because it really demonstrated the capabilities of what kind of new generation of interfaces and products can be built using these language models. And it's not only language models; GenAI includes image generation, audio generation, and other things. But most importantly, enterprises now realise that there is a tremendous opportunity right now to build a new generation of tooling with new interfaces which are much more capable than the old chart-like, static type of interfaces, providing this new capability.
Now, if we look at analytics versus ML versus AI, I don't think any of these areas will cannibalise the others. But specifically for LLMs and the adoption of those concepts, we will see it pretty much everywhere, in every application. To add to this, data quality will become even more important, because we have seen that the cost of building the model, fine-tuned for your organisation, can be huge depending on the quality of the data you are feeding it for training, for fine-tuning.
We recently ran an experiment where we just… It's a toy experiment. We trained a model classifying internal company documents for an abstract company, and then injected various levels of noise, typical of what we see in customers' pipelines on a daily basis. The impact was very significant.
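For readers who want a sense of what such a noise-injection experiment might look like in miniature, here is a rough sketch. It is not the actual experiment described above; the data, model choice, and noise rates are assumptions for illustration only.

```python
# Rough sketch: corrupt a fraction of training labels and measure how a simple
# document classifier degrades as the noise rate grows.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def inject_label_noise(labels, noise_rate, classes, seed=0):
    """Randomly reassign a fraction of the training labels."""
    rng = random.Random(seed)
    return [rng.choice(classes) if rng.random() < noise_rate else y for y in labels]

def accuracy_with_noise(train_texts, train_labels, test_texts, test_labels, noise_rate):
    classes = sorted(set(train_labels))
    noisy_labels = inject_label_noise(train_labels, noise_rate, classes)
    vec = TfidfVectorizer()
    X_train, X_test = vec.fit_transform(train_texts), vec.transform(test_texts)
    model = LogisticRegression(max_iter=1000).fit(X_train, noisy_labels)
    return accuracy_score(test_labels, model.predict(X_test))

# Hypothetical usage, given train/test document lists and labels:
# for rate in (0.0, 0.1, 0.3):
#     print(rate, accuracy_with_noise(train_texts, train_labels, test_texts, test_labels, rate))
```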
For each of these three, AI, ML, and GenAI, if we split them, there will be, and there should be, a major focus on figuring out data quality, then the quality of the machine learning models, feature engineering and other stuff, how that is going, and other aspects. Because without that, we are doomed to work with inferior models and inferior quality of results.
Yeah, I guess the way I'm hearing it is that data is the foundation of everything. Even though we have ML applications, AI applications, analytics applications, they're all different, I guess, manifestations of data. From that perspective, data is everything. The importance of it has continued to move up and will continue to move up, and that hasn't changed with any of the new paradigms. If anything, it's accelerated. It sounds, similarly, like the new wave of LLMs, and let's call it the AI category broadly, is also going to have a foundational effect on how we work with data, because we're moving away from relatively static charts to something that is much more intelligent, I guess. It can do more.
Part of why it can do more is because, with all the tooling we've been building for the last 10 years, for example, all we do is generate more data. We keep generating more data, more insights, and more stuff. But in the ways we are representing this data, we're still using the old approaches. Very static. ChatGPT showed us that you can actually have a very meaningful way of summarising what was discovered, what's stored in these enterprise databases, and present it in a meaningful form to the user to make sense of, which is a big deal.
Overall, alerting was always a problem, just because of alert fatigue and stuff. But if you can summarise it all, if you can make it very clear and usable when you're doing root cause analysis, it's actually not a bad thing. If your language model can figure out what matters, show you only what matters, and help you highlight these things, that's a big win.
Yeah, it almost sounds like we had so much data, so we had to create all this DataOps tooling on top of it. But even that has now become so unmanageable as the data has continued to grow. This new layer allows us to summarise the 4,000 dashboards or whatever we have into much more actionable and useful information. It's almost like the dashboard we used to work out of is going to become the assembly language of the actual insight layer that we'll be working out of in the future.
It reminds me, there was a fun story back from my days at SignalFx, which is a monitoring system for infrastructure monitoring. We were monitoring ourselves with the software we were building. At some point, we looked at how many dashboards we actually had internally in the company. I think it was 800. There's no way anyone can make use of 800 dashboards. It's a create-only type of thing, which tells you that the overall approach of just creating more dashboards and more dashboards and more charts is just not scaling anymore. We have to have a better way of summarising this information. That's what happened last year. We saw the power of these tools.
Yeah, I'm very excited to see what happens. I think it's going to fundamentally change a lot of things, but especially how we work with data. I actually think that's going to become some of the lowest-hanging fruit. The fact that it's already a pretty noisy space and a lot of what we do is manual intervention means it just feels like such a natural place to start taking advantage of this stuff. Actually, a lot of the value that I'm getting, and that I see people getting, from GPT is much more about individual productivity, but I think we'll start seeing more tooling-oriented wins in places like this.
Well, Max, this has been an amazing chat. I've really enjoyed being able to learn more about how you think about the space and maybe get a more tactical view of how to think about and scale data teams. I think a lot of the information that's out there has been very focused on one of the verticals. Not a lot of people talk about how it all fits together, and it's been great to get your perspective on that and talk through it. Thank you for coming on.
Thank you so much. It was a very good chat. Thank you.