Data Scientists Need MLOps; Why I Joined Featureform

October 23, 2022

Developer Advocate

Who am I

Howdy folks! My name is Mikiko Bazeley.

Some of you might recognize me from LinkedIn or YouTube and already know my story, but for those who are new, here are the main bullet points:

Practicing for my greatest hits album cover.

‍

Before I joined Featureform, I worked as a Senior Engineer on the MLOps team at Mailchimp (which was then acquired by Intuit);

I’ve worked as a data analyst, data scientist, and MLOps engineer over my 7+ year career, and have also taught at bootcamps;

I’m based in San Francisco, where I was born and raised.

One of the most common questions I get every time I pivot in my career is “How did you know you wanted to build a career in X?”

The most recent variation has been “How did you and others working in production ML know you wanted to become MLOps Engineers?”

How I Got To MLOps

"Success is stumbling from failure to failure with no loss of enthusiasm" – Winston Churchill

Before I started working on production machine learning as an MLOps Engineer, I was a struggling data scientist.

And before I was a struggling data scientist I was an overwhelmed analyst.

And even before that, I was a completely confused and lost growth hacker.

‍

As an undergrad I initially attended UCSD with a rather vague idea of eventually going to medical school, taking classes in public health for fun while taking organic chemistry. However, I decided I wanted to understand humans at a more macro scale, like how we make decisions and codify practices into culture, and spent my remaining undergraduate years studying biological anthropology and microeconomics.

Without realizing it, I would go on to engage in the practice of ethnography (which I studied as an anthropology student) through intensive, hands-on experiential learning as a data and machine learning practitioner. Every stage of my career followed the same pattern: encountering a new environment, observing the pain-points of users in that environment, attempting to solve the pain-point, realizing the obvious solution was incomplete, and then inquiring into how to solve the “solution”.

After graduation I moved back to San Francisco to find a job where I could save a bit of money while figuring out my next steps. While working as a receptionist at a small hair salon I started to understand the challenges small and medium-sized businesses faced, especially when it came to incorporating new technologies that would supposedly increase their revenue, such as CRMs and Point-of-Sale systems. I started to become curious about the power of leveraging data to help grow cash-strapped brick-and-mortar businesses.

I started practicing for my greatest hits album at least 5 years ago as a growth hacker for Recruitloop.

The next chapter of my career after the salon was an introduction to Data™, big and small, structured and unstructured. I would go on to work as a data analyst for an anti-piracy company, as a financial analyst at one of the largest residential solar companies in the US (working on supply chain forecasting & sales modeling), and then as a hybrid data analyst/data scientist for the customer success team focused on BIM 360.

These companies were as different as could be (size, industry and data maturity) and yet I ran into similar categories of problems, like data access, navigating tribal knowledge and metadata about the data, and ensuring the insights and strategic recommendations I provided my key stakeholders (many of whom were the CXOs & VPs of their respective companies & organizations) were timely, consistent, & trustworthy.

The Customer Success Team of Autodesk BIM 360 modeling delicious gnocchi.

‍

Many of these challenges were magnified in the next chapter of my career, as a data scientist and then MLOps engineer. In growth and analytics, my primary responsibilities were to use data to enable visibility into the health of the business & assist in decision making, as well as provide diagnosis when needed.

‍

As a data scientist I was now responsible for developing external-facing, predictive models and answerable to many key stakeholders when code broke. Instead of data flowing in a single direction (from source through transformation to consumer) data needed to flow in multiple directions, like a linked daisy chain of multi-armed Lovecraftian horror monsters.

‍

Some of the biggest challenges I faced as a data scientist working at a digital adoption SaaS platform and a health devices company included the lack of engineering support, the difficulty in setting up and stitching tooling together, and coordinating the many moving pieces of a machine learning pipeline as a data scientist on an island.

‍

Deploying models is hard, especially when you don’t know what “good” looks like.

‍

It wasn’t until I joined Mailchimp as an MLOps Engineer that I began to see my experiences and hard-earned battle scars working on data and machine learning systems coming together as a career in and of itself. After joining Mailchimp and having the opportunity to be a part of a functional and effective organization successfully deploying machine learning features, I was inspired to dive even deeper into the world of MLOps.

‍

The Future of MLOps

“To achieve great things, two things are needed: a plan and not quite enough time.” – Leonard Bernstein

Although MLOps and production ML best practices and tools are still being developed, I’m particularly intrigued by the following opportunities and trends:

‍

Open-source and Open-stacks

My favorite question to ask experienced engineers is what they think are the three most important innovations to have occurred in the last 10 years. The answers usually center around open-source (as well as cloud platforms and general ecosystem maturity in certain languages like Python).

‍

Some of the most impactful open-source projects include Linux, Git, VS Code, Eclipse, Firefox, TensorFlow, PyTorch, etc. These projects have pushed technological innovation forward while relying on the (in many cases unpaid) collective efforts of thousands of individuals. The most important impact of open-source has been the acceleration of individuals, teams, and organizations throughout time and space.

For example, open-source projects have allowed me to bootstrap my data science and machine learning career, and have helped companies like Mailchimp, Teladoc and Sunrun leverage data science to power the economic prospects of thousands of businesses, improve health outcomes for individuals with chronic health conditions, and provide green energy alternatives using solar power.

And open-source is certainly not slowing down, especially with the adoption of open-source MLOps tools by companies wanting to open their stacks in pursuit of even faster innovation.

@GergelyOrosz - "What are Platform teams, why are they important, and why do (almost) all high-growth companies and big tech have them?"

‍

MLOps stacks as unique as their use cases

How does a platform or system change based on size, industry, and resources?

What does a functional and performant MLOps stack actually look like for a self-driving automobile enterprise?

Or even a solopreneur creating a small web product with potential?

A question I’ve been thinking about is “What does a mature MLOps system actually look like?” The common advice is “don’t build for Google scale”. But is that advice really good enough?

Personally, I don’t think so.

Everyone wants to be their best, and telling teams and organizations not to build Google-scale systems is like telling me, with my average female height and build, not to aim for the Olympics or the figure of a yoked, 5’8”, 220 lb bodybuilder.

Yes, we get it. I get it. So what should any of us be aiming for?

As James Clear talks about in Atomic Habits and in much of his content, there’s a difference between goals and systems.

For example, a bodybuilder might have the goal of reaching a certain body fat percentage and muscle mass by a certain date (for example, 10%-12% BF within 12 weeks for a female bodybuilder at competition time, or 4-6% for a male bodybuilder). The system is the diet, workout, & recovery regimen that ensures the bodybuilder will reach their target within the specified time.

Likewise, for an organization focused on deriving value from their ML systems, the goal might be to increase data science and machine learning output by halving the time to POC as well as time to deployment, while keeping operational and ad-hoc tickets at a consistent volume. The system in this case is the MLOps toolchain as well as the practices and structure for its continued development and maintenance.

However, every organization ultimately has different constraints, some of which are immutable. I am never going to be 6’0” and my tall-as-an-oak spotter is never going to be naturally closer to 5’0” than me. Why shouldn’t our systems be different and optimized for who we are and where we’re going?

Race to Minimize Iteration Time

Regardless of how much we grumble about the difficulties of supporting ML in production, the reality is it’s never been faster or easier to deploy really powerful (and at times scary powerful) applications or pipelines.

‍

During my time as a Data Scientist at Autodesk, for example, Andrew Ng had just announced a new company called DeepLearning.ai. A short time later, the famous “Obama Deepfake” video from Jordan Peele and BuzzFeed was released. That same year, DeepLearning.ai would release a new five-course series consolidating advancements in CNNs, RNNs, and LSTMs with practical applications of deep learning models in machine learning pipelines. At Autodesk’s 2017 annual customer and product showcase (Autodesk University 2017) the big vision was around enabling generative design, AR/VR, robotics, and additive manufacturing for customers.

Five years later, so many of the research initiatives I saw listed under the office of the CTO at Autodesk are now a reality. In the past couple of months alone we’ve seen the release of powerful text-to-image generation models as well as Facebook’s Make-A-Video model.

And competition has only increased over the years, as data scientists and machine learning engineers have gone from the research group eccentrics huddling in the corner to essential drivers of profit maximization. Between 2013 and 2019, for example, job postings for Data Scientists on Indeed increased by 256%.

‍

The widespread adoption and democratization of access to data science and machine learning tools (as well as domain knowledge) heralds a new era where, in the competitive marketplace of ideas, the winners will be the ideas that go to market the fastest and most reliably.

Some reasonable questions to ask are: Which teams and organizations are going to deliver the fastest? Who should MLOps open-source projects be focused on?

‍

Data Scientists As the Primary Consumers of MLOps Tooling

The most successful projects will focus on enabling data scientists as the primary users and owners of tools to enable production machine learning.

The current landscape of MLOps tools leaves much to be desired. Not because the tools aren’t powerful in solving specific pain-points, but because they assume a high level of infrastructure and software development experience to operate.

Data scientists come from a wide variety of backgrounds and are often tasked with responsibilities that run the gamut from interfacing with product and legal, to developing POCs, model training and evaluation, pipeline productionisation, and even deployment in many cases.

Yet despite the wide scope of their responsibilities and the diversity of skills they bring to the table, they are often treated antagonistically, as burdens on platform teams or drains on company resources.

“Data scientists have the most diverse fields-of-study, while software engineers have the least diverse educational backgrounds.” – Indeed

Some maladaptive practices I’ve observed include:

Placing all the burden of model productionisation and deployment on “execution or implementation” teams – an ML or MLOps Engineer is embedded in a data science squad & is responsible for writing all the tests, refactoring feature engineering and model training code, implementing logging & monitoring, etc. – these individuals eventually burn out, especially as teams hire more data scientists;

Sharing responsibilities between data science teams & MLOps team but only grudgingly supporting skill development for the data science teams, thereby eventually converging on the prior pattern;

Or defaulting to hiring only candidates that have a strong software engineering background. This option is available only to companies that have significant resources, and it can eventually lead to a dearth of innovation, especially if the SWEs don’t care for the research and exploration aspects of producing machine learning products.

A possible solution (although not the only one) is to design projects and tools that support strong engineering practices (as well as company initiatives around data governance and access) and are inclusive of users of all types of skills and experiences. Instead of seeing your data scientists as bottlenecks and enemies of best practices, empower your users with the tools and knowledge needed to produce quality machine learning applications and pipelines.

‍

In other words: Make it easy for people to do the right thing.

‍

Featureform: An Essential Component of the MLOps Ecosystem

I’m excited for the opportunity to partner with Featureform to address the opportunities I talked about earlier because:
‍

Feature stores are currently (& are going to be) a critical component of robust & transparent systems

When I first started working on production ML and then MLOps I didn’t have a very positive opinion on feature stores (or most standalone point solutions that required significant lift to implement) for many of the reasons Simba talks about in his blog post on the different varieties of feature stores.

‍

Feature stores seemed like an unnecessarily complicated component of an MLOps platform, especially given the preponderance of powerful data warehouses already available.

‍

However, as a data scientist and MLOps Engineer, I’ve experienced all the following pain-points in developing and deploying machine learning projects:

Starting a project that required querying unfamiliar tables (either because I was new, the project was new, the tables were new, the team was new, or the company was new) and having to track down any and all documentation and metadata related to calculations and features. If no information was to be found beyond pestering adjacent teams and the Confluence space was light on meaningful details, I’d resort to crossing my fingers and using the features, hoping my attempts at due diligence would stave off unexpected calamities.

Trying to collaborate and share features or transformations with other data scientists and having to check in code, data, and models while still not having any means of documenting the key business logic that motivated the features.

Most importantly, remembering & sharing features with my future self.

The horror, the absolute horror! It's like the glitter of the data science world.

‍

When a feature store is done right, it can:

Speed up model iteration cycles by making it easier and faster to find existing features that others have created to use in new models and by ensuring Past You didn’t place the logic or featurized data in an ad-hoc notebook or Python script;

Increase model reliability by cutting out duplicate code & enabling visibility into feature metadata such as trending throughput and failure rates, cleanly defining the business logic and tables being used (in case there are upstream changes), & by facilitating organization of the features for easy drop-in when developing PoC models;

Make collaboration delightful, especially with other data scientists, ML engineers, & even analysts, because you now have a place you can point them to in order to check out or contribute features, as well as documentation about each feature for self-service;

Preserve compliance, the most challenging of activities at times, by leveraging data access policies as well as automatically recording data & table lineage of features in case IT or the data engineering team need to limit access to tables & features.
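To make the list above a bit more concrete, here is a minimal sketch of the idea in plain Python (3.9+). It is a toy in-memory registry, not Featureform's API or any real feature store: `FeatureDef`, `FeatureRegistry`, and every field and feature name are hypothetical and exist only for illustration. It shows two of the benefits described above: finding features that already exist, and using recorded lineage to see which features an upstream table change would touch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDef:
    """Metadata for one feature. All field names here are illustrative."""
    name: str
    owner: str
    description: str
    source_tables: tuple[str, ...]  # lineage: tables the feature reads from
    logic: str                      # business logic, documented next to the feature


class FeatureRegistry:
    """Hypothetical in-memory registry; a stand-in for a real feature store."""

    def __init__(self) -> None:
        self._features: dict[str, FeatureDef] = {}

    def register(self, feature: FeatureDef) -> None:
        self._features[feature.name] = feature

    def search(self, keyword: str) -> list[str]:
        """Find existing features by name or description before writing new ones."""
        kw = keyword.lower()
        return sorted(
            f.name for f in self._features.values()
            if kw in f.name.lower() or kw in f.description.lower()
        )

    def affected_by(self, table: str) -> list[str]:
        """List features that read from a table, e.g. ahead of an upstream change."""
        return sorted(
            f.name for f in self._features.values() if table in f.source_tables
        )


registry = FeatureRegistry()
registry.register(FeatureDef(
    name="avg_order_value_30d",
    owner="mikiko",
    description="Average order value per customer over the trailing 30 days",
    source_tables=("orders", "customers"),
    logic="SUM(order_total) / COUNT(order_id) over the trailing 30 days",
))
registry.register(FeatureDef(
    name="days_since_signup",
    owner="mikiko",
    description="Days elapsed since the account was created",
    source_tables=("customers",),
    logic="CURRENT_DATE - signup_date",
))

print(registry.search("order"))           # ['avg_order_value_30d']
print(registry.affected_by("customers"))  # ['avg_order_value_30d', 'days_since_signup']
```

Even in this toy form, Future You gets one place to look up what a feature means and where it came from, instead of digging through an ad-hoc notebook.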

‍

Less Tools, More Abstractions

"Cause I knew you were trouble when you walked in" – T. Swift

The right abstractions are crucial to actualizing creative and technical potential.

As an experienced practitioner and mentor I’ve been blessed with the opportunity to interact with people from a wide variety of backgrounds and environments, representing a broad range of concerns. And in the past couple years the lack of tooling has only come up with regards to the 1% of companies doing incredibly specialized, cutting-edge research and development.

‍

In fact I’ve heard and seen quite the opposite: that we have too many options for teams and organizations to feel confident in choosing a tool. As we’ve seen with the well-known jam study, more options aren’t necessarily better – in fact, when individuals are flooded with options, they are not only less likely to switch to a new option but are also more dissatisfied with their choice than if they’d been presented with a more curated selection.

So how does Featureform help with, rather than add to, the chaos?

Core focus on helping data scientists deploy models, not struggle with building platforms: The virtual feature paradigm enables data scientists and data science teams to take advantage of all the benefits of a feature store with the same ease as leveraging a Scikit-learn or PyTorch object. Featureform can be used locally (if a data scientist prefers to experiment and develop models from their trusty IDE or loves Docker) or in the cloud through GCP, AWS, and Azure. And with some simple YAML-based configuration, Featureform will manage the metadata of all features, from their names, versions, descriptions, owners, providers, transformation logic and more. Featureform also maintains a history of changes, tracks different variants of features, and enforces immutability.

Help your data scientists do the right things without thinking: Let’s start with positive intent and assume users need a bit of help in supporting governance and compliance. Featureform does this by baking best practices into the toolchain without changing a data scientist’s typical workflows, with built-in role-based access control, audit logs, and dynamic serving rules. Compliance logic can be enforced directly by Featureform so that rather than fighting with IT and DevOps teams about standing up a physical feature store, data scientists can now supercharge their workflows, collaborate and share transformations without constantly duplicating work, and support governance and security best practices with a virtual feature store.

Plug & play the compute & storage: Rather than us dictating the MLOps stack for you (although I can’t say I’m not opinionated), you can instead mix and match tools in and out of the open-source ecosystem as well as your favorite public and private cloud offerings. Use abstractions like the virtual feature store in combination with best-of-breed tools to create a toolchain capable of growing and maturing as part of your journey. Featureform can also be used when templating pipelines or projects for your data science team.

Open-source: A key and attractive aspect of Featureform is that it’s open-source. Not only is Featureform FOSS, but the team at Featureform is just as excited to contribute back to the open-source MLOps ecosystem as they are to grow their own project.
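For readers who think better in code, the "virtual" and "plug & play" ideas above can be sketched in a few lines of Python (3.9+). To be clear, this is a conceptual toy under my own assumptions and not Featureform's actual API: `Provider`, `InMemoryProvider`, `VirtualFeatureStore`, and all of their methods are hypothetical names. The point is only to show a thin layer that keeps metadata and immutable variants while delegating storage to whatever backend you plug in.

```python
from typing import Protocol


class Provider(Protocol):
    """Anything that can store and return feature values (hypothetical interface)."""

    def write(self, table: str, rows: dict[str, float]) -> None: ...

    def read(self, table: str, entity: str) -> float: ...


class InMemoryProvider:
    """Stand-in for a real warehouse or key-value store."""

    def __init__(self) -> None:
        self._tables: dict[str, dict[str, float]] = {}

    def write(self, table: str, rows: dict[str, float]) -> None:
        self._tables[table] = dict(rows)

    def read(self, table: str, entity: str) -> float:
        return self._tables[table][entity]


class VirtualFeatureStore:
    """Keeps metadata and variants itself; storage lives in the provider."""

    def __init__(self, provider: Provider) -> None:
        self._provider = provider
        self._descriptions: dict[tuple[str, str], str] = {}

    def register(self, name: str, variant: str, description: str,
                 rows: dict[str, float]) -> None:
        key = (name, variant)
        if key in self._descriptions:
            # Immutability: an existing variant is never overwritten in place.
            raise ValueError(f"{name}:{variant} already exists; register a new variant")
        self._descriptions[key] = description
        self._provider.write(f"{name}:{variant}", rows)

    def variants(self, name: str) -> list[str]:
        return sorted(v for (n, v) in self._descriptions if n == name)

    def serve(self, name: str, variant: str, entity: str) -> float:
        return self._provider.read(f"{name}:{variant}", entity)


store = VirtualFeatureStore(InMemoryProvider())
store.register("avg_order_value", "30d", "Trailing 30-day average", {"customer_1": 42.5})
store.register("avg_order_value", "90d", "Trailing 90-day average", {"customer_1": 40.0})

print(store.variants("avg_order_value"))                   # ['30d', '90d']
print(store.serve("avg_order_value", "30d", "customer_1"))  # 42.5
```

Swapping `InMemoryProvider` for a class that wraps your warehouse of choice would leave the calling code untouched, which is the appeal of putting the abstraction in front of the storage rather than replacing it.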

‍

People at the Heart of Everything

While a good portion of this post has been about my Very Big Ideas™ about the future of MLOps and my waxing lyrical about the potential of Featureform, the heart of every decision I make comes down to a crucial element:

People.
‍
What struck me the first time I met Simba and Shab was how grounded and fun the conversations were and how many values we shared, not just as active members of the MLOps community but also as individuals.

Much like Shab talked about in “Why I Joined Featureform”, my goal at every point in time has been to do my best work and to constantly challenge myself with scary growth opportunities.

‍
And just as Simba talks about in “Lessons Learned: From Google to Building a Profitable Startup”, I’ve never been content with anything less than full ownership and autonomy of my work.

‍

Although I wasn’t actively looking for a new opportunity and had both support and intention to focus on the next promotion into management at Mailchimp Intuit, the timing could not have been more ideal.

‍
I can’t wait to join Simba, Shab and the rest of the Featureform team in building out an integral component of MLOps stacks for the rest of us and giving back to the OSS ecosystem.
