As enterprises begin to embed machine learning models into heavy-duty production systems, they face many hurdles. In particular, MLOps lacks enterprise-grade feature stores to store, search, reuse, and collaborate on ML features. This article explores that problem in detail and looks at one such solution offered by Tecton.ai.
Compute and Storage Models Have Changed
Until about 15 years ago, we had a software development and deployment problem: releasing a new software version, collecting feedback, identifying new features, and gathering changed customer requirements were all too slow. It took a long time to close the feedback loop and develop new versions – sometimes months or even longer. The still-evolving DevOps and agile methodologies solved those problems: developers could target specific functions and make incremental changes to just those functions, the CI/CD pipeline took care of the rest, and the DevOps methodology made sure the changes went into production smoothly without breaking anything else. Software development and delivery moved from monthly or yearly cycles to weekly, daily, and sometimes even hourly release cycles.
In recent years, a paradigm shift has occurred: enterprises have moved from a code- and compute-based economy to a data-based economy. Data is the most important ingredient of that economy and of Machine Learning (ML) models, and it is also the most difficult part of the whole ML process to get right. Data scientists work hard to unearth value from the onslaught of data, which is arriving at an ever-accelerating pace. While most enterprise tools can work with historical data, they still struggle with streaming data. As I discussed in my previous Forbes article, making decisions based on partial data leads to executives ignoring insights provided by IT teams and not supporting their future projects with appropriate funding.
People often forget that data scientists are experimenters: they experiment with data to unearth business value from it. It is not their job to productionize the model or integrate it into an existing application. Many projects fail because people don't understand this difference; a data scientist's work generally stops when they create a solid working model. Getting that model into production and deploying it to the edge for inferencing is another battle entirely.
In the process of creating a good model, the data scientist (DS) faces a real challenge with the following tasks:
Finding the “Right” Raw Data
Before finding value or trying to solve an issue, data scientists spend most of their time searching for the right raw data. Unearthing, coordinating, and corralling the data together before creating ML models is a monumental task. Most enterprises don't have proper data catalogs in place, and if the right data doesn't exist, it must be requested, created, collected, stored, and logged before it can be used. Even when the data exists, the DS often faces an access problem: getting to the right data and being properly authorized to use it for its intended purpose is another challenge. Even after getting access, they must choose between taking a one-time static data dump for model creation or setting up procedures to pull fresh data as they continue to work on the problem. After all that, they need to spend time understanding the regulations, governance, privacy, and security issues related to that data set before they can act on it. Not all data can be used to create models.
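As a rough sketch of that choice, the snippet below contrasts a one-time static dump with an incremental refresh keyed on a watermark; the table, columns, and values are invented for illustration only.

```python
import sqlite3
from datetime import datetime, timezone

# Toy in-memory "warehouse" so the sketch runs end to end (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount REAL, created_at TEXT)")
conn.execute("INSERT INTO transactions VALUES (1, 42.0, '2023-05-02T10:00:00')")

# Option 1: one-time static dump -- fine for a first experiment, but it goes stale.
static_dump = conn.execute(
    "SELECT user_id, amount, created_at FROM transactions"
).fetchall()

# Option 2: a repeatable incremental refresh keyed on a watermark, so the working
# data set stays fresh while the DS keeps iterating on the problem.
last_refresh = "2023-05-01T00:00:00"
fresh_rows = conn.execute(
    "SELECT user_id, amount, created_at FROM transactions WHERE created_at > ?",
    (last_refresh,),
).fetchall()
last_refresh = datetime.now(timezone.utc).isoformat()  # advance the watermark for the next pull
print(len(static_dump), len(fresh_rows))
```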
Feature Engineering – Extract Right “Features” that are Valuable to the Enterprise
From the raw data, the DS needs to unearth the right data (or features) that can be used in creating a potential model. This training data set should combine data from multiple sources – for example, related information from the data warehouse and data lakes along with real-time data streams that are relevant to the potential model. Then they need to cleanse that data set of potential bias and skew; I wrote a detailed article on this topic recently that can be seen here. If this step is not done properly, you will have a "garbage in, garbage out" situation: the result is either a biased or a skewed model, and neither will serve the purpose. Another important aspect of creating a training data set is taking the appropriate snapshot along a timeline. For some models, the DS needs to take a snapshot at a specific point in time and assemble all of the associated data from that timeline, ignoring events that happened afterward – data from the "future." Common data systems aren't designed to support this kind of time travel, which often leads to "data leakage."
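To make the time-travel requirement concrete, here is a minimal, hypothetical sketch of a point-in-time-correct join using pandas; the table names, columns, and values are invented for illustration.

```python
import pandas as pd

# Labels: one row per prediction event, with the time the prediction would have been made.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_time": pd.to_datetime(["2023-05-01", "2023-05-03"]),
    "churned": [0, 1],
})

# Feature values computed at various times (e.g., rolling 30-day spend).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2023-04-15", "2023-05-02", "2023-04-20"]),
    "spend_30d": [120.0, 250.0, 80.0],
})

# merge_asof picks, for each label, the most recent feature value at or before
# label_time, so no information from the "future" leaks into the training set.
train = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("feature_time"),
    left_on="label_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(train[["user_id", "label_time", "spend_30d", "churned"]])
```

A feature store typically automates exactly this kind of point-in-time join so the DS doesn't have to hand-roll it for every training set.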
And while trying to get the models into production, enterprises face the following issues:
Feature Serving in Production
After the models are created, serving data/features to those models is an important task. If a model in production is served a different data set or set of features than what it was trained on, it will predict erroneously. This is called training/serving skew, and it can come from differences in data transformation logic or from timing differences. In such an event, the model's predictions are completely useless.
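One common way to avoid transformation-logic skew is to define each feature's transformation exactly once and call it from both the training pipeline and the serving path. Below is a minimal sketch of that idea; the function and field names are invented for illustration.

```python
from datetime import datetime, timezone

def days_since_last_purchase(last_purchase_at: datetime, as_of: datetime) -> float:
    """Single source of truth for this feature's transformation logic."""
    return (as_of - last_purchase_at).total_seconds() / 86400.0

# Offline: applied to historical rows when building the training set.
training_value = days_since_last_purchase(
    datetime(2023, 4, 1, tzinfo=timezone.utc),
    datetime(2023, 5, 1, tzinfo=timezone.utc),
)

# Online: the exact same function is called at request time before inference,
# instead of re-implementing the logic inside the serving service.
serving_value = days_since_last_purchase(
    datetime(2023, 4, 1, tzinfo=timezone.utc),
    datetime.now(timezone.utc),
)
print(training_value, serving_value)
```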
Feature/Data Completeness in Production
Even after data and features have been streamlined and the ML models have moved from experimentation to production, the models can still break. The most common cause is "data breakage": an upstream data source stops or delays supplying data, or supplies the wrong data set because of an issue. The features that were used to train the model can also drift, pulling the model away from its original accuracy and requiring a retrain. Another major issue is the quality of the data itself; it can skew over time and produce erroneous results.
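As a rough illustration of what such monitoring can look like, here is a hypothetical drift check that compares a live feature's distribution against its training distribution using a Population Stability Index; the data, bin count, and alert threshold are illustrative only.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Higher PSI means the live distribution has moved away from training."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch live values outside the training range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) / division by zero for empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
training_feature = rng.normal(100, 10, 10_000)  # distribution seen at training time
live_feature = rng.normal(110, 15, 10_000)      # distribution arriving in production

score = psi(training_feature, live_feature)
if score > 0.2:  # a commonly cited, but ultimately arbitrary, alerting threshold
    print(f"Feature drift detected, PSI={score:.2f}: consider retraining")
```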
After all that work, if another DS tries to create a similar model for a related project, there is no centralized "feature store" to turn to. Almost all of the work that led to the model creation is lost! The raw-data discovery and feature engineering steps described above need to be repeated. There is a reason why the DS spends almost 80% of their time prepping the data and only about 20% of the time creating the models that deliver real value. To solve this problem, almost every data company – Uber, Lyft, Google, Facebook, Netflix, Twitter, Airbnb, and many others – has spent the last few years building a data platform that addresses this issue; Uber's Michelangelo is an example of such a platform. It helps bring tens of thousands of models into production by serving features at the right time with the right speed and latency.
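To illustrate the reuse a centralized feature store enables, here is a deliberately simplified, hypothetical sketch; FeatureStore and its methods are invented names, not any vendor's actual SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class FeatureStore:
    _registry: Dict[str, Callable] = field(default_factory=dict)

    def register(self, name: str, transform: Callable) -> None:
        """Publish a feature definition so other teams can find and reuse it."""
        self._registry[name] = transform

    def get(self, name: str) -> Callable:
        return self._registry[name]

store = FeatureStore()

# Team A defines and registers a feature for its fraud model.
store.register("total_spend", lambda txns: sum(t["amount"] for t in txns))

# Team B, months later, discovers and reuses the same definition for a churn model
# instead of re-discovering the raw data and re-implementing the logic.
total_spend = store.get("total_spend")
print(total_spend([{"amount": 12.0}, {"amount": 30.5}]))
```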
Assuming the DS got the model right, now comes the real problem: moving ML models into production. The concept of MLOps has gained popularity and maturity over the last few years. It is essentially the DevOps cycle for ML models, and it makes it easier to operationalize them. These platforms generally serve feature updates in near real time with millisecond latency.
I had a briefing from Tecton.ai on this topic recently. Their platform was built by the original creators of Uber's Michelangelo to solve this issue, and it seems to address the very issues discussed in this article: a feature pipeline to handle the data-to-feature transformation, a feature store to store features so they can be searched and discovered, a feature server to serve the latest feature values in production, and a monitoring engine to detect data quality issues and alert on drift.
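Conceptually, the online serving path of such a platform looks something like the hypothetical sketch below; the function names are invented stand-ins, and this is not Tecton's actual API.

```python
from typing import Dict

def get_online_features(entity_id: str) -> Dict[str, float]:
    """Stand-in for a low-latency lookup against a feature server / online store."""
    return {"spend_30d": 250.0, "days_since_last_purchase": 3.0}

def predict(features: Dict[str, float]) -> float:
    """Stand-in for the deployed model."""
    return 0.1 * features["spend_30d"] - 0.5 * features["days_since_last_purchase"]

# At request time, the application fetches the latest feature values for an entity
# and passes them to the model, rather than recomputing features on the spot.
features = get_online_features("user_42")
print(f"prediction for user_42: {predict(features):.2f}")
```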
Conclusion
While productionizing machine learning is still maturing, tools like this help solve a piece of the puzzle, letting enterprises produce ML models more efficiently and streamline their ML production pipeline. In particular, tools such as feature stores can help decompose the monolithic ML pipeline into feature pipelines. By identifying, modifying, and improving individual features, the models can be changed easily instead of being rebuilt from scratch every time.