Bob, a data scientist, is building a recommendation model for his e-commerce company. The model aims to recommend the most relevant content to each user, improving retention and engagement and, ultimately, profit margins.
So Bob opens up a Jupyter notebook on his laptop and begins to clean the user datasets, construct relevant features, and build the model. Along the way, he installs a bunch of dependencies, tests multiple sets of features, and tries out a couple of different models.
When Bob is satisfied with the model's performance, he hands the model off to his team of engineers for deployment into production with live traffic and data. The model performs very well initially, but over time the team sees its performance decline, and it starts to give inconsistent recommendations to potential customers. So Bob opens up his laptop, cleans a new batch of data to retrain his model, and repeats the entire process.
This cycle is manual and script-driven, with the model as the focal point of the ML delivery effort. Transitioning away from a model-centric workflow to an end-to-end ML pipeline workflow brings many advantages. Pipelines help to automate and structure the workflow, and they make results more reproducible and consistent.
What is a Machine Learning Pipeline?
Machine Learning (ML) pipelines help to automate the ML life cycle, streamline the workflow, and unlock faster iteration of models from development to deployment. Pipelines allow data, models, and experiments to be more easily tracked and monitored. This is especially important when there are many models to maintain and monitor in production, and when collaborating in a team with other developers.
Some advantages of pipelines include:
- Automated and faster iteration of the ML lifecycle
- Better scalability, consistency and reusability
- Easier to manage, maintain and monitor
- Version-controlled source code
- Better cooperation between data scientists and engineers
An ML workflow typically consists of the following core components:
Data preprocessing: real-world data is often noisy, and could contain missing values, outliers, and anomalies. The first step in the ML workflow is usually to clean the raw data, and prepare it into data that’s ready for feature engineering or modelling.
Feature engineering: finding the best set of features or variables from the dataset is crucial for model performance. If the features fed into the model are not relevant indicators of what you’re trying to predict, the model won’t be able to generate accurate predictions or find meaningful patterns. For example, to detect credit card fraud, some important features are the location of the merchant, the transaction type, and the transaction amount.
Model training: the model is fit to the training data so that it can ‘learn’ from it. In supervised learning, for example, training finds a function that maps an input to an output. A simple instance is linear regression,
f(x) = W_0 + W_1 * x
where training means finding the weights W_0 and W_1 that give the best-fit straight line by minimizing the cost (for example, the mean squared error).
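To make the linear regression example concrete, here is a minimal sketch that finds W_0 and W_1 with the closed-form least-squares solution. It assumes the cost is the mean squared error, which the text does not specify, and the function names (fit_line, mean_squared_error) and the toy data are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (w0, w1) minimizing squared error for f(x) = w0 + w1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean; zero means the slope is undefined.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1 = sxy / sxx            # slope
    w0 = y_bar - w1 * x_bar   # intercept
    return w0, w1


def mean_squared_error(xs, ys, w0, w1):
    """Assumed cost function: average squared gap between f(x) and y."""
    return mean((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Toy data generated from y = 1 + 2x, so the fit should recover those weights.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
cost = mean_squared_error(xs, ys, w0, w1)
print(f"W_0={w0:.2f}, W_1={w1:.2f}, cost={cost:.2f}")
```

In practice a library would do the fitting, but the idea is the same: training is a search for the weights that make the cost as small as possible.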
Model evaluation: after the model has finished training, its performance is evaluated. Common metrics and tools include accuracy, precision, recall, the ROC curve, and the confusion matrix.
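As a rough illustration of the evaluation step, the sketch below computes a confusion matrix and derives accuracy, precision, and recall from it for a binary classifier. The labels are invented (think of 1 as "fraud" and 0 as "not fraud"), and the helper names confusion_counts and summarize are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def summarize(counts):
    """Derive accuracy, precision, and recall; return 0.0 when undefined."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 1 = fraud, 0 = not fraud.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summarize(counts)
print(counts)   # 3 true positives, 2 false positives, 1 false negative, 2 true negatives
print(metrics)  # accuracy 0.625, precision 0.6, recall 0.75
```

Which metric matters most depends on the problem: for fraud detection, a missed fraud (low recall) is often costlier than a false alarm (low precision).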
Getting Started with Machine Learning Pipelines
Two of the most popular open source orchestrators right now are:
Kubeflow: A cloud native platform for machine learning workflows - pipelines, training and deployment on Kubernetes.
Airflow: A platform for architecting and orchestrating data and ML pipelines.
Both Kubeflow and Airflow pipelines are great ways to construct portable and scalable ML workflows.
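To give a feel for what an orchestrator does, here is a toy sketch that declares the four workflow components as tasks with dependencies and runs them in a valid order. It uses only the Python standard library (graphlib, Python 3.9+) and is not the Kubeflow or Airflow API. The task names, the run_pipeline helper, and the data are all hypothetical, and real orchestrators add scheduling, retries, logging, and distributed execution on top of this idea.

```python
from graphlib import TopologicalSorter

# Toy raw data: (x, y) rows, one of which has a missing target.
RAW_ROWS = [(0.0, 1.0), (1.0, 3.0), (2.0, None), (3.0, 7.0)]


def preprocess(upstream):
    """Drop rows with missing values."""
    return [(x, y) for x, y in RAW_ROWS if x is not None and y is not None]


def build_features(upstream):
    """Split cleaned rows into model inputs and targets."""
    rows = upstream["preprocess"]
    return {"xs": [x for x, _ in rows], "ys": [y for _, y in rows]}


def train(upstream):
    """Fit f(x) = w0 + w1 * x by least squares."""
    xs, ys = upstream["build_features"]["xs"], upstream["build_features"]["ys"]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return {"w0": y_bar - w1 * x_bar, "w1": w1}


def evaluate(upstream):
    """Mean squared error of the trained model on the same toy data."""
    data, model = upstream["build_features"], upstream["train"]
    errors = [(y - (model["w0"] + model["w1"] * x)) ** 2
              for x, y in zip(data["xs"], data["ys"])]
    return sum(errors) / len(errors)


TASKS = {"preprocess": preprocess, "build_features": build_features,
         "train": train, "evaluate": evaluate}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "preprocess": set(),
    "build_features": {"preprocess"},
    "train": {"build_features"},
    "evaluate": {"build_features", "train"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, after its dependencies, passing their outputs in."""
    results, order = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return order, results


order, results = run_pipeline(TASKS, DEPENDENCIES)
print(" -> ".join(order))
```

Declaring the steps and their dependencies separately from the code that runs them is what lets an orchestrator rerun the whole workflow on a new batch of data without anyone repeating the steps by hand.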