Machine Learning no longer inhabits solely the world of academia and research; it is now actively applied across a wide range of industries and professions, impacting real-world applications and delivering business value on a daily basis. We are seeing enterprises in the financial, medical, telecommunications, and retail sectors leverage ML to drive faster and better decision making. Additionally, there’s a shift in these industries “from technology R&D (research and development) to operations”1. This means that managing the operational side of ML, with tasks such as model deployment, production monitoring, and continuous integration, is just as crucial as model building. Whoever best manages the operational side of ML can best reap its true benefits.
For many organizations, a common barrier to delivering business value from machine learning is the difficulty of transitioning ML models from the experimentation and development stage into production environments. This is where the nascent field of ML Operations has found its niche. This article will provide a high-level introduction to ML Ops and explain its role in the modern data ecosystem. We will also briefly discuss how good ML Ops practices can play a key role in the growing field of privacy and security in ML.
Whether you are a professional considering building a career in the big data and ML ecosystem, or an organization looking to assemble an elite data team, we believe that ML Ops is something you should understand and pay attention to.
What is ML Ops?
According to the 2020 State of AI Report, “25% of the top-20 fastest growing GitHub projects in Q2 2020 concern ML infrastructure, tooling and operations”2. The term “MLOps” has also been steadily rising in Google Search traffic:
So what is MLOps? Due to the evolving nature of the domain, a precise definition doesn’t exist, but ML Ops can roughly be considered DevOps for production-grade Machine Learning. In other words, ML Ops specialists are responsible for all the infrastructure, tooling, pipelining, and automation needed to support an end-to-end Machine Learning lifecycle.
Some of the challenges include:
- Helping Data Scientists and ML Engineers move models from the research and development phase to a scalable, stable, and production-ready architecture
- Scheduling and orchestrating the process of building, deploying, and fine-tuning Machine Learning models in a production environment
- Managing the complexity of ML lifecycles with regard to memory, storage, latency, and bandwidth requirements over distributed systems
- Ensuring reproducibility of ML Models
- Managing versioning and metadata for ML models and experiments
- Monitoring real-time model performance in a production environment
- Scaling ML pipelines over a distributed system to support exponential growth in data volume, prediction requests, etc.
- Identifying and mitigating cybersecurity risks in the form of data leaks, data corruption, model poisoning, etc.
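To make the reproducibility and versioning challenges above concrete, here is a minimal, hypothetical sketch (all names are our own invention, not from any particular framework) of fingerprinting a training run: a content hash of the training data combined with a hash of the hyperparameters yields a deterministic run ID, so any change to either is immediately detectable.

```python
import hashlib
import json

def run_fingerprint(hyperparams: dict, data_bytes: bytes) -> str:
    """Derive a deterministic run ID from hyperparameters and training data.

    If either the data or the configuration changes, the ID changes,
    which makes it easy to detect when a model is no longer reproducible.
    """
    param_hash = hashlib.sha256(
        json.dumps(hyperparams, sort_keys=True).encode()
    ).hexdigest()
    data_hash = hashlib.sha256(data_bytes).hexdigest()
    return hashlib.sha256((param_hash + data_hash).encode()).hexdigest()[:16]

# Identical inputs always yield the same run ID ...
a = run_fingerprint({"lr": 0.01, "epochs": 10}, b"training-data-v1")
b = run_fingerprint({"epochs": 10, "lr": 0.01}, b"training-data-v1")
assert a == b  # key order does not matter thanks to sort_keys

# ... while any change to the data or the config yields a new one.
c = run_fingerprint({"lr": 0.01, "epochs": 10}, b"training-data-v2")
assert a != c
```

In practice this metadata would be stored alongside the model artifact by an experiment-tracking system; the point of the sketch is only that reproducibility becomes checkable once data and configuration are versioned together.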
Who Needs ML Ops?
There are two situations where having ML Ops expertise embedded in a data team is almost inevitable:
- When scale matters. If the volume of data being processed by a Machine Learning pipeline is too large to be managed by a single server, ML Ops expertise becomes critical in ensuring stable and reliable operations over a distributed system. This includes use-cases such as distributed training, distributed data preparation, serving high-volumes of predictions via multiple model replicas, etc.
- When Models are embedded in ‘Production’ Software. Production here refers to any software that is consumer facing and/or has real-time constraints. In other words, if the outputs of an ML Model directly affect the state of production software with no human intervention, it can be of critical importance to have ML Ops expertise to ensure fault tolerance and reliability. A good example of such a use case is a recommender engine that personalizes and updates product offerings directly for consumers.
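As a hedged illustration of what monitoring real-time model performance can look like, the sketch below (class name, window size, and threshold are all illustrative choices, not an established method) compares the rolling mean of recent model outputs against a training-time baseline and flags drift when they diverge:

```python
from collections import deque
from statistics import mean

class DriftMonitor:
    """Flag when recent predictions drift away from a training baseline.

    A deliberately simple heuristic: compare the rolling mean of the
    last `window` predictions to the mean observed at training time.
    """

    def __init__(self, baseline_mean: float, window: int = 100,
                 tolerance: float = 0.2):
        self.baseline = baseline_mean
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def observe(self, prediction: float) -> bool:
        """Record a prediction; return True once drift is detected."""
        self.recent.append(prediction)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough observations yet
        return abs(mean(self.recent) - self.baseline) > self.tolerance

monitor = DriftMonitor(baseline_mean=0.5, window=10, tolerance=0.2)
for p in [0.5] * 10:
    assert not monitor.observe(p)   # predictions match the baseline
for p in [0.9] * 10:
    drifted = monitor.observe(p)
assert drifted  # rolling mean has moved to 0.9, past the tolerance
```

Production monitoring systems track far more than a single mean (distributional drift, latency, error rates), but even this toy version shows why monitoring has to run continuously rather than at deployment time only.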
We often see organizations hiring a combination of pure DevOps professionals, backend developers, and data scientists to collectively manage these challenges. While this works well for some organizations, the lack of specialized domain expertise can lead to a lot of inefficiency. The most common problem with this patchwork approach is teams developing solutions from scratch, or using tools and frameworks specialized in a different domain, to solve ML Ops problems. For example, many companies still manage entire ML lifecycles using internally developed orchestration frameworks and APIs, while most of this work can now be abstracted away by using industry-standard frameworks like Kubeflow and MLflow.
There is a case to be made that the ML Ops role shares a lot of overlap with ML Engineering and Data Engineering. While this is indeed true, it is helpful to identify ML Ops as a unique specialization as the complexity of ML projects scales. Data Engineers are experts at manipulating big data, and ML Engineers are experts at building high-quality models. While both can sometimes handle ML Ops, if they are spending the majority of their time solving ML Ops problems, your organizational needs have likely scaled to the point where you should hire ML Ops specialists and free up your other professionals to do what they are best at.
ML Ops for Privacy and Security
As we discussed in our recent podcast with Omer Khan, the big data ecosystem has created brand new cybersecurity risks such as model poisoning attacks, data corruption, etc. Traditional cybersecurity experts specialize in handling IT- and network-level risks, and are often unfamiliar with the idea of data itself being an attack surface. ML Ops experts are uniquely positioned to understand and manage both the traditional cybersecurity risks of infrastructure and networks, and the unique risks of big data machine learning.
Moreover, when working with ML pipelines, the risk of data leaks and privacy breaches is often neglected. ML Ops specialists can also play a key role in minimizing privacy liabilities by:
- Implementing secure identity access management
- Using encryption and anonymization wherever applicable
- Enabling other more specialized privacy preserving techniques such as federated learning3.
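As one small example of the anonymization point above, consumer identifiers can be pseudonymized with a keyed hash before they ever enter an ML pipeline. This is only a sketch: the function name is ours, and key management is assumed to live elsewhere (e.g. in a secrets manager), never in source code.

```python
import hmac
import hashlib

def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a raw identifier with a keyed, irreversible pseudonym.

    HMAC-SHA256 keeps the mapping stable (the same user always maps to
    the same token, so joins and aggregations still work) while making
    it infeasible to recover the original ID without the secret key.
    """
    return hmac.new(secret_key, user_id.encode(), hashlib.sha256).hexdigest()

# Illustrative key only; real keys come from a secrets manager.
KEY = b"example-key-do-not-use-in-production"

token_a = pseudonymize("alice@example.com", KEY)
token_b = pseudonymize("alice@example.com", KEY)
assert token_a == token_b                               # stable join key
assert token_a != pseudonymize("bob@example.com", KEY)  # distinct users differ
```

Because the pseudonym is stable, downstream feature engineering and model training proceed unchanged, while the raw identifier never leaves the ingestion boundary.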
While all data professionals have unique and important roles to play in the privacy and security domain, ML Ops experts can drive the discussion by outlining what’s possible from an engineering and operations standpoint.
Project Alesia was co-founded by two ML Ops professionals; we are deeply passionate about this domain and its future, and we plan to produce much more content on the subject. We believe that this professional specialization will be of critical importance in the coming decade, and that the next generation of ML Ops experts can play a key role in shaping the privacy and security landscape of the future. Stay tuned for more in-depth content on ML Ops, including technical resources which can help you in your career and add value to your organization’s data initiatives.