Continuing the comparison between Machine Learning and Software Development from the previous article in the series.
- MLOps for Non-ML Engineers 01 - Introduction
- MLOps for Non-ML Engineers 02 - Differences Between ML and SW Dev
- MLOps for Non-ML Engineers 03 - More Differences Between ML and SW Dev
- MLOps for Non-ML Engineers 04 - Unique aspects of an ML Project Execution
The differences between Software Development projects and Machine Learning projects were discussed in the previous articles. Let’s now look at ML Engineering responsibilities and other unique aspects of a Machine Learning project.
Roles in a Machine Learning Project
In a multi-disciplinary team, Data Scientists and ML Engineers contribute through separate sets of responsibilities.
Data Scientists, as the people with knowledge and experience in working with data, can focus on the Machine Learning part of the process. They have a deep understanding of the types of solutions to pursue, and a working knowledge of the Machine Learning offering of the platform they are working on. They will decide the data management, experimentation, and training details of the Model, and contribute to designing the deployment pipeline.
ML Engineers, as engineers coming from other disciplines are often called in this context, are better equipped to handle the DevOps side of the process. They are experienced in software development, CI/CD, Cloud, and operations. They would also have an understanding of the Machine Learning lingo, along with administration knowledge and experience of the platform the team is working on. They will decide the operational aspects of the platform, manage compute resources, develop libraries, pipelines, and tools, and adapt standard software development best practices to the process.
One interesting pattern I have seen is engineers who are Data Scientists by training having to fill the DevOps and software engineering space in a team that is almost fully composed of Data Scientists. This could be by choice, or a result of demand from the team. While this is a good thing for the engineer (switching disciplines is hard, but it can be a great learning tool, and I have utmost respect for people who do it anyway), the project might be losing precious Data Science competency by trying to do all of it “in house”. You can’t expect an engineer to watch a YouTube video and implement good DevOps (and no, Kubernetes does not “solve” DevOps). The project team should have a mix of Data Science and ML Engineering competencies for optimal work.
Machine Learning Platforms
The specific hardware requirements for doing serious Machine Learning work are a limiting factor for adopting bare-metal platforms for development. In addition, most Cloud providers that offer a Machine Learning platform provide good integration with related services such as distributed and durable storage, elastic compute, and data analytics and ETL services.
Amazon SageMaker, GCP Vertex AI, and Azure ML Workspaces provide Data Science oriented development and deployment services that are directly integrated with the providers' other services. They differ in their approaches to data and deployment management, and there are feature gaps between the providers depending on their roadmaps. However, all of them provide a Notebook interface, flexible Notebook compute, Dataset management, (limited) Feature Stores, Model Registries, and Experiment management. That said, the role separation between Data Science and platform administration work can be a bit blurred, especially since all three providers offer Machine Learning Specialty certifications that cover platform administration as well. This can be confusing to anyone without Cloud experience.
At the time of writing this article, some features I wanted to automate in the Azure ML Platform, while writing a set of template Notebooks for the Data Scientists, were not available in the APIs for tools like Terraform to use. This could also be something to consider (albeit a somewhat smaller detail) when you’re first evaluating platforms to base your solution on.
Amazon SageMaker tries to keep Machine Learning work separate from the rest of the AWS services by splitting out the basic functionality needed by a Data Science role from the DevOps functionality. SageMaker has recently been introducing new features to optimise deployment costs and streamline Notebook version control. It also offers SageMaker Studio, which integrates the previously discussed features into an easier experience for training and deploying Machine Learning Models.
Amazon SageMaker also offers good debugging, tracing, and visualisation capabilities during model training, for better visibility into the process. This minimises the guesswork involved in training and provides better insight into, for example, which parts of a Neural Network are effectively active during training.
A different part of the industry is the multi-persona Platform as a Service providers, such as Databricks, Dataiku, and Cloudera, where different roles, filled by different engineers, can work on the same platform with limited overlap required between them. These platforms provide integration support for most public Cloud storage services, and come with varying degrees of vendor lock-in to suit your taste. They do require specialised training to get the maximum benefit out of the platform.
Another option is to deploy a mix of Open Source solutions on the Cloud platform of your choice, using only basic services such as compute, storage, naming, networking, and routing. Tools like Apache Airflow, Kubeflow, and MLflow could be useful here; however, I have not worked with them in conjunction enough to offer an opinion.
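To give a feel for what tools in this space do, here is a toy, file-based experiment tracker sketching the log-params/log-metrics pattern that trackers like MLflow provide. The class and method names are hypothetical illustrations, not any real library's API:

```python
import json
import tempfile
import time
import uuid
from pathlib import Path

class ToyTracker:
    """A file-based toy tracker illustrating the pattern experiment
    trackers follow: record a run's parameters once, then append
    metrics as training progresses. Not a real MLflow API."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def start_run(self, params):
        # Each run gets an id, a start timestamp, and its hyperparameters.
        run = {"id": uuid.uuid4().hex[:8], "start": time.time(),
               "params": params, "metrics": {}}
        self._save(run)
        return run

    def log_metric(self, run, name, value):
        # Metrics are appended per step, then the run is persisted again.
        run["metrics"].setdefault(name, []).append(value)
        self._save(run)

    def _save(self, run):
        (self.root / f"{run['id']}.json").write_text(json.dumps(run))

# Usage: track a toy training loop.
tracker = ToyTracker(tempfile.mkdtemp())
run = tracker.start_run({"lr": 0.01, "epochs": 3})
for epoch in range(3):
    tracker.log_metric(run, "loss", 1.0 / (epoch + 1))
```

The real tools add the parts that matter at team scale on top of this pattern: a shared backend store, a UI to compare runs, and artifact storage for the trained Models themselves.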
Aspects like auto scaling and resource management are still part of the process. Model deployment tends to gravitate towards Container based services or serverless offerings on public Cloud platforms, and custom images may need to be built for the training and deployment processes. These concerns are not so different from a software development project.
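As a sketch of what such a container or serverless deployment typically wraps, here is a minimal inference entry point in the common event-handler shape. The linear "model", the handler name, and the request fields are all illustrative assumptions, not any vendor's actual API:

```python
import json

# Stand-in for a trained model artifact. In a real image this would be
# loaded at startup from a model registry or a file baked into the image.
MODEL_COEF = {"intercept": 0.5, "slope": 2.0}

def predict_handler(event):
    """Typical shape of a serverless/containerised inference endpoint:
    parse the request body, run the model, return a JSON response."""
    body = json.loads(event["body"])
    x = float(body["x"])
    prediction = MODEL_COEF["intercept"] + MODEL_COEF["slope"] * x
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

# Usage: simulate a request as the platform would deliver it.
response = predict_handler({"body": json.dumps({"x": 2.0})})
```

The operational work around this handler, building the image, attaching GPU-backed compute where needed, and scaling replicas with traffic, is exactly the part that looks like ordinary software operations.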
DevOps vs MLOps?
Most engineers who are new to Machine Learning and interested in the operational side of things try to make sense of new concepts in terms of ones they already know. This is why they would probably start by searching for the phrase “devops vs mlops”. Most results for this term end up being generic articles that say something along the lines of “DevOps is about development, and MLOps is about Machine Learning”. In my opinion, this is a simplistic reduction of both concepts.
DevOps is not only about pipelines or pre-commit configs. It’s about an organisation-wide culture that emphasises developer-operations collaboration. Contrary to popular definitions, using Terraform doesn’t “make you DevOps”, unless you drive (in most cases) a paradigm shift in the way different teams work together towards common business goals, unless you consider failure normal and aim for optimal error rates, unless your culture is oriented towards accepting risk rather than avoiding it, unless there’s close communication between the development and operations teams on incident response, and unless your product and service managers have stopped asking for “five nines availability” for products that have no business being that available. Implementing DevOps in an organisation can be a business-process-impacting change involving CoEs, consultations, and changed reporting lines.
It’s common for organisations to pull their existing operations engineers into one team and call them DevOps or SRE, without a single top-down change management process to increase collaboration or to champion DevOps practices. Even worse, some have separate teams named SRE and DevOps in the same company, working on the same product.
MLOps, on the other hand, is a specific pipeline implementation that aims to increase the Model development, training, and deployment cadence with project- or team-specific tools. These tools include Model Registries, Feature and Dataset stores, Notebook servers and development compute resources, pipelines with runners that have special hardware attached, and monitoring set up to trigger Continuous Training and reporting for the Data Scientists. It’s rarely about non-technical processes, and has no impact on organisational culture (except maybe for the way data is collected and treated from different sources like sales funnels and CRM solutions).
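To make the "monitoring triggers Continuous Training" part concrete, here is a deliberately naive drift check, assuming a simple z-score test on one feature's mean. Real drift detectors, features, and thresholds are platform- and project-specific; this is only a sketch of the decision the monitoring makes:

```python
import statistics

def should_retrain(baseline, recent, z_threshold=3.0):
    """Flag a retraining run when the mean of recently observed feature
    values drifts more than z_threshold standard errors away from the
    training-time baseline. A toy stand-in for real drift detection."""
    if len(baseline) < 2 or len(recent) < 1:
        return False
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # Constant baseline: any change in the mean counts as drift.
        return statistics.mean(recent) != statistics.mean(baseline)
    standard_error = stdev / (len(recent) ** 0.5)
    z = abs(statistics.mean(recent) - statistics.mean(baseline)) / standard_error
    return z > z_threshold

# Usage: stable traffic should not trigger; drifted traffic should.
baseline = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 10.0]
stable = [10.1, 10.0, 10.2, 9.9]
drifted = [13.0, 13.5, 12.8, 13.2]
```

In an MLOps pipeline, a check like this would run on a schedule against production data, and a `True` result would kick off the training pipeline and notify the Data Scientists, rather than return a boolean to a caller.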
The impact of implementing MLOps is local to your data engineering and data science teams, and it needs no more top-down support than what the ML project is already receiving. It’s no different from choosing Terraform based automation for environment management instead of ClickOps in a software development project. Other than the initial business decision to commit to a platform service provider, there is no major business level decision to make. On most of the platforms mentioned above, MLOps requires nothing extra beyond sticking to the vendor-provided best practices and features. Organisations need more support and deliberation time to decide whether to invest in AI/ML projects at all than to decide which MLOps implementation to use.
It may sound like I’m reiterating the same point, but I have seen again and again how people at all levels of the industry reduce the meaning of DevOps to CI/CD pipelines and then compare MLOps and DevOps as two pipeline implementations. At the risk of sounding a bit too pedantic, I should emphasise the apples-to-oranges nature of such comparisons.
Project Cadence
Unlike software development iterations, which can be fairly short (two-week Sprints are common in software development projects), Machine Learning iterations can take a long time, especially in the Experimentation phase. One problem we had to tackle in the project I took part in was the difference in cadence between the software development tasks and the Machine Learning tasks in the same Sprint.
Phases like Experimentation can take a long time to complete with no significant “update” to report back to project management other than “that experiment was a failure”. To project managers who are not familiar with how Machine Learning experimentation works, this can look like no progress at all, with the same task sitting in the “In Progress” column for weeks, and sometimes months, while real progress is being made that simply lacks reportable milestones. Understanding this is important both to accommodate every engineer in the team and to track the project's progression accurately.
Licensing Machine Learning Models
Copyright management in Software Development is a complex topic in itself. The age-old copyright vs copyleft discussion is front and centre (although more permissive and business-friendly Open Source licensing, like Apache 2.0 and MIT, seems to have won the day for the moment, even with the source-available trend that companies like MongoDB started), with some companies backing Open Source communities with funds and technical contributions as a means of product development.
Although the same could apply to the source code that trains the Model, licensing the Model itself in Machine Learning can be tricky. Binaries or package distributions from software development projects can be licensed relatively easily with the source licensing, upstream licensing, and business goals in mind.
For Machine Learning Models, this becomes more complex once you consider the ownership of the data, the algorithm, and the library code used for the various steps between data collection and Model training. Additionally, given how powerful some Models can become, states will enforce restrictions on the potential users of a license. We have already seen restrictions on sales of the specialised GPUs used for Model training, introduced by the USA-China tech race, and I’m sure we will soon see deep discussions on which parts of a Model and its training can truly be Open Source and which parts will remain proprietary, while adhering to community oriented movements like Open Source and Free Software. Of course, this doesn’t cover the API based proprietary model that most AI companies seem to follow at the moment.
There are various other details that set Machine Learning projects apart from the rest of the software engineering practice. However, these only become apparent at a certain level of detail. Furthermore, some areas, like data engineering and big data storage optimisation, are not ones I can comfortably discuss without more significant experience.