Lighthouse, Galle, Sri Lanka

How is Machine Learning Different?

Continuing the comparison between Machine Learning and Software Development from the previous article in the series.

  1. MLOps for Non-ML Engineers 01 - Introduction
  2. MLOps for Non-ML Engineers 02 - Differences Between ML and SW Dev
  3. MLOps for Non-ML Engineers 03 - More Differences Between ML and SW Dev
  4. MLOps for Non-ML Engineers 04 - Unique aspects of an ML Project Execution

Understanding the solution

The approach to solving a software engineering problem is usually shaped over several iterations of design and architecture. As the design becomes more detailed, the likelihood grows that details will change based on new findings. These changes usually have a low switching cost and do not significantly impact the end product or the development pipeline. As important as an initial understanding of the key requirements is, approaches like the Agile Methodology avoid committing to a grand detailed design at the start of the project and instead develop the solution and the detailed design in iterations.

In Machine Learning, the “development” phase is known as Experimenting.

Experimenting is the phase where the correct approach for the solution has to be figured out. As in the EDA phase (refer to the previous article), having a good understanding of the problem is important here as well. Different algorithms for the particular class of problem can be tried out here; however, rather than running every algorithm in search of the “best result” (more on how to determine the best result later), it helps to narrow the field down to the best possible candidates to reduce wasted cost and time.

Once a suitable approach is determined, iterations to get the best result for the problem start. In software development projects, a compiled binary or package that passes all the testing effort is the release candidate (sometimes with voting involved in Open Source projects). In Machine Learning, the criteria for becoming a release candidate can vary from problem to problem. The aim of Experimentation is to figure out the approach that best satisfies the success criteria for the given problem.

Experimentation usually uses a subset of the data available for the problem. This matters for the quick results Experimentation needs, since a “build” in Machine Learning takes far longer to complete than a source code compilation. The full data set is used during the Training phase, once the correct solution has been determined.

In a branch of solutions known as Supervised Learning, the input data set is (usually randomly) split into a Training set and a Validation set. The first group is used to train the Model and the second group is used to test the resulting Model. In some cases, this is done for multiple iterations (including randomly re-splitting the training and validation sets) to arrive at a number of Models that show some variation in the results they produce. Metrics, which we will discuss later, are used to compare these Models and determine whether the approach produces consistent results with limited variation, or whether it is too unstable to find a suitable pattern.
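
As a concrete illustration, here is a minimal sketch of such a split using scikit-learn’s train_test_split; the data is randomly generated placeholder data, not from any real problem.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows with 10 features and a binary label.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Randomly hold out 20% of the rows as the Validation set; fixing the seed
# makes the split reproducible across iterations.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)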

In Supervised Learning, the training data should be properly labelled. In some classes of problems, labelling can be a cumbersome task in itself, especially given the volume of data involved. Various methods, from manual labelling by humans to using Machine Learning for the labelling itself, are used to generate proper training data sets.

Amazon SageMaker offers a service named Ground Truth, a manual labelling service that offloads tedious but accurate data labelling to humans on the AWS side. Ground Truth provides fine-grained control over what kind of labelling should be done on the data sets and how accurate the labelling should be (e.g. multiple labelling inputs per class of objects). As far as I understand, Ground Truth is the most detailed service in this area, surpassed perhaps by a handful of third party services that may not integrate with your AWS setup as neatly as Ground Truth does.

In addition to manual labelling, Amazon SageMaker also offers automatic data labelling, which uses already built Machine Learning Models to do the labelling (object detection, image classification, etc.).

As mentioned above, getting it right in the EDA and Experimenting phases is critical for a Machine Learning project. Mistakes in these phases can result in the entire effort being scrapped and having to start again, which can mean the failure of an entire project. While other software projects can, and do, fail for the same overall reasons, the wiggle room available for fixing mistakes down the line in a Machine Learning project is slim.

Amazon SageMaker’s Experiment (and Trial) concept makes it easy to track objective metrics across different experiments for comparison. This is fully integrated with SageMaker Studio, the web-based IDE for Machine Learning work by AWS, and with the Amazon SageMaker SDK, so registering a particular training run as an Experiment is streamlined and easy to track afterwards. This is especially helpful when multiple members of the team are working on the same problem and registering experiment runs against the same experiment version.
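
As a rough sketch of what that registration can look like with the SageMaker Python SDK (assuming a recent SDK version that ships the experiments Run API; the experiment, run, parameter, and metric names below are made up):

from sagemaker.experiments import Run

# Anything logged (or any training job started) inside this context is
# associated with the named experiment and run in SageMaker Studio.
with Run(experiment_name="fraud-detection", run_name="xgboost-depth-5") as run:
    run.log_parameter("max_depth", 5)
    run.log_metric(name="validation:recall", value=0.93)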

As far as training compute goes, Amazon SageMaker makes it easy to scale the training process in and out for Experimentation. More data and more compute can be added to a training process, whether it uses one of the built-in algorithms or a custom one. Data can be sharded across multiple nodes or fully replicated to every training instance.
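
A hedged sketch of what scaling out a training job can look like with the SageMaker Python SDK; the image URI, role, and S3 path are placeholders:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=4,               # scale out to four training nodes
    instance_type="ml.m5.xlarge",
)

train_input = TrainingInput(
    "s3://<bucket>/train/",
    distribution="ShardedByS3Key",  # shard data across nodes;
                                    # "FullyReplicated" copies everything to each node
)

estimator.fit({"train": train_input})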

When it comes to trying out different solutions, Amazon SageMaker allows picking from a list of built-in algorithms or customising model training to your needs through its custom container approach. Additionally, SageMaker’s “Local Mode” allows running containers built from your Docker image locally for quicker iterations during experimentation.
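
For example, pointing the same kind of Estimator at a local instance type runs the training container on your own machine (a sketch, assuming the SageMaker Python SDK with the Local Mode prerequisites installed; the image and paths are placeholders):

from sagemaker.estimator import Estimator

local_estimator = Estimator(
    image_uri="<your-custom-training-image>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="local",          # or "local_gpu" if a GPU is available
)

# Local Mode also accepts file:// inputs, so no S3 round trip is needed.
local_estimator.fit({"train": "file://./data/train"})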

Build Quality Metrics

If the result of each iteration of an ML build from the same source code (and the same version of the data set) can be different, how would we objectively determine which one to release?

In software development projects, the problem of comparing the quality of different build versions is solved with a mix of automated and manual testing. The version with the most tests passing, or the most critical tests passing, is typically released.

In Machine Learning, the passing criteria are more nuanced. Accuracy of Model predictions is not the ultimate metric to perfect. The typical example here is Fraud Detection. A Model that detects 100% of fraudulent transactions is really easy to build.

# Flags every transaction as fraud; TransactionMeta is a placeholder type.
def is_fraud(t: "TransactionMeta") -> bool:
  return True

With the above code, 100% of fraudulent transactions will be detected. However, 100% of legitimate transactions will also be marked as fraud, and the bank would be out of business by the end of the week.

For a successful Model, all of the following metrics should be understood.

  1. True Positive - an actual fraudulent transaction predicted as fraudulent
  2. True Negative - an actual legitimate transaction predicted as legitimate
  3. False Positive - an actual legitimate transaction predicted as fraudulent
  4. False Negative - an actual fraudulent transaction predicted as legitimate

For classification problems like this, Machine Learning uses a Confusion Matrix to better understand the effectiveness of the Model.

                 Actual Yes         Actual No
Predicted Yes    True Positive      False Positive
Predicted No     False Negative     True Negative

For this specific case, the True Positive and True Negative percentages should be high and the False Positive and False Negative counts should be low.

To standardise these measurements, derived metrics such as Recall (the True Positive rate), Precision (the percentage of flagged results that are actually relevant), Specificity (the True Negative rate), and RMSE (Root Mean Square Error, how far predictions deviate from actual values, commonly used for Regression problems) are used. Which metric the passing criteria should be based on depends on the problem. For the fraud detection case above, Recall is a good candidate, since it focuses on reducing false negatives, and in fraud detection the cost of false negatives is high. For cases where the cost of false positives is high, such as drug testing, Precision could be a better metric to aim for.
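
As a worked example with made-up confusion-matrix counts:

# Hypothetical counts from a validation set of 1,000 transactions.
tp, fp, fn, tn = 80, 40, 20, 860

recall = tp / (tp + fn)        # True Positive rate: 0.80
precision = tp / (tp + fp)     # share of flagged transactions that really are fraud: ~0.67
specificity = tn / (tn + fp)   # True Negative rate: ~0.96

print(f"recall={recall:.2f} precision={precision:.2f} specificity={specificity:.2f}")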

Experimentation should also focus on avoiding overfitting a Model to a specific training data set. This is where the training process becomes too accurate for the training data and latches onto a pattern in the training data that doesn’t really exist in real world data. A Model should be adequately accurate while tolerating a certain level of error for it to work successfully with real world data. Think of overfitting as hard coding a Model to its training data.
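
One simple, hedged way to spot this during Experimentation is to compare the same metric on the training and validation sets; a large gap suggests the Model has memorised the training data. The sketch below reuses the X_train/X_val split from the earlier example and an illustrative scikit-learn model.

from sklearn.tree import DecisionTreeClassifier

# An unconstrained decision tree memorises random training data easily.
model = DecisionTreeClassifier(max_depth=None)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # typically close to 1.0
val_acc = model.score(X_val, y_val)         # noticeably lower when overfitting

print(f"train={train_acc:.2f} validation={val_acc:.2f}")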

Build Time

In typical software development, the time split between developing and building a project is usually weighted towards development. Developers spend time designing, researching, writing, and testing code, and automated pipelines do the final building. Local compiling and building is optimised to use local caches and mirror repositories, and takes at most tens of minutes to complete, even for a large Java project with Maven2 slowly replenishing its cache. New programming languages boast about reduced compile times. Even for extremely large code bases, builds that go beyond a few hours are rare (time spent in build queues is a different story).

In Machine Learning, Training can take days to finish. While data analysis and experimentation can take more time overall, training runs take longer than almost any software engineering build ever would. Added to this are the specific hardware requirements of a training job. Depending on the algorithm, the training time can be reduced by adding more CPUs and/or GPUs to the mix. However, with the volume of data and the complexity of the calculations involved, little can be done about the “build times” involved in Machine Learning.

The above mentioned SageMaker features, like Local Mode, can help bring down the time spent during testing. Features such as incremental training can help do the same for continuous training after the Model is pushed to production. Continuous Training is a topic that will be covered later in the series.

Artefact Management

As discussed before, while a software development product can be reproduced fairly easily, a trained ML Model that is potentially usable has to be persisted for future reference and comparison. A newer Model version, trained with what was thought to be better logic and with different data, could end up being worse than the earlier version. This isn’t like regression bugs in software development; it is more a result of the ambiguous, black box like nature of Models and the training process.

Artefact Management isn’t a new concept introduced by Machine Learning. But where software development projects can use the same registry for different artefacts (binaries, resources, libraries, and dependencies), Machine Learning projects have to interact with a wider variety of registries.

Feature Stores and Data Set Management

Features, as discussed before, are engineered out of existing data and are used to develop Models from a given data set. In an organisation, Features derived from its data could be used for different efforts in different projects. Instead of generating Features for each project, they could be made available to different teams through a Feature Store. Since EDA (Exploratory Data Analysis) and Feature Engineering can take a lot of time out of a project, having a common repository of refined Features can accelerate a Machine Learning project. Likewise, giving data engineering teams and Data Scientists proper access to data sets avoids having to put ad hoc processes in place for data access.
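
A hedged sketch of what publishing engineered Features to Amazon SageMaker Feature Store can look like; the feature group name, columns, bucket, and role below are all placeholders:

import time
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

# Placeholder Features engineered elsewhere in the project.
features_df = pd.DataFrame({
    "transaction_id": ["t-1", "t-2"],
    "amount_zscore": [0.4, 2.7],
    "event_time": [time.time(), time.time()],
})

feature_group = FeatureGroup(name="transaction-features",
                             sagemaker_session=Session())
feature_group.load_feature_definitions(data_frame=features_df)
feature_group.create(
    s3_uri="s3://<bucket>/feature-store/",
    record_identifier_name="transaction_id",
    event_time_feature_name="event_time",
    role_arn="<execution-role-arn>",
    enable_online_store=True,
)

# In practice, wait for the feature group to become active before ingesting.
feature_group.ingest(data_frame=features_df, wait=True)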

Model Registry

Each outcome of a training run will most likely end up as a version of a given Model type. These are usually stored in file or object storage services; however, the metadata related to each version is as important as the Model itself. This metadata could include training details, validation metrics, and other useful data that can be used to compare versions and decide what to deploy into production. A Model Registry is such a metadata management registry.

Amazon SageMaker provides a fully featured Model Registry that can track a Model along with metadata such as the hyperparameters used to train a specific version, and trace the lineage of the Model across different Experiments and releases. It also allows an approval process to be incorporated into the Model training and deployment process to gatekeep production Model deployments.
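
As a rough sketch, assuming a sagemaker.model.Model object (here called model) produced by a training job, registering it into a Model Package Group with a manual approval gate could look like this; the group name and instance types are illustrative:

# Register the trained Model version into a Model Package Group; versions
# stay "PendingManualApproval" until someone approves them for deployment.
model_package = model.register(
    model_package_group_name="fraud-detection-models",
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    approval_status="PendingManualApproval",
)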

Library and dependency management

As in other software development projects, software libraries and dependencies are part of the development process in Machine Learning projects, and the supply chain needs to be managed for security and predictability. At the same time, in-house libraries will have to be developed to help Data Scientists adhere to best practices when working with their development and training environments, and to abstract the platform management details away from their part of the work. These will need to be stored, versioned, and made available under tight access control.


The comparison between Machine Learning and Software Development continues in the next article.