One of the biggest paradigm shifts in technology is the ongoing, widespread adoption of Artificial Intelligence (AI) and Machine Learning (ML). The terms AI and ML are often used interchangeably, although ML is in fact a subset of AI. The fundamental concepts and the mathematics behind ML are not really new; it is the continuous improvement in hardware performance, software innovation, and above all data availability that has made practical, real-life applications possible.
To build and deploy an ML model, we need to perform several distinct steps in a certain order. This set of steps is known as the ML pipeline. The figure below illustrates a typical ML pipeline.
Data preparation
Data preparation is the first step in the ML pipeline, and probably one of the most important of all. The saying “garbage in, garbage out” fits it well. ML requires data to be formatted in a very specific way, so datasets generally require some amount of preparation or processing before they can yield useful insights and inferences.
Data preparation involves many substeps. The very first challenge is to collect the most relevant data. Data could come from a single source or from multiple sources; a single source is easier to handle because it avoids consistency problems. A careful inspection is required to check the consistency of the data. When data is collected from multiple sources, some sources often turn out to be less reliable than others. There is no quick fix for this, and it will remain a challenge as the number of data sources generally keeps increasing.
Once data is collected, whether from a single source or multiple sources, it needs to be analysed to see how clean and consistent it is. This involves checking for missing values and wrong formats, for example numbers stored as strings.
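As a minimal illustration, the pandas sketch below counts missing values and flags numeric-looking columns that arrived as strings; the file name collected_data.csv is hypothetical.

```python
import pandas as pd

# Load a hypothetical collected dataset (path is illustrative).
df = pd.read_csv("collected_data.csv")

# Count missing values per column.
print(df.isna().sum())

# Spot numbers stored as strings: try converting object columns to numeric
# and report how many values fail to parse.
for col in df.select_dtypes(include="object").columns:
    parsed = pd.to_numeric(df[col], errors="coerce")
    n_failed = parsed.isna().sum() - df[col].isna().sum()
    print(f"{col}: {n_failed} values could not be parsed as numbers")
```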
Data can also come with outliers and anomalies, which may occur when data is collected from unknown or unreliable sources. Outliers and anomalies should be carefully inspected to determine whether they result from errors or are naturally occurring. This step requires a lot of manual inspection and is therefore often very time consuming.
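One simple way to surface candidate outliers for that manual inspection is an interquartile-range rule. The sketch below assumes the data sits in a pandas DataFrame; the 1.5 multiplier is only the conventional default, not a recommendation.

```python
import pandas as pd

def flag_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    # Return only rows containing at least one flagged value, so they can be
    # inspected manually before deciding whether to drop or keep them.
    return df[mask.any(axis=1)]
```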
Dimensionality reduction is another important part of data preparation. It can reduce the time required to train ML models and can also help avoid over-fitting. A dataset can come with a large number of features, but many of them may be redundant or poorly correlated with the target. Irrelevant features, and features with a single unique value, are normally dropped.
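As an illustrative sketch (not the only approach), the snippet below drops single-unique-value columns and optionally projects the remaining numeric features onto fewer dimensions with PCA; the number of components is arbitrary.

```python
import pandas as pd
from sklearn.decomposition import PCA

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns with a single unique value; they carry no signal.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)

def reduce_dimensions(df: pd.DataFrame, n_components: int = 10):
    # Project the numeric features onto fewer dimensions (n_components is illustrative).
    pca = PCA(n_components=n_components)
    return pca.fit_transform(df.select_dtypes(include="number"))
```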
About 60-80% of the time in an ML project is spent on fetching and preparing data, which is a large part of the ML development process. Obtaining properly labeled data is not easy.
Source: https://e-technologynews.com/data-preparation-for-machine-learning-still-requires-humans-techtarget/
ML itself can help with some data preparation activities. Many tools have started using ML to detect anomalies and patterns, clean data, and assist data engineers. Data labeling is one area where ML-equipped tools help reduce the manual effort needed to build models. Such tools will continue to improve, but data preparation will still remain a challenging task.
Training ML model
Although the time spent on training ML models is generally less than that spent on the other steps, the complexity of creating the models is considerable. Building a good ML model from scratch requires a solid mathematical foundation, and it is also a very active, fast-evolving research area. Training an ML model requires a learning algorithm: the algorithm takes data as input and learns the patterns in it. Once the model is trained, it can be used to predict outcomes on new data.
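The sketch below shows this train-and-predict flow with scikit-learn; a synthetic dataset stands in for prepared data, and the choice of logistic regression is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data stands in for a prepared dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The learning algorithm takes data as input and learns its patterns.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The trained model predicts outcomes on data it has not seen before.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```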
There are several types of ML models; the most commonly used are regression, binary classification and multiclass classification. The model families are fairly standard by now and usually require only a few adjustments to fit the problem. Most models come with hyperparameters to choose, and the chosen values impact the performance and accuracy of the model. Finding good values is called hyperparameter optimization. One could try every combination of hyperparameters (brute force), but this would take an extremely long time and is therefore not preferred.
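Because brute-forcing every combination is impractical, randomized search is one common alternative. The sketch below tunes two SVM hyperparameters with scikit-learn's RandomizedSearchCV; the parameter ranges and budget are illustrative only.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Instead of trying every combination, sample a fixed number of candidates.
param_distributions = {
    "C": loguniform(1e-3, 1e3),      # illustrative ranges, not recommendations
    "gamma": loguniform(1e-4, 1e-1),
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
```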
Probably the biggest challenge in training ML models is memory and computational power. Training a model with a large set of features and a large amount of data can be very time consuming; training it with optimal hyperparameters is even more so. The good news is that these processes are mostly automated: once the data is prepared and the model is selected, the rest can proceed without human intervention.
ML is becoming more and more democratized, so a large set of ML models is now available and shared free of cost. However, finding a fully optimal ML model remains a challenge.
Deploying model
Once a model is trained and tested for accuracy, it is ready for deployment in a production environment. Training a model is an iterative process: since new data keeps arriving, the model can be retrained and may improve its performance.
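One possible, tool-agnostic way to make a trained model deployable and re-deployable is to serialize it together with version metadata, as sketched below with joblib; the directory layout, file naming and version scheme are assumptions for illustration only.

```python
import json
import os
import time

import joblib

def save_model_version(model, metrics: dict, out_dir: str = "models") -> str:
    """Persist a trained model with a timestamped version tag and its test metrics."""
    os.makedirs(out_dir, exist_ok=True)
    version = time.strftime("%Y%m%d-%H%M%S")
    joblib.dump(model, os.path.join(out_dir, f"model-{version}.joblib"))
    with open(os.path.join(out_dir, f"model-{version}.json"), "w") as fh:
        json.dump({"version": version, "metrics": metrics}, fh)
    return version
```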
One can think of traditional code deployment with a number of ML-specific additions. For example, an ML pipeline depends on both code and data. The data part is probably the most fragile and can change substantially over time. It is essential to know the data well, so that in deployment we can tell whether ML issues are caused by the code or by the data. Moreover, the more data we have, the better the model can be trained.
Reproducibility is an important part of development and operations, which implies that logging, dependency management, versioning, data collection, feature engineering and many other areas have to be done to the highest standard. Reproducibility should be built in from the beginning. In ML applications many things can go wrong that are not picked up by traditional unit, component and integration tests. Misconfiguration, wrong hyperparameters, deploying the wrong version of a trained model, poor feature selection and models trained on outdated datasets are the most common ones, and they usually do not follow traditional unit and integration test patterns.
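As a hedged sketch of what building reproducibility in from the beginning can look like, the snippet below fixes random seeds and records library versions plus a dataset hash for a training run; the file names and recorded fields are illustrative.

```python
import hashlib
import json
import random
import sys

import numpy as np
import sklearn

def record_run_context(data_path: str, seed: int = 42) -> dict:
    """Fix random seeds and capture the details needed to reproduce a training run."""
    random.seed(seed)
    np.random.seed(seed)
    with open(data_path, "rb") as fh:
        data_hash = hashlib.sha256(fh.read()).hexdigest()
    context = {
        "seed": seed,
        "python": sys.version,
        "numpy": np.__version__,
        "sklearn": sklearn.__version__,
        "data_sha256": data_hash,
    }
    with open("run_context.json", "w") as fh:
        json.dump(context, fh, indent=2)
    return context
```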
CI (Continuous Integration) is slightly different in ML. It is no longer only about testing and validating code and components, but also about testing and validating data, data schemas and models. Similarly, CD (Continuous Deployment) is no longer about a single piece of software or a server, but about a system that automatically deploys another service.
A new property, CT (Continuous Training), is gaining popularity in ML systems. The idea of CT is to automatically retrain and serve models.
MLOps is a concept similar to the now-popular DevOps. However, MLOps focuses on increasing automation and improving the quality of production ML while keeping business requirements in view. MLOps deals with the entire lifecycle: from ML model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics. However, several barriers make it difficult to implement ML across organizations; the most prominent are scalability, collaboration, business uses, diagnostics, reproducibility, model predictions, deployment and automation.
Monitoring model
The deployed ML model should be monitored continuously in order to detect errors or performance issues that fall short of the required performance criteria. We basically need automatic detection mechanisms, such as critical alarms and key performance metric reporting.
Even before we deploy a machine learning model, we need a good monitoring strategy that tracks all the input and output variables. Insufficient monitoring may leave incorrect models unchecked in production, or allow errors that appear in models over time to go uncaught.
Besides classic software application monitoring, the most important aspects to monitor are:
- Data health: data can change over time, so it is essential to keep monitoring those changes (a minimal drift check is sketched after this list).
- Features: the set of features the model uses should be monitored.
- Model performance: this can also vary over time as new input data arrives, so it needs to be monitored continuously.
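A minimal data-health check, assuming numeric features and using a two-sample Kolmogorov-Smirnov test from SciPy, might look like the sketch below; the p-value threshold and the simulated data are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values: np.ndarray, live_values: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Return True if a feature's live distribution differs significantly
    from its training distribution (two-sample Kolmogorov-Smirnov test)."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold

# Example: flag drift when production data shifts relative to training data.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
live = rng.normal(0.5, 1.0, size=5000)   # simulated shifted production data
print("drift detected:", check_feature_drift(train, live))
```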
Several ML metrics have been developed and can be incorporated into ML monitoring. The most commonly used are listed below; a short example of computing them follows the list:
- Mean Squared Error (MSE) Score
- Type 1 and 2 Errors
- Accuracy and Precision
- Recall
- F1 Score
- R-Squared
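The scikit-learn sketch below computes several of these metrics on made-up predictions, purely to illustrate how they can be plugged into monitoring.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

# Illustrative classification labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# Illustrative regression targets and predictions.
y_true_reg = [2.5, 0.0, 2.1, 1.6]
y_pred_reg = [3.0, -0.1, 2.0, 1.4]
print("MSE      :", mean_squared_error(y_true_reg, y_pred_reg))
print("R-squared:", r2_score(y_true_reg, y_pred_reg))
```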
Data may also be sensitive or subject to compliance standards such as GDPR, so we need to keep monitoring whether the data continues to meet those standards.
Conclusion
In conclusion, building a robust ML pipeline is quite a challenging task due to the dynamic nature of data and of ML model training. Changes in data are often unpredictable, and this unpredictability requires special consideration. Conventional software development principles still apply, but they are complemented by ML model training and ML-focused deployment and monitoring. MLOps is an emerging discipline, and we expect that it will become more mature and relevant in the future.