Table of Contents
- MLOps Notion
- About The Project
- How it works
- Getting Started
- Usage
- Acknowledgments
MLOps (Machine Learning Operations) is designed to facilitate the installation of ML software in a production environment.
The MLOps SIG defines the term MLOps as
“the extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology”
and as
“the ability to apply DevOps principles to Machine Learning applications”.
MLOps combines machine learning, application development, and operations.
MLOps is the combination of ModelOps, DataOps, and DevOps.
Data engineering is the step that acquires and prepares the data to be analyzed. Typically, data is integrated from various sources and comes in different formats. Collecting good data sets has a huge impact on the quality and performance of the ML model; therefore, the data used to train the ML model indirectly influences the overall performance of the production system.
Data engineering pipeline:
- Data Ingestion, collecting data by using various frameworks and formats, such as internal/external databases, data marts, OLAP cubes, data warehouses, OLTP systems, Spark, HDFS, CSV, etc.
- Exploration and Validation, data validation operations are user-defined error detection functions that scan the dataset in order to spot errors.
- Data Wrangling (Cleaning), the process of re-formatting or re-structuring particular attributes and correcting errors in the data.
- Data Splitting, splitting the data into training, validation, and test datasets to be used during the core machine learning stages to produce the ML model.
The core of the ML workflow is the phase of writing and executing machine learning algorithms to obtain an ML model.
Issue: model decay, the performance of ML models in production degenerates over time because of changes in the real-life data that were not seen during model training.
Model engineering pipeline:
- Model Training, the process of applying the machine learning algorithm on training data to train an ML model. It also includes feature engineering and hyperparameter tuning.
- Model Evaluation, validating the trained model to ensure it meets the original codified objectives before serving the ML model to the end-user in production.
- Model Testing, performing the final “Model Acceptance Test” on the held-back test dataset to estimate the generalization error.
- Model Packaging, the process of exporting the final ML model into a specific format (e.g. PMML, PFA, or ONNX) that describes the model, so it can be consumed by the business application (a minimal sketch follows this list).
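For example, a minimal packaging sketch with pickle (one of the formats used later in this project's Data Catalog) could look like the following; the tiny training data below is only a placeholder to make the sketch runnable:
import pickle
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data, only to make the sketch runnable
regressor = RandomForestRegressor(max_depth=2, random_state=42)
regressor.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Export the trained model so the business application can consume it
with open("regressor.pickle", "wb") as f:
    pickle.dump(regressor, f)

# The serving side loads the packaged artifact back
with open("regressor.pickle", "rb") as f:
    model = pickle.load(f)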
Once we have trained a machine learning model, we need to deploy it as part of a business application.
This stage includes the following operations:
- Model Serving, the process of addressing the ML model artifact in a production environment.
- Model Performance Monitoring, the process of observing the ML model performance on live, previously unseen data. In particular, we are interested in ML-specific signals, such as prediction deviation from previous model performance. These signals can be used as triggers for model re-training.
- Model Performance Logging, every inference request results in a log record.
The picture below represents the machine learning model life cycle inside an average organization today. We can observe that it involves many different people with completely different skill sets, who often use entirely different tools.
The complete MLOps process includes three broad phases of “Designing the ML-powered application”, “ML Experimentation and Development”, and “ML Operations”.
All three phases are interconnected and influence each other.
The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process. With increased maturity, the velocity for the training of new models is also increased.
The objective of an MLOps team is to automate the deployment of ML models into the core software system or as a service component. There are three levels of MLOps automation, starting from the initial level with manual model training and deployment, up to running both ML and CI/CD pipelines automatically.
We are interested in the identity, components, versioning, and dependencies of these ML artifacts. The target destination for an ML artifact may be a (micro-) service or some infrastructure components.
A deployment service provides orchestration, logging, monitoring, and notification to ensure that the ML models, code and data artifacts are stable.
The goal of versioning is to treat ML training scripts, ML models, and data sets as first-class artifacts by tracking them with version control systems. With data scientists building, testing, and iterating on several versions of models, they need to be able to keep all the versions straight.
Furthermore, every ML model specification should be versioned in a VCS to make the training of ML models auditable and reproducible.
In general, reproducibility in MLOps also involves the ability to easily rerun the exact same experiment. Data scientists may need to be able to go back to different "branches" of the experiments, for example restoring a previous state of a project.
ML reproducibility must provide relevant metadata and information to reproduce models. Model metadata management includes the type of algorithm, features and transformations, data snapshots, hyperparameters, performance metrics, verifiable code from source code management, and the training environment.
Experimentation takes place throughout the entire model development process, and usually every important decision or assumption comes with at least some experiment or previous research to justify it.
Data scientists need to be able to quickly iterate through all the possibilities for each of the model building blocks.
There are four key metrics to measure and improve one's ML-based software delivery:
- Deployment Frequency, how often does your organization deploy code to production or release it to end-users?
- Lead Time for Changes, how long does it take to go from code committed to code successfully running in production?
- Mean Time To Restore, how long does it generally take to restore service when a service incident or a defect that impacts users occurs?
- Change Fail Percentage, what percentage of changes to production or released to users result in degraded service and subsequently require remediation?
These are the same metrics used to capture the effectiveness of software development and delivery in elite/high-performing organisations.
This project puts the steps of MLOps into practice and is completed by the Production phase (Observability phase) at https://ProductionPhase_project.com.
The following image illustrates how the Develop phase works. The entire development process is managed by workflow orchestration, which cyclically performs several steps, each executed by a specific tool.
📚 Theory: this applies the CI/CD methodology. The desire in MLOps is to automate the CI/CD pipeline as far as possible.
Kedro is used as the workflow orchestrator: an open-source Python framework for creating reproducible, maintainable, and modular data science code. Kedro provides a template for new data engineering and data science projects, and it helps organize all MLOps steps in a well-defined pipeline.
When you initialize a Kedro project with the command:
kedro new
the needed folders and files are created automatically. The most important of these are as follows:
- conf/base/catalog.yml, the Data Catalog file
- conf/base/parameters/, folder with the parameters for each pipeline
- data/, folder with all data sets and other outputs, separated into numbered folders such as 01_raw, 02_intermediate, 06_models, and 08_reporting
- logs/, the logs folder
- src/kedro_ml/pipeline_registry.py, file with the pipeline names
- src/kedro_ml/pipelines/, folder with each pipeline
- src/requirements.txt, file with the needed packages
- Data Catalog
- It makes the datasets declarative rather than imperative, so all the information related to a dataset is highly organized.
- Node
- It is a Python function that accepts inputs and optionally provides outputs.
- Pipeline
- It is a collection of nodes. It creates the Kedro DAG (directed acyclic graph).
In this project the Data Catalog is implemented in conf/base/catalog.yml.
Here the type, destination filepath, and whether it is versioned are defined for each data set output by the nodes.
You can define all your data sets by using simple YAML syntax.
Documentation for this file format can be found at: Data Catalog Doc
Examples:
model_input_table:
type: pandas.CSVDataSet
filepath: data/04_feature/model_input_table.csv
versioned: true
regressor:
type: pickle.PickleDataSet
filepath: data/06_models/regressor.pickle
versioned: true
metrics:
type: tracking.MetricsDataSet
filepath: data/09_tracking/metrics.json
versioned: true
The next one is more complex: this plot is also shown in Kedro-Viz, the Kedro interactive visualization platform.
plot_feature_importance_img:
type: plotly.PlotlyDataSet
filepath: data/08_reporting/plot_feature_importance_img.json
versioned: true
plotly_args:
type: bar
fig:
x: importance
y: feature
orientation: h
layout:
xaxis_title: importance
yaxis_title: feature
title: Importance for feature
As you can see:
The node functions are written in the nodes.py file of the respective pipeline folder.
import pandas as pd
from typing import Dict, Tuple

def preprocess_activities(activities: pd.DataFrame) -> Tuple[pd.DataFrame, Dict]:
    """Preprocesses the data for activities.
    Args:
        activities: Raw data.
    Returns:
        Preprocessed data and a dictionary of column metadata.
    """
    # Validate and clean the raw activities data with private helpers
    activities = _validation(activities)
    activities = _wrangling(activities)
    return activities, {"columns": activities.columns.tolist(), "data_type": "activities"}
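The private helpers _validation and _wrangling are not shown in this README. A hypothetical sketch of what they could do, consistent with the pandas operations shown later (dropping duplicates and handling missing values), is:
def _validation(activities: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: drop duplicate rows and rows without a distance value
    activities = activities.drop_duplicates()
    return activities.dropna(subset=["Distance (km)"])

def _wrangling(activities: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: fill missing heart-rate values with the column mean
    mean_hr = activities["Average Heart rate (tpm)"].mean()
    activities["Average Heart rate (tpm)"] = activities["Average Heart rate (tpm)"].fillna(mean_hr)
    return activities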
The pipeline definition is written in the pipeline.py file of the respective pipeline folder.
From the example we can see the nodes, each with its function, inputs, outputs, and name. These must match the node implementations (see above), and the output names are the data set names declared in the Data Catalog (see above).
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import create_model_input_table, exploration_activities, preprocess_activities

def create_pipeline(**kwargs) -> Pipeline:
return pipeline(
[
node(
func=preprocess_activities,
inputs="activities",
outputs=["preprocessed_activities", "activities_columns"],
name="preprocess_activities_node",
),
node(
func=exploration_activities,
inputs="activities",
outputs="exploration_activities",
name="exploration_activities_node",
),
node(
func=create_model_input_table,
inputs=["preprocessed_activities", "params:table_columns"],
outputs="model_input_table",
name="create_model_input_table_node",
),
]
)
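The pipelines are then registered in src/kedro_ml/pipeline_registry.py. A minimal sketch of that file, assuming the two pipelines mentioned in this README (data_processing and data_science); the actual project file may differ:
from typing import Dict

from kedro.pipeline import Pipeline

from kedro_ml.pipelines import data_processing as dp
from kedro_ml.pipelines import data_science as ds

def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines (sketch; the real file may differ)."""
    data_processing_pipeline = dp.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    return {
        "__default__": data_processing_pipeline + data_science_pipeline,
        "dp": data_processing_pipeline,
        "ds": data_science_pipeline,
    }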
Kedro requires installation:
pip install kedro
Kedro also has a GUI, called Kedro-Viz: an interactive visualization of the entire pipeline. It is a tool that can be very helpful for explaining what you are doing to other people.
To open the Kedro UI:
kedro viz
then go to the 127.0.0.1:4141 browser page.
📚 Theory: versioning is essential to reproduce the experiments. Reproducibility in MLOps also involves the ability to easily rerun the exact same experiment.
DVC is used as the data versioning management tool. It provides a way to handle large files, data sets, machine learning models, and metrics.
When you initialize DVC with:
dvc init
the .dvc folder is created; its most important file is config, which contains the URL of the remote destination where the data is saved.
DVC stores information about an added file in a special .dvc file (for example data/data.xml.dvc); this metadata file is a placeholder for the original data.
In this case we save the original dataset, data/01_raw/DATA.csv.
In this project a personal Google Drive is used as the remote; the .dvc/config file is:
[core]
remote = storage
autostage = true
['remote "storage"']
url = gdrive://1LMUFVzJn4CNaqVbMsVGZazii4Mdxsanj
DVC needs to be installed:
pip install dvc
To update the data in remote storage, we use the following commands:
dvc add data/01_raw/DATA.csv
This updates or creates the .dvc file. In the corresponding sub-folder of .dvc/cache/ we can see the data to be saved.
dvc push data/01_raw/DATA.csv
Now a new folder with the new data version appears in Google Drive.
Note: if the data to save has not changed (and is already saved), the .dvc file is not updated and no new folder appears in Google Drive.
DVC is an even more useful tool than this: as a data manager, we can also create data pipelines and specify metrics, parameters, and plots. DVC is also an experiment manager, providing comparison and visualization of experiment results.
For more: LINK MIO FILE Part 3
📚 Theory: performing exploratory data analysis (EDA) is when data scientists or analysts examine the available data sources to train an ML model.
pandas is a Python package providing fast, flexible, and expressive data structures.
It has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language, and it is already well on its way toward this goal.
pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame; a one-dimensional structure is called a Series.
It requires installation, with pip or conda:
pip install pandas
In the code, pandas DataFrames are used as node function inputs and outputs. For example:
def preprocess_activities(activities: pd.DataFrame) -> Tuple[pd.DataFrame, Dict]
def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple
Examples of pandas methods:
apps: pd.DataFrame
# Drop duplicates
apps.drop_duplicates(inplace = True)
# Calculate the MEAN, and replace any empty values with it
x = apps["Average Heart rate (tpm)"].mean()
apps["Average Heart rate (tpm)"].fillna(x, inplace = True)
# Drop any remaining rows that contain empty cells
apps.dropna(inplace = True)
activities: pd.DataFrame
# Total number of elements in the DataFrame
totalNumber = activities.size
# Simple descriptive statistics on single columns
maxDistance = activities["Distance (km)"].max()
meanAverageSpeed = activities["Average Speed (km/h)"].mean()
minAverageHeartRate = activities["Average Heart rate (tpm)"].min()
📚 Theory: the data scientist implements different algorithms with the prepared data to train various ML models. In addition, the implemented algorithms are subjected to hyperparameter tuning to get the best-performing ML model.
scikit-learn, or sklearn, is an open-source machine learning library.
It provides simple and efficient tools for predictive data analysis, as well as various tools for model fitting, data preprocessing, model selection, and model evaluation.
To change the parameters, update the conf/base/parameters/data_science.yml file with the settings used during model management.
model_options:
test_size: 0.2
val_size: 0.25
random_state: 42
max_depth: 2
features:
- Distance (km)
- Average Speed (km/h)
- Calories Burned
- Climb (m)
- Average Heart rate (tpm)
To pass these parameters as inputs to nodes, the specification is written in the src/kedro_ml/pipelines/data_science/pipeline.py file. For example:
node(
func=split_data,
inputs=["model_input_table", "params:model_options"],
name="split_data_node",
outputs=["X_train", "X_test", "X_val", "y_train", "y_test", "y_val"],
),
Note: the following snippets are taken from src/kedro_ml/pipelines/data_science/nodes.py.
from sklearn.model_selection import train_test_split
# First split off the held-back test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=parameters["test_size"], random_state=parameters["random_state"])
# Then carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=parameters["val_size"], random_state=parameters["random_state"])
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(max_depth=parameters["max_depth"], random_state=parameters["random_state"])
regressor.fit(X_train, y_train)
from sklearn.model_selection import GridSearchCV
# define search space
space = dict()
space['max_depth'] = [1,2,3]
space['random_state'] = [41,42,43,44]
# define search
search = GridSearchCV(regressor, space, scoring='neg_mean_absolute_error')
# execute search
result = search.fit(X_train, y_train)
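The fitted search object then exposes the best combination found (these attributes are part of scikit-learn's GridSearchCV):
# Inspect the outcome of the hyperparameter search
print("Best score (negated MAE):", result.best_score_)
print("Best parameters:", result.best_params_)
# best_estimator_ is the model refitted on the whole training set
best_regressor = result.best_estimator_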
from sklearn import metrics
# Predictions on the validation set (assumed here; not shown in the original snippet)
y_pred = regressor.predict(X_val)
# MAE measures the errors between the predicted values and the true values.
mae = metrics.mean_absolute_error(y_val, y_pred)
# MSE is the average squared difference between the predicted values and the true values.
mse = metrics.mean_squared_error(y_val, y_pred)
# ME captures the worst-case error between a predicted value and the true value.
me = metrics.max_error(y_val, y_pred)
It needs to be installed:
pip install -U scikit-learn
📚 Theory: ML experiment steps are orchestrated and performed automatically. The experiment environment is reused in the pre-production and production environments, which is a key aspect of MLOps practice for unifying with DevOps.
MLflow is an open source platform for managing the end-to-end machine learning lifecycle.
MLflow can be very helpful for tracking metrics over time: we can visualize them and communicate progress over time. MLflow centralizes all of these metrics, and also the models generated.
MLflow and Kedro are complementary, not conflicting, tools:
- Kedro is the foundation of your data science and data engineering project
- MLflow creates the centralized repository of metrics and progress over time
To specify more options, the MLproject file is used:
name: kedro mlflow
conda_env: conda.yaml
entry_points:
main:
command: "kedro run"
So when mlflow run is executed, the kedro run command is run.
It needs to be installed:
pip install mlflow
To log file:
mlflow.log_artifact(local_path=os.path.join("data", "01_raw", "DATA.csv"))
mlflow.log_artifact(local_path=os.path.join("data", "04_feature", "model_input_table.csv", dirname ,"model_input_table.csv"))
mlflow.log_artifact(local_path=os.path.join("data", "08_reporting", "feature_importance.png"))
mlflow.log_artifact(local_path=os.path.join("data", "08_reporting", "residuals.png"))
To log model:
mlflow.sklearn.log_model(sk_model=regressor, artifact_path="model")
To log key-value param:
mlflow.log_param('test_size', parameters["test_size"])
mlflow.log_param('val_size', parameters["val_size"])
mlflow.log_param('max_depth', parameters["max_depth"])
mlflow.log_param('random_state', parameters["random_state"])
To log key-value metric:
mlflow.log_metric("accuracy", score)
mlflow.log_metric("mean_absolute_erro", mae)
mlflow.log_metric("mean_squared_error", mse)
mlflow.log_metric("max_error", me)
To set key-value tag:
mlflow.set_tag("Model Type", "Random Forest")
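A minimal sketch of how these calls might be grouped inside a single tracked run (the parameters, mae, and regressor variables come from the earlier snippets):
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="activities-example"):
    # Parameters, metrics, tags and the model are attached to this run
    mlflow.log_param("max_depth", parameters["max_depth"])
    mlflow.log_metric("mean_absolute_error", mae)
    mlflow.set_tag("Model Type", "Random Forest")
    mlflow.sklearn.log_model(sk_model=regressor, artifact_path="model")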
Python version 3.7 is needed. Using conda:
conda create -n env_name python=3.7
conda activate env_name
You can run the mlflow project with:
mlflow run . --experiment-name activities-example
To run the mlflow project on Windows, use:
mlflow run . --experiment-name activities-example --no-conda
You can run the UI as follows:
mlflow ui
then go to the 127.0.0.1:5000 browser page.
📚 Theory: the validated model is deployed to a target environment to serve predictions. With Continuous Deployment (CD), the deployed system automatically deploys the model prediction service.
BentoML focuses on ML in production. By design, it is agnostic to the experimentation platform and the model development environment, and it only focuses on serving and deploying trained models.
MLflow focuses on loading and running a model, while BentoML provides an abstraction to build a prediction service, which includes the necessary pre-processing and post-processing logic in addition to the model itself.
BentoML is more feature-rich in terms of serving, it supports many essential model serving features that are missing in MLFlow, including multi-model inference, API server dockerization, built-in Prometheus metrics endpoint and many more.
BentoML stores all packaged model files under the ~/bentoml/repository/{service_name}/{service_version}
directory by default. The BentoML packaged model format contains all the code, files, and configs required to run and deploy the model.
bentofile.yaml
service: "service:svc"
include:
- "service.py"
- "src/kedro_ml/pipelines/data_processing/nodes.py"
conda:
environment_yml: "./conda.yaml"
docker:
env:
- BENTOML_PORT=3005
The BentoML basic steps are two:
- save the machine learning model
- create a prediction service
In the src/kedro_ml/pipelines/data_science/nodes.py file, the model is saved using the MLflow integration included in the BentoML module.
import bentoml
import os

bentoml.mlflow.import_model("my_model", model_uri=os.path.join(os.getcwd(), 'my_model', dirname))
In the service.py file:
def predict(input_data: pd.DataFrame):
with open(os.path.join("conf", "base", "parameters", "data_science.yml"), "r") as f:
configuration = yaml.safe_load(f)
with open('temp.json', 'w') as json_file:
json.dump(configuration, json_file)
output = json.load(open('temp.json'))
parameters = {"header":output["model_options"]["features"]}
input_data = create_model_input_table(input_data, parameters)
input_data, dict_col = preprocess_activities(input_data)
print("Start the prediction...")
return model_runner.predict.run(input_data)
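The predict function above relies on a model runner and a service object defined earlier in service.py (imports and the API decorator are omitted in the excerpt). A minimal sketch of that scaffolding, assuming the model was imported as my_model as shown above; the input/output descriptors here are an assumption:
import bentoml
from bentoml.io import PandasDataFrame, NumpyNdarray

# Load the latest imported model and wrap it in a runner
model_runner = bentoml.mlflow.get("my_model:latest").to_runner()

# The "svc" name matches the "service:svc" entry in bentofile.yaml
svc = bentoml.Service("activities_model", runners=[model_runner])

# The predict function shown above is exposed as an API endpoint, e.g.:
# @svc.api(input=PandasDataFrame(), output=NumpyNdarray())
# def predict(input_data: pd.DataFrame): ...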
A Bento is a file archive with all the source code, models, data files, and dependency configurations required for running a user-defined bentoml.Service, packaged into a standardized format. A Bento is created with the command:
bentoml build
The three most common deployment options with BentoML are:
- 🐳 Generate container images from Bento for custom docker deployment
- 🦄️ Yatai: Model Deployment at scale on Kubernetes
- 🚀 bentoctl: Fast model deployment on any cloud platform
Containerizing Bentos as Docker images allows users to easily distribute and deploy them. Use the command:
bentoml containerize activities_model:latest
BentoML requires installation:
pip install bentoml
To see all bento models:
bentoml models list
To see more about a bento model:
bentoml models get <name_model>:<number_version>
To build a bento model:
bentoml build
To start Bento model in production:
bentoml serve <name_model>:latest --production
In development you can run the Bento server with hot reloading (if you use Windows, run it without --reload):
bentoml serve service:svc --reload
or, more generally:
bentoml serve
Afterwards you can open the web page 127.0.0.1:3000 to access the model serving.
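For illustration only, once the server is running a prediction request could be sent from Python; the endpoint name and the sample payload below are assumptions, with column names taken from conf/base/parameters/data_science.yml:
import requests

# Hypothetical request against the locally served model
sample = [{"Distance (km)": 10.0, "Average Speed (km/h)": 11.5,
           "Calories Burned": 650, "Climb (m)": 120,
           "Average Heart rate (tpm)": 150}]
response = requests.post("http://127.0.0.1:3000/predict", json=sample)
print(response.json())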
📚 Theory: A Deployment pipeline is the process of taking code from version control and making it readily available to users quickly and accurately.
kedro-docker is a plugin to create a Docker image and run a Kedro project in a Docker environment.
The settings are in the Dockerfile.
To set the port number:
EXPOSE 3030
Note: by default the Bento port is 3000, but that is also the port used by Grafana, another tool used in the Production phase.
To set the command, use one of the following:
CMD ["kedro", "run"]
CMD [ "python3", "-m", "flask", "run", "--host=0.0.0.0", "--port=3030"]
Note: if we want the image to just run the Kedro pipeline, we use CMD ["kedro", "run"]; if we also want more interactivity with the Production phase, we use the second CMD, which starts the Flask application.
To install, run:
pip install kedro-docker
To create the Docker image of the Kedro pipeline:
kedro docker build --image pipeline-ml
To run the Docker model:
docker run <name_model>
or, for production:
docker run <name_model> serve --production
Or to run the project in a Docker environment:
kedro docker run --image <image-name>
To interact with the pipeline and all its steps, there is run.py, which responds to command lines. It makes tasks such as opening a tool GUI, creating a new model and its Bento, and updating the dataset easier. The available command lines are:
For communication between this project and the Production phase, there is a Flask application with an available API.
To run the Flask application, useful for the Production phase:
flask run --host=0.0.0.0 --port=3030
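As an illustration only (the real application's endpoints are not listed here), such a Flask app could expose a hypothetical endpoint that triggers a new Kedro run:
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/run-pipeline", methods=["POST"])
def run_pipeline():
    # Hypothetical endpoint: trigger a new Kedro run and report its exit code
    result = subprocess.run(["kedro", "run"], capture_output=True, text=True)
    return jsonify({"returncode": result.returncode})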
All necessary installations are listed in src/requirements.txt:
kedro
kedro[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]
kedro-viz
scikit-learn
matplotlib
seaborn
numpy
mlflow
bentoml
dvc
kedro-docker
requests
flask
At the 127.0.0.1:4141 browser page.
From here we can also see and compare the experiments, which are the versions created by running the Kedro project.
At the 127.0.0.1:5000 browser page.
From this page we can select a single experiment and see more information about it.
At the 127.0.0.1:3000 web page.
- ml-ops.org
- neptune.ai
- mlebook.com
- cloud.google.com about MLOps
- Made With ML
- Book "Introducing MLOps How to Scale Machine Learning in the Enterprise (Mark Treveil, Nicolas Omont, Clément Stenac etc.)"