The problem at hand revolves around determining the income levels of individuals registered in a past census exercise. The dataset was originally released by the US Census Bureau.
The predictions from this model will be useful in determining how to allocate resources for urban development and civil infrastructure projects. The insights gleaned from this model prediction will also aid the government in determining what kind of economic programs to initiate and implement, and better decide on how to improve the economic lot of citizens and residents.
The dataset can be accessed from this link. Certain transformations are necessary to unzip the data, convert it to CSV format and append columns. A robust pipeline could be built to feed in the data on a monthly basis, but that is not covered in this PoC MLOps project.
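The transformations above can be sketched with the standard library alone. This is a minimal, hedged illustration: the archive member name and the column list are assumptions (the real schema should come from the dataset's description file), not the project's actual preprocessing code.

```python
import csv
import zipfile

# Hypothetical column names -- the actual census schema should be taken
# from the dataset's accompanying description file.
COLUMNS = ["age", "workclass", "education", "income"]

def extract_to_csv(zip_path, member, out_path, columns=COLUMNS):
    """Unzip one member of the archive and rewrite it as a CSV file,
    prepending a header row (the raw census file has no header)."""
    with zipfile.ZipFile(zip_path) as zf:
        raw = zf.read(member).decode("utf-8")
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        for line in raw.splitlines():
            if line.strip():
                # Raw fields are comma-separated with stray whitespace
                writer.writerow([field.strip() for field in line.split(",")])
    return out_path
```

A monthly pipeline would wrap this function in a scheduled Prefect flow that also downloads the archive.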
Below is a representation of the various tools and technologies that make up the MLOps stack used in this project:
- Experiment tracking with MLflow, served from a Google Cloud Platform (GCP) Virtual Machine.
- Exploratory Data Analysis, research and experimentation with Jupyter Notebooks in a JupyterLab environment.
- Version control with git and GitHub.
- Training pipeline orchestration and scheduling with Prefect, deployed as a Prefect Orion server in a GCP VM, with Prefect storage set in a GCP storage bucket.
- Model and artifact registration using the MLflow model registry, with the default artifact storage located in a Google Cloud Storage bucket.
- Model served from the MLflow registry using the Flask framework.
- Model monitoring using Grafana, Prometheus and Evidently AI.
- Stream data simulated and sent to the prediction service by running the script in ./stream-generator/send_data.py
- MLOps engineering best practices implemented, including unit tests, linting and formatting using pylint and black, as well as git pre-commit hooks.
- Docker images for prediction and monitoring tagged and pushed to Docker Hub for easy redeployment in the future.
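The stream simulation in ./stream-generator/send_data.py is not reproduced here; a minimal sketch of the idea — posting one JSON record at a time to the prediction endpoint — might look like the following. The endpoint URL, record schema, and helper names are illustrative assumptions, not the script's actual contents.

```python
import json
import time
import urllib.request

def build_payload(record):
    """Serialize one census record as a JSON request body."""
    return json.dumps(record).encode("utf-8")

def send_record(record, url="http://localhost:9696/predict"):
    """POST a single record to the prediction service and return its reply."""
    req = urllib.request.Request(
        url,
        data=build_payload(record),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def stream(records, delay=1.0):
    """Send records one by one, pausing between sends to mimic a stream."""
    for record in records:
        yield send_record(record)
        time.sleep(delay)
```

Spacing the requests out is what lets the Evidently/Prometheus monitoring stack observe drift accumulating over time rather than in one batch.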
- Provision 2 Virtual Machines and 2 storage buckets on Google Cloud Platform: one bucket is for storing the MLflow artifacts and models, and the other is for Prefect storage.
One VM is for development, experimentation and the MLflow server.
# project-vm
ssh -i ~/.ssh/id_rsa [email protected]
Clone repository into the VM and set up the project environment by running the following in the project root directory:
sudo apt update
sudo apt upgrade
sudo apt install make
make setup
make build
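The `setup` and `build` targets come from the repository's Makefile. A purely hypothetical sketch of what such targets might cover (the actual Makefile may differ):

```makefile
setup:
	pip install --upgrade pip
	pip install -r requirements.txt
	pre-commit install

build:
	docker-compose build
```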
One bucket is for storing the MLflow artifacts and models:
# start mlflow server
mlflow server --host 10.138.0.5 --backend-store-uri=sqlite:///mlflow.db --default-artifact-root=gs://project-mlflow-bucket/
# configure connection with prefect server (after configuring prefect server in the other VM)
prefect config set PREFECT_API_URL="http://35.247.100.48:4200/api"
# start jupyter lab
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser
The other VM serves the Prefect Orion server:
# prefect-vm
ssh -i ~/.ssh/id_rsa [email protected]
# configure prefect storage
prefect storage create # follow the steps to configure the other storage bucket
# configure prefect server API
prefect config set PREFECT_ORION_UI_API_URL="http://<external-ip>:4200/api"
prefect orion start --host 0.0.0.0
In the project VPC, ensure external IP addresses are created for the Virtual Machines and open the following ports to allow you to connect to the various services:
- 9696: prediction service
- 27017: MongoDB
- 3000: Evidently service
- 8085: Grafana
- 4200: Prefect
- 5000: MLflow
- 8888: Jupyter
Connect to these services locally using their respective ports:
# Connect locally to jupyter lab
http://35.247.121.140:8888/lab
# mlflow
http://35.247.121.140:5000/
# prefect orion
http://35.247.100.48:4200/
# Grafana Evidently dashboard
http://35.247.100.48:8085/
Here, the data was analyzed and the labels were found to have a class imbalance. SMOTE was applied to upsample the minority class, and the XGBoost classifier was determined to be the optimal algorithm for building the model, in comparison with Logistic Regression, Gradient Boosting Classifier and Random Forest Classifier.
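As a rough illustration of the SMOTE step: the sketch below is a minimal nearest-neighbour interpolation in NumPy, not the library implementation used in the project (in practice `imblearn.over_sampling.SMOTE` would be the natural choice).

```python
import numpy as np

def smote_upsample(X, y, minority_label, k=5, seed=0):
    """Balance the classes by generating synthetic minority samples:
    each new point lies on the segment between a minority sample and
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    # Pairwise distances within the minority class; exclude self-matches
    dists = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_needed):
        i = rng.integers(len(X_min))
        j = neighbours[i][rng.integers(min(k, len(X_min) - 1))]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new
```

Because the synthetic points interpolate between real minority samples rather than duplicating them, the upsampled class adds variety without exact copies, which helps tree ensembles like XGBoost generalize.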
The best model can be obtained either from the MLflow UI or from the Prefect training output, with its run ID stored as an environment variable:
mlflow artifacts download \
--run-id ${MODEL_RUN_ID} \
--artifact-path models_mlflow \
--dst-path ./models
XGBoost Pipeline experiment run with hyperoptimization
Sample artifact logged with mlflow default storage shown as the google cloud storage bucket
Start pipeline orchestration by creating a training agent on Orion and running the following in the project VM:
prefect agent start 7f3e5fba-334a-414d-83cd-9495cda6f3fd
Prefect Orion Dashboard showing training deployment task runs and scheduling
The prediction service can be built and started by running:
docker-compose build
docker-compose up -d
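Inside the prediction container, the serving layer is a small Flask app exposing a prediction endpoint on port 9696. A hedged sketch is shown below: the route name, request schema, and the stubbed `predict_income` function are assumptions — the real service loads the trained model from the MLflow registry instead of the stand-in rule used here.

```python
from flask import Flask, jsonify, request

app = Flask("income-prediction")

def predict_income(record):
    """Stand-in for the real model: the deployed service would featurize
    the record and call model.predict() on the MLflow-registered model."""
    return ">50K" if record.get("hours-per-week", 0) > 40 else "<=50K"

@app.route("/predict", methods=["POST"])
def predict():
    record = request.get_json()
    return jsonify({"income": predict_income(record)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9696)
```

In the deployed stack the same app would also forward each record to MongoDB and the Evidently service so that monitoring sees every prediction.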
Evidently Data Drift Monitoring
Evidently Categorical Target Drift Monitoring
Alert rule creation for getting notifications when the drift figures cross set thresholds.
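A hedged example of what such a rule could look like if expressed as a Prometheus alerting rule — the metric name `evidently:data_drift:share_drifted_features` and the 0.5 threshold are assumptions; a Grafana-managed alert can express the same condition through the UI:

```yaml
groups:
  - name: drift-alerts
    rules:
      - alert: DataDriftHigh
        expr: evidently:data_drift:share_drifted_features > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Share of drifted features has exceeded 0.5 for 10 minutes"
```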
This project was still ongoing at the time of submission for peer evaluation. Outstanding technical debt includes:
- Test coverage for integration testing.
- CI/CD pipeline with git workflows.