Note: Infrastructure is tested and built only for arm64 systems. The live application runs on an arm64 EC2 Ubuntu machine.
This repository provides an end-to-end MLOps infrastructure setup, automating the deployment of a Kubernetes cluster using KIND (Kubernetes in Docker) and deploying various services such as FastAPI, KubeRay, ArgoCD, MinIO, MLflow, and more. The infrastructure is designed to support machine learning workflows, including model training, deployment, monitoring, and serving, using tools like Ray, MLflow, and MinIO.
- Automated Cluster Setup: Automated creation and configuration of a KIND Kubernetes cluster on ARM64 chip.
- Containerized Applications: Dockerized FastAPI application and frontend for model training and inference.
- Distributed Computing: Integration with Ray and KubeRay for distributed computing and model serving.
- CI/CD with ArgoCD: Continuous integration and deployment using ArgoCD for application deployment.
- Object Storage with MinIO: MinIO deployed as an S3-compatible object storage for data and model artifacts.
- Experiment Tracking with MLflow: MLflow for tracking experiments, model registry, and model management.
- Monitoring and Alerts: Redis and frontend application for monitoring metrics and sending alerts.
- Pydantic Models: Defined Pydantic models for request validation in the FastAPI application.
- Nginx App: Single-page application to view model performance and monitor alerts.
- Kubernetes: Orchestrates containerized applications.
- KIND (Kubernetes in Docker): Runs Kubernetes clusters locally using Docker containers.
- Helm: Manages Kubernetes applications using Helm charts.
- ArgoCD: Implements GitOps continuous delivery for Kubernetes.
- Ray and KubeRay: Provides distributed computing capabilities for Python.
- FastAPI: Web framework for building APIs.
- MLflow: Manages the ML lifecycle, including experimentation, reproducibility, and deployment.
- MinIO: High-performance, S3-compatible object storage.
- Redis: Acts as a message broker for server-sent events (SSE) in the FastAPI application and as a message queue for alerts.
- Docker: Containerization platform for applications.
The `init.sh` script automates the deployment process. Below are the detailed steps performed by the script:
1. Create and Configure KIND Cluster
   - Checks if a KIND cluster named `kind-cluster` exists; if not, creates one using the configuration in `config/kind-config.yaml`.
   - Sets the Kubernetes context to the new cluster.
2. Set Up Namespaces
   - Applies the namespaces configuration from `manifests/namespaces.yaml`.
3. Configure Kubernetes Dashboard
   - Deploys the Kubernetes dashboard for cluster management.
   - Creates an admin user and retrieves the access token.
4. Add Helm Repositories
   - Adds the Helm repositories needed for ArgoCD, KubeRay, MinIO, MLflow, Nginx, and Redis.
   - Updates the Helm repositories.
5. Build and Load Docker Images
   - Builds Docker images for the FastAPI application and the frontend (nginx).
   - Loads these images into the KIND cluster.
6. Create Secrets
   - Creates Kubernetes secrets for MinIO credentials in the `fastapi` and `mlflow` namespaces.
7. Install ArgoCD
   - Installs ArgoCD via Helm in the `argocd` namespace using custom Helm values.
   - Waits for ArgoCD to be ready and forwards its service port to localhost.
8. Install MLflow
   - Installs MLflow via Helm in the `mlflow` namespace using custom Helm values.
   - Sets environment variables for MLflow to connect to MinIO.
9. Deploy Frontend Application
   - Deploys the frontend application using nginx.
   - JavaScript event listeners subscribe to the FastAPI `/webhook` endpoint for updates.
10. Install KubeRay Operator and Cluster
    - Installs the KubeRay operator and a Ray cluster via Helm in the `kuberay` namespace.
11. Install Redis
    - Installs Redis via Helm in the `db` namespace using custom Helm values.
    - Forwards the Redis service port to localhost.
12. Install MinIO
    - Creates secrets for MinIO access and root users.
    - Installs MinIO via Helm in the `minio` namespace using custom Helm values.
13. Deploy FastAPI Application Using ArgoCD
    - Applies the ArgoCD application manifest to deploy the FastAPI application from the Git repository.
The FastAPI application uses Pydantic models for request validation. The models are defined as follows:
- ScheduleTrainingRequest

  ```python
  class ScheduleTrainingRequest(BaseModel):
      minutes: int
      hyperparameters: Dict[str, Any]
  ```

- InferenceRequest

  ```python
  class InferenceRequest(BaseModel):
      input_data: List[float]
      model_version: Optional[str] = None
      retries: Optional[int] = 3
      sla_seconds: Optional[int] = 60
  ```

- TrainingRequest

  ```python
  class TrainingRequest(BaseModel):
      hyperparameters: Dict[str, Any]
  ```

- WatchModelRequest

  ```python
  class WatchModelRequest(BaseModel):
      minutes: int
  ```
These models ensure that the API endpoints receive the expected data types and structures.
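As a quick, self-contained sketch of how this validation behaves (assuming Pydantic is installed; `InferenceRequest` is reproduced from the definitions above):

```python
# Illustrative use of the InferenceRequest model defined above.
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class InferenceRequest(BaseModel):
    input_data: List[float]
    model_version: Optional[str] = None
    retries: Optional[int] = 3
    sla_seconds: Optional[int] = 60

# A valid payload: optional fields fall back to their defaults.
req = InferenceRequest(input_data=[0.1, 1.2, 2.3, 2.2])
assert req.retries == 3 and req.model_version is None

# An invalid payload is rejected before it reaches the endpoint logic.
try:
    InferenceRequest(input_data="not-a-list")
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["loc"])
```

FastAPI performs the same parsing automatically on incoming request bodies, returning a 422 response when validation fails.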
After running the `init.sh` script, various services are accessible via localhost ports:
- Kubernetes Dashboard: http://localhost:8001
- ArgoCD Dashboard: http://localhost:8080
- MLflow UI: http://localhost:5555
- Ray Dashboard: http://localhost:8265
- Redis Endpoint: localhost:6379
- MinIO Console: http://localhost:9001
- Frontend App: http://localhost:5001
- FastAPI Swagger Docs: http://localhost:8888/docs
Log into the MinIO Console and create a bucket named `mlops`. Inside the bucket, create a folder `data` and upload `iris.csv`. The complete path is hardcoded in the application: `s3://mlops/data/iris.csv`
Once a model is trained and visible in the MLflow UI, it is not automatically available for inference. Run the following command to start model serving with Ray via `serve.py`:

```shell
kubectl exec -it {any-fastapi-pod-name} -n fastapi -- python /var/task/fastapi/serve.py
```
The FastAPI application provides several endpoints:
- `/trigger_training`: Triggers a model training job.
- `/inference`: Performs model inference. (Requires running `serve.py` first.)
- `/schedule_training`: Schedules periodic model training.
- `/watch_model`: Watches the model for updates and sends alerts. Useful to know if the model hasn't been trained for `n` minutes.
- `/kill_scheduled_job/{job_id}`: Cancels a scheduled job.
- `/get_metrics`: Retrieves metrics for the latest model and averages for the past week.
- `/webhook`: Server-Sent Events endpoint for real-time updates.
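The endpoints above can be exercised with a small client sketch using only the standard library. The base URL assumes the FastAPI port-forward on `localhost:8888`; the helper names are illustrative, not part of the application:

```python
# Minimal client sketch for the FastAPI endpoints. Assumes the service is
# port-forwarded to localhost:8888; helper names here are hypothetical.
import json
from urllib import request

BASE_URL = "http://localhost:8888"

def build_payload(**fields) -> bytes:
    """Serialize a request body as the endpoints expect (JSON)."""
    return json.dumps(fields).encode("utf-8")

def post_json(path: str, body: bytes) -> dict:
    """POST a JSON body to an endpoint and parse the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires the cluster to be running
        return json.loads(resp.read())

# Example bodies matching the payloads documented for these endpoints:
training_body = build_payload(hyperparameters={"n_estimators": 100, "max_depth": 5})
inference_body = build_payload(input_data=[0.1, 1.2, 2.3, 2.2])
# post_json("/trigger_training", training_body)
# post_json("/inference", inference_body)
```

The actual calls are commented out since they need the live cluster; everything else runs standalone.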
To trigger model training, send a POST request to `/trigger_training` with the desired hyperparameters. Example payload:

```json
{
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 5
  }
}
```
To perform inference, send a POST request to `/inference` with the input data. Example payload:

```json
{
  "input_data": [0.1, 1.2, 2.3, 2.2]
}
```
To schedule periodic training, send a POST request to `/schedule_training` with the interval in minutes and the hyperparameters. Example payload:

```json
{
  "minutes": 10,
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 5
  }
}
```
To monitor the model for updates, send a POST request to `/watch_model` with the interval in minutes. Example payload:

```json
{"minutes": 10}
```
To retrieve model metrics, send a GET request to `/get_metrics`.
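The aggregation behind a `/get_metrics`-style response (latest model's metrics plus past-week averages) can be sketched as below. The run records and field names are assumptions for illustration; the real application reads its runs from MLflow:

```python
# Hypothetical sketch of "latest metrics plus past-week average".
# Run records and the "accuracy" field are illustrative assumptions.
from datetime import datetime, timedelta

def summarize(runs, now):
    """runs: list of {"timestamp": datetime, "accuracy": float}, oldest first."""
    week_ago = now - timedelta(days=7)
    recent = [r["accuracy"] for r in runs if r["timestamp"] >= week_ago]
    return {
        "latest_accuracy": runs[-1]["accuracy"] if runs else None,
        "week_avg_accuracy": sum(recent) / len(recent) if recent else None,
    }

now = datetime(2024, 1, 8)
runs = [
    {"timestamp": now - timedelta(days=10), "accuracy": 0.80},  # outside the window
    {"timestamp": now - timedelta(days=3), "accuracy": 0.90},
    {"timestamp": now - timedelta(days=1), "accuracy": 0.94},
]
metrics = summarize(runs, now)
```

Only the two runs inside the seven-day window contribute to the average; the ten-day-old run affects neither figure.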
Ensure you have the following tools installed (instructions below cover macOS on Apple Silicon and arm64 Ubuntu):
- Docker: For running containers.
- Kubernetes CLI (kubectl): For interacting with Kubernetes clusters.
- Helm: Package manager for Kubernetes.
- Kind: Tool for running local Kubernetes clusters using Docker containers.
- Git: Version control system.
macOS

```shell
brew install kubectl helm kind docker git
```
Ubuntu OS
- Install Docker
- Install `kubectl`

  ```shell
  curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
  sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
  # Alternatively, without sudo:
  # chmod +x kubectl && mv ./kubectl ~/.local/bin/kubectl
  ```
- Install KIND

  ```shell
  [ $(uname -m) = aarch64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-arm64
  chmod +x ./kind
  sudo mv ./kind /usr/local/bin/kind
  ```
- Install Helm (note: use the arm64 build to match the rest of this setup)

  ```shell
  wget https://get.helm.sh/helm-v3.16.2-linux-arm64.tar.gz
  tar -zxvf helm-v3.16.2-linux-arm64.tar.gz
  sudo mv linux-arm64/helm /usr/local/bin/helm
  ```
```shell
git clone https://github.com/yourusername/yourrepository.git
cd yourrepository
chmod +x config/init.sh
./config/init.sh
```
After the script completes, you can access the services as described in the How To Use The App section.
- Port Forwarding: The script sets up port forwarding for various services. Ensure that the ports are not in use by other applications. See `outputs/urls.txt`.
- Credentials: Access credentials for ArgoCD, MLflow, and MinIO are stored in `outputs/credentials.txt`.
- Cleanup: To delete the cluster and its resources, run:

  ```shell
  kind delete cluster --name kind-cluster
  ```
FastAPI Application
- Purpose: Acts as the main API interface for triggering training jobs, performing inference, scheduling tasks, and monitoring models.
- Interconnections:
- Ray: Utilizes Ray for distributed model training and inference tasks.
- MLflow: Interacts with MLflow for experiment tracking and model registry.
- MinIO: Loads data and stores model artifacts in MinIO (S3-compatible storage).
- Redis: Uses Redis as a message queue for server-sent events (SSE) to notify the frontend application.
Ray and KubeRay
- Purpose: Provides distributed computing capabilities for efficient model training and inference.
- Interconnections:
- FastAPI: The FastAPI application submits training and inference tasks to the Ray cluster.
- Serve: Ray Serve is used to deploy and serve models for inference.
MLflow
- Purpose: Tracks machine learning experiments, logs metrics, and manages model versions.
- Interconnections:
- FastAPI: The application logs metrics and models to MLflow.
- MinIO: Stores model artifacts in MinIO.
- Ray: The training jobs running on Ray interact with MLflow for logging.
MinIO
- Purpose: Acts as an S3-compatible object storage for datasets and model artifacts.
- Interconnections:
- FastAPI: Loads datasets from MinIO and saves model artifacts.
- MLflow: Uses MinIO as the backend storage for artifacts.
Redis
- Purpose: Serves as a message queue for server-sent events (SSE) notifications.
- Interconnections:
- FastAPI: Publishes messages to Redis for events like model training completion or alerts.
- Frontend: Subscribes to Redis to receive and display alerts and metrics.
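The publish/subscribe flow between the API and the SSE endpoint can be illustrated with an in-process queue as a stand-in for Redis (the real application uses Redis as the broker; names here are hypothetical):

```python
# In-process stand-in for the Redis alert queue, illustrating the
# publish -> queue -> SSE flow. The real broker is Redis.
import json
import queue

alert_queue = queue.Queue()

def publish_alert(message: str) -> None:
    """What the API does on events like training completion: push an alert."""
    alert_queue.put(json.dumps({"alert": message}))

def sse_events():
    """What a /webhook-style SSE endpoint does: drain the queue and yield
    Server-Sent Events frames ("data: <json>\n\n")."""
    while not alert_queue.empty():
        yield f"data: {alert_queue.get()}\n\n"

publish_alert("Model not trained for 10 minutes")
frames = list(sse_events())
```

In the deployed system the queue lives in Redis, so alerts survive across FastAPI pods and the frontend's EventSource connection can be served by any replica.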
ArgoCD
- Purpose: Implements GitOps continuous delivery, automating the deployment of the FastAPI application from the Git repository.
- Interconnections:
- Kubernetes: Manages the deployment of the FastAPI application to the cluster. The manifest is pulled from Git: https://github.com/purijs/mlops
- Git Repository: Monitors the repository for changes and synchronizes the application state.
Frontend Application
- Purpose: Provides a web interface for visualizing metrics and receiving alerts.
- Interconnections:
- Redis: Subscribes to messages for displaying alerts.
- FastAPI: May interact with FastAPI endpoints to fetch metrics.
Sample Git workflows are included to build the Docker images and integrate with the ArgoCD pipeline.
- ArgoCD showing FastAPI deployment
- Model artifacts on Minio (S3)
- Frontend Alert: Model not trained for `x` minutes
- Frontend Alert: Model achieves more than 90% accuracy
- MLflow Model Logging
- MLflow Experiment Logging
- Ray Jobs
- Ray Serve (Model Inference)