A disinformation detection project. This repository documents how to deploy the service with several technologies, how to test and benchmark it, and how to apply several machine learning techniques.
Completed as course work for the Projector course "Machine Learning in Production".
- Projector Course Work: Disinformation Detection Service
- Table of Contents
- Prerequisites
- Minio setup
- Data
- Model training
- Model optimization
- Streamlit
- Model serving
- Tests
- Benchmarks
- Pod autoscaling
- Kafka
- Data drift detection
This guide assumes that you have basic knowledge of the following technologies:
- Docker
- GitHub Actions
- Kubernetes
brew install minio/stable/minio
minio server --console-address :9001 ~/minio # path to persistent local storage + run on custom port
docker run \
-p 9002:9002 \
--name minio \
-v ~/minio:/data \
quay.io/minio/minio server /data --console-address ":9002"
kubectl create -f deployment/minio.yml
Install DVC
brew install dvc
Initialize DVC in the repo
dvc init --subdir
git status
git commit -m "init DVC"
Move the data file into place, add it to DVC, and commit the DVC data config
dvc add ./data/data.csv
git add data/.gitignore data/data.csv.dvc
git commit -m "create data"
Add remote data storage and push the DVC remote config (ensure that the bucket already exists)
dvc remote add -d minio s3://ml-data
dvc remote modify minio endpointurl $AWS_ENDPOINT
git add .dvc/config
git commit -m "configure remote"
git push
Upload data
export AWS_ACCESS_KEY_ID='...'
dvc push
docker pull heartexlabs/label-studio:latest
docker run -it -p 8080:8080 -v `pwd`/mydata:/label-studio/data heartexlabs/label-studio:latest
docker build -t model-training . -f job/Dockerfile
docker run -it model-training
Run pruning:
python -m src.model.pruning
Run distillation:
python -m src.model.distilation
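The contents of `src.model.pruning` are not shown in this guide. As a rough illustration of the general idea only, the sketch below applies unstructured magnitude pruning to a flat list of weights: the fraction of weights with the smallest absolute values is set to zero. The function name `magnitude_prune` and the `sparsity` parameter are hypothetical and are not taken from the repository.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    print(magnitude_prune([0.1, -0.5, 0.05, 2.0], sparsity=0.5))
```

Real pruning of a neural network usually operates per layer on tensors and is followed by fine-tuning; this sketch only shows the selection rule.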
streamlit run src/serving/streamlit.py
Deploy k8s:
kubectl create -f deployment/app-ui.yml
kubectl port-forward --address 0.0.0.0 svc/app-ui 8080:8080
Deploy k8s:
kubectl create -f deployment/app-fasttext.yml
kubectl port-forward --address 0.0.0.0 svc/app-fasttext 8090:8090
kubectl apply -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-crds.yaml
kubectl apply -n ambassador -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-kind.yaml
kubectl wait --timeout=180s -n ambassador --for=condition=deployed ambassadorinstallations/ambassador
kubectl create namespace seldon-system
helm install seldon-core seldon-core-operator --version 1.15.1 --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --set ambassador.enabled=true --namespace seldon-system
Deploy k8s:
kubectl create -f deployment/seldon-custom.yaml
Deploy k8s:
kubectl create -f deployment/kserve.yaml
kubectl get inferenceservice custom-model
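Once the inference service reports ready, it can be queried over HTTP. The sketch below builds a request in the KServe V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` list). The model name `custom-model` comes from the command above; the base URL and the assumption that the model accepts a list of raw text strings are illustrative and may not match the deployed predictor.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model_name: str, texts: list) -> urllib.request.Request:
    """Build a KServe V1 `:predict` request without sending it."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    # Assumed payload shape: one raw text string per instance.
    body = json.dumps({"instances": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local address; adjust to wherever the service is exposed.
    req = build_predict_request("http://localhost:8080", "custom-model", ["some news text"])
    print(req.full_url)
    # Send with urllib.request.urlopen(req) once the service is reachable.
```

When the service sits behind an ingress gateway, a `Host` header matching the inference service hostname is typically also needed.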
Run tests
pytest app/tests/
| Format | Avg Write Time (s) | Avg Read Time (s) | File Size after Write (MB) |
|---|---|---|---|
| CSV | 0.906960 | 0.174510 | 5.649742 |
| JSON | 0.386252 | 1.161783 | 16.038124 |
| PARQUET | 0.061314 | 0.016811 | 1.507380 |
| ORC | 0.167490 | 0.016776 | 6.998336 |
CSV has the slowest write time of the four formats (about 0.91 s) and a moderate file size (about 5.6 MB).
JSON writes faster than CSV but slower than PARQUET and ORC; it has the slowest read time (about 1.16 s) and the largest file size (about 16.0 MB).
PARQUET has the fastest write time, a read time essentially tied with ORC for the fastest, and the smallest file size (about 1.5 MB).
ORC has moderate write times and the fastest read time by a negligible margin, but its file size (about 7.0 MB) is larger than both CSV and PARQUET.
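The benchmark script that produced these numbers is not shown here. A minimal sketch of the same kind of measurement, using only the Python standard library, could look like the following. It covers CSV and JSON only, since Parquet and ORC need a third-party library such as pandas with pyarrow. The function name `benchmark_formats`, the sample data, and the use of 1 MB = 10^6 bytes are all assumptions.

```python
import csv
import json
import os
import tempfile
import time


def benchmark_formats(rows: list, repeats: int = 3) -> dict:
    """Time CSV and JSON write/read round trips for a list of flat dicts."""
    fieldnames = list(rows[0].keys())

    def write_csv(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    def read_csv(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    def write_json(path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rows, f)

    def read_json(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    formats = {"CSV": (write_csv, read_csv), "JSON": (write_json, read_json)}
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (write, read) in formats.items():
            path = os.path.join(tmp, f"data.{name.lower()}")
            write_times, read_times = [], []
            for _ in range(repeats):
                start = time.perf_counter()
                write(path)
                write_times.append(time.perf_counter() - start)
                start = time.perf_counter()
                read(path)
                read_times.append(time.perf_counter() - start)
            results[name] = {
                "avg_write_s": sum(write_times) / repeats,
                "avg_read_s": sum(read_times) / repeats,
                "size_mb": os.path.getsize(path) / 1e6,  # assumed: 1 MB = 10^6 bytes
            }
    return results


if __name__ == "__main__":
    # Synthetic rows; the real benchmark presumably used the project dataset.
    sample = [{"id": i, "text": f"headline {i}", "label": i % 2} for i in range(10_000)]
    for fmt, stats in benchmark_formats(sample).items():
        print(fmt, stats)
```

Absolute timings depend on the machine, the library used, and the data, so results from this sketch are not comparable with the table above.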
locust -f benchmarks/load_test.py --host=http://localhost:9933 --users 50 --spawn-rate 10 --autostart --run-time 600s
Install the metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Run from config
kubectl create -f deployment/app-fastapi-scaling.yml
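Assuming `deployment/app-fastapi-scaling.yml` defines a HorizontalPodAutoscaler (its contents are not shown here), the replica count follows the documented Kubernetes rule `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, with no change when the ratio is within a tolerance of 1 (0.1 by default). The sketch below reproduces that arithmetic; the `min_replicas` and `max_replicas` defaults are hypothetical values, not ones read from the config.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,   # hypothetical bound
    max_replicas: int = 10,  # hypothetical bound
    tolerance: float = 0.1,  # Kubernetes default tolerance
) -> int:
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 pods at 90% average CPU with a 50% target.
    print(desired_replicas(2, current_metric=90, target_metric=50))
```

The real controller also accounts for pod readiness, missing metrics, and stabilization windows, which this sketch ignores.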
Install Kafka
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install zookeeper bitnami/zookeeper --set replicaCount=1 --set auth.enabled=false --set allowAnonymousLogin=true --set persistence.enabled=false --version 11.0.0
helm install kafka bitnami/kafka --set zookeeper.enabled=false --set replicaCount=1 --set persistence.enabled=false
# eventing
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.9.7/eventing-crds.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.9.7/eventing-core.yaml
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka/releases/download/knative-v1.9.1/source.yaml
Run deployment
kubectl apply -f deployment/kafka-infra.yml
kubectl port-forward $(kubectl get pod --selector="app=minio" --output jsonpath='{.items[0].metadata.name}') 9000:9000
mc config host add myminio http://127.0.0.1:9000 miniominio miniominio
mc mb myminio/input
mc mb myminio/output
mc admin config set myminio notify_kafka:1 tls_skip_verify="off" queue_dir="" queue_limit="0" sasl="off" sasl_password="" sasl_username="" tls_client_auth="0" tls="off" client_tls_cert="" client_tls_key="" brokers="kafka-headless.default.svc.cluster.local:9092" topic="test" version=""
mc admin service restart myminio
mc event add myminio/input arn:minio:sqs::1:kafka -p --event put --suffix .json
kubectl create -f deployment/kafka-infra.yml
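With the bucket notification in place, each `.json` upload to `input` should produce a message on the `test` topic. The sketch below shows one way a consumer could pull the bucket and object key out of such a message, assuming MinIO's S3-style notification layout (a top-level `Records` list with `s3.bucket.name` and a URL-encoded `s3.object.key`). The function name `extract_objects` is hypothetical, and the sample payload is hand-written for illustration, not captured from a real run.

```python
import json
from urllib.parse import unquote_plus


def extract_objects(message: bytes) -> list:
    """Return (event_name, bucket, key) tuples from an S3-style notification."""
    event = json.loads(message)
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys are URL-encoded in S3-style notifications.
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        objects.append((record.get("eventName"), bucket, key))
    return objects


if __name__ == "__main__":
    # Hand-written sample payload (illustrative only).
    sample = json.dumps({
        "Records": [{
            "eventName": "s3:ObjectCreated:Put",
            "s3": {"bucket": {"name": "input"}, "object": {"key": "my%20file.json"}},
        }]
    }).encode("utf-8")
    print(extract_objects(sample))
```

In a real consumer the `message` bytes would come from a Kafka client library reading the `test` topic.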
python -m src.monitoring.drift
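What `src.monitoring.drift` computes is not described here. One common way to detect drift in a numeric feature (for example text length or a model score) is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDFs of a reference sample and a current sample. The sketch below is a plain-Python version of that statistic; the function names and the `threshold` value of 0.2 are assumptions chosen for illustration, not the repository's method.

```python
from bisect import bisect_right


def ks_statistic(reference: list, current: list) -> float:
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drift(reference: list, current: list, threshold: float = 0.2) -> bool:
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
    print(has_drift([1, 2, 3, 4], [3, 4, 5, 6]))
```

A fixed threshold is a simplification; a statistical test would instead derive a p-value from the statistic and the sample sizes.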