From f83542138cea2a7acb5be58a5403b7c930231089 Mon Sep 17 00:00:00 2001
From: Yurii Havrylko
Date: Sun, 25 Feb 2024 21:19:20 +0100
Subject: [PATCH] improve readme

---
 README.md | 263 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 145 insertions(+), 118 deletions(-)

diff --git a/README.md b/README.md
index 1495b51..3c878b7 100644
--- a/README.md
+++ b/README.md
@@ -1,189 +1,223 @@
-## Projector course work
-Skeleton for project on projector course
-
-### Docker
-
-Build
+## Projector Course Work: Disinformation Detection Service
+A project aimed at disinformation detection. This repository outlines the steps involved in deploying the service using various technologies, testing and benchmarking, as well as implementing various machine learning methodologies.
+
+Done during Projector course [Machine Learning in Production](https://prjctr.com/course/machine-learning-in-production)
+
+## Table of Contents
+- [Projector Course Work: Disinformation Detection Service](#projector-course-work-disinformation-detection-service)
+- [Table of Contents](#table-of-contents)
+- [Prerequisites](#prerequisites)
+- [Minio setup](#minio-setup)
+- [Data](#data)
+  - [DVC](#dvc)
+  - [Label studio](#label-studio)
+- [Model training](#model-training)
+- [Model optimization](#model-optimization)
+- [Streamlit](#streamlit)
+- [Model serving](#model-serving)
+  - [Fast API](#fast-api)
+  - [Seldon](#seldon)
+  - [Kserve](#kserve)
+- [Tests](#tests)
+- [Benchmarks](#benchmarks)
+  - [File formats](#file-formats)
+  - [Load testing](#load-testing)
+- [POD autoscaling](#pod-autoscaling)
+- [Kafka](#kafka)
+- [Data drift detection](#data-drift-detection)
+
+
+## Prerequisites
+This guide assumes that you have basic knowledge in the following technologies:
+- Docker
+- GitHub Actions
+- Kubernetes
+
+## Minio setup
+Mac/Local
 ```
-docker build --tag yuriihavrylko/prjctr:latest .
+brew install minio/stable/minio
+
+minio server --console-address :9001 ~/minio # path to persistent local storage + run on custom port
 ```
-Push
-Build
+Docker
+
 ```
-docker push yuriihavrylko/prjctr:latest
+docker run \
+    -p 9002:9002 \
+    --name minio \
+    -v ~/minio:/data \
+    -e "MINIO_ROOT_USER=ROOTNAME" \
+    -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
+    quay.io/minio/minio server /data --console-address ":9002"
 ```
-DH Images:
-![Alt text](assets/images.png)
+Kubernetes
-### GH Actions:
+```
+kubectl create -f deployment/minio.yml
+```
-Works on push to master/feature*
-![Alt text](assets/actions.png)
+## Data
+### DVC
-### Streamlit
+Install DVC
-Run:
 ```
-streamlit run src/serving/streamlit.py
+brew install dvc
 ```
-![Alt text](assets/streamlit.png)
+Init in repo
-Deploy k8s:
 ```
-kubectl create -f deployment/app-ui.yml
-kubectl port-forward --address 0.0.0.0 svc/app-ui.yml 8080:8080
+dvc init --subdir
+git status
+git commit -m "init DVC"
 ```
-Deploy k8s:
+Move file with data and add to DVC, commit DBV data config
+
 ```
-kubectl create -f deployment/app-ui.yml
-kubectl port-forward --address 0.0.0.0 svc/app-ui.yml 8080:8080
+dvc add ./data/data.csv
+git add data/.gitignore data/data.csv.dvc
+git commit -m "create data"
 ```
+Add remote data storage and push DVC remote config
+(ensure that bucket already created)
-### Fast API
-
-Postman
+```
+dvc remote add -d minio s3://ml-data
+dvc remote modify minio endpointurl [$AWS_ENDPOINT](http://10.0.0.6:9000)
-![Alt text](assets/fastapi.png)
+git add .dvc/config
+git commit -m "configure remote"
+git push
+```
+Upload data
+```
+export AWS_ACCESS_KEY_ID='...'
+export AWS_SECRET_ACCESS_KEY='...'
+dvc push
+```
+### Label studio
-Deploy k8s:
 ```
-kubectl create -f deployment/app-fasttext.yml
-kubectl port-forward --address 0.0.0.0 svc/app-fasttext 8090:8090
+docker pull heartexlabs/label-studio:latest
+docker run -it -p 8080:8080 -v `pwd`/mydata:/label-studio/data heartexlabs/label-studio:latest
 ```
-### Seldon
+![Alt text](assets/labeling.png)
-Instalation
-```
-kubectl apply -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-crds.yaml
-kubectl apply -n ambassador -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-kind.yaml
-kubectl wait --timeout=180s -n ambassador --for=condition=deployed ambassadorinstallations/ambassador
-kubectl create namespace seldon-system
+## Model training
-helm install seldon-core seldon-core-operator --version 1.15.1 --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --set ambassador.enabled=true --namespace seldon-system
+Build
+```
+docker build -t model-training . -f job/Dockerfile
 ```
-Deploy k8s:
+Run
 ```
-kubectl create -f deployment/seldon-custom.yaml
+docker run -it model-training
 ```
-### Kserve
+## Model optimization
-Deploy k8s:
+Run pruning:
 ```
-kubectl create -f deployment/kserve.yaml
-kubectl get inferenceservice custom-model
+python -m src.model.pruning
 ```
-
-### Load testing
-
-![Alt text](assets/locust.png)
+Run distillation:
 ```
-locust -f benchmarks/load_test.py --host=http://localhost:9933 --users 50 --spawn-rate 10 --autostart --run-time 600s
+python -m src.model.distilation
+```
-### DVC
-Install DVC
+## Streamlit
+Run:
 ```
-brew install dvc
+streamlit run src/serving/streamlit.py
 ```
-Init in repo
+![Alt text](assets/streamlit.png)
+Deploy k8s:
 ```
-dvc init --subdir
-git status
-git commit -m "init DVC"
+kubectl create -f deployment/app-ui.yml
+kubectl port-forward --address 0.0.0.0 svc/app-ui.yml 8080:8080
 ```
-Move file with data and add to DVC, commit DBV data config
-
+Deploy k8s:
 ```
-dvc add ./data/data.csv
-git add data/.gitignore data/data.csv.dvc
-git commit -m "create data"
+kubectl create -f deployment/app-ui.yml
+kubectl port-forward --address 0.0.0.0 svc/app-ui.yml 8080:8080
 ```
-Add remote data storage and push DVC remote config
-(ensure that bucket already created)
+## Model serving
-```
-dvc remote add -d minio s3://ml-data
-dvc remote modify minio endpointurl [$AWS_ENDPOINT](http://10.0.0.6:9000)
+### Fast API
-git add .dvc/config
-git commit -m "configure remote"
-git push
-```
+Postman
-Upload data
-```
-export AWS_ACCESS_KEY_ID='...'
-export AWS_SECRET_ACCESS_KEY='...'
-dvc push
+![Alt text](assets/fastapi.png)
-### Label studio
+Deploy k8s:
 ```
-docker pull heartexlabs/label-studio:latest
-docker run -it -p 8080:8080 -v `pwd`/mydata:/label-studio/data heartexlabs/label-studio:latest
+kubectl create -f deployment/app-fasttext.yml
+kubectl port-forward --address 0.0.0.0 svc/app-fasttext 8090:8090
 ```
-![Alt text](assets/labeling.png)
+### Seldon
+Installation
-### Minio setup
-Mac/Local
 ```
-brew install minio/stable/minio
+kubectl apply -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-crds.yaml
+kubectl apply -n ambassador -f https://github.com/datawire/ambassador-operator/releases/latest/download/ambassador-operator-kind.yaml
+kubectl wait --timeout=180s -n ambassador --for=condition=deployed ambassadorinstallations/ambassador
-minio server --console-address :9001 ~/minio # path to persistent local storage + run on custom port
-```
+kubectl create namespace seldon-system
-Docker
+helm install seldon-core seldon-core-operator --version 1.15.1 --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --set ambassador.enabled=true --namespace seldon-system
+```
+Deploy k8s:
 ```
-docker run \
-    -p 9002:9002 \
-    --name minio \
-    -v ~/minio:/data \
-    -e "MINIO_ROOT_USER=ROOTNAME" \
-    -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
-    quay.io/minio/minio server /data --console-address ":9002"
+kubectl create -f deployment/seldon-custom.yaml
 ```
-Kubernetes
+### Kserve
+
+Deploy k8s:
 ```
-kubectl create -f deployment/minio.yml
+kubectl create -f deployment/kserve.yaml
+kubectl get inferenceservice custom-model
 ```
-### Tests
+
+## Tests
 Run tests
 ```
 pytest app/tests/
 ```
-### Benchmarks
+## Benchmarks
-Fileformats
+### File formats
 ![Alt text](assets/format_benchmark.png)
@@ -203,38 +237,33 @@ JSON format demonstrates faster write times but slower read times compared to ot
 PARQUET format showcases the fastest write times and relatively fast read times, with a smaller file size after write compared to CSV and JSON.
 ORC format exhibits moderate write times and the smallest file size after write among the tested formats, with efficient read times.
-=======
-### POD autoscaling
-Install metric service
-```
-kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
-kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
-```
+### Load testing
-Run from config
+![Alt text](assets/locust.png)
 ```
-kubectl create -f deployment/app-fastapi-scaling.yml
+locust -f benchmarks/load_test.py --host=http://localhost:9933 --users 50 --spawn-rate 10 --autostart --run-time 600s
 ```
+## POD autoscaling
-
-
-### Model optimization
-
-Run pruning:
+Install metric service
 ```
-python -m src.model.pruning
+kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
+kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
 ```
-Run distilation:
+Run from config
 ```
-python -m src.model.distilation
+kubectl create -f deployment/app-fastapi-scaling.yml
 ```
-### Kafka
+
+## Kafka
 Install kafka
 ```
@@ -268,11 +297,9 @@ mc admin service restart myminio
 mc event add myminio/input arn:minio:sqs::1:kafka -p --event put --suffix .json
 kubectl create -f deployment/kafka-infra.yml
-
-
 ```
-### Data drift detetion
+## Data drift detection
 ```
 python -m src.monitoring.drift