Merge branch 'master' into feature/l9t2-fastapi

yuriihavrylko · Feb 11, 2024 · f667dab
2 parents 0eca8ff + 3f435b9
Showing 11 changed files with 205 additions and 0 deletions.
diff --git a/.dvc/.gitignore b/.dvc/.gitignore
@@ -0,0 +1,3 @@
+/config.local
+/tmp
+/cache
diff --git a/.dvc/config b/.dvc/config
@@ -0,0 +1,5 @@
+[core]
+    remote = minio
+['remote "minio"']
+    url = s3://ml-data
+    endpointurl = http://10.0.0.6:9000
diff --git a/.dvcignore b/.dvcignore
@@ -0,0 +1,3 @@
+# Add patterns of files dvc should ignore, which could improve
+# the performance. Learn more at
+# https://dvc.org/doc/user-guide/dvcignore
diff --git a/README.md b/README.md
@@ -22,6 +22,7 @@ DH Images:
 Works on push to master/feature*
 ![Alt text](assets/actions.png)
 
+
 ### Streamlit 
 
 Run:
@@ -31,21 +32,105 @@ streamlit run src/serving/streamlit.py
 
 ![Alt text](assets/streamlit.png)
 
+
 Deploy k8s:
 ```
 kubectl create -f deployment/app-ui.yml
 kubectl port-forward --address 0.0.0.0 svc/app-ui 8080:8080
 ```
 
+
 ### Fast API 
 
 Postman
 
 ![Alt text](assets/fastapi.png)
 
 
+
 Deploy k8s:
 ```
 kubectl create -f deployment/app-fasttext.yml
 kubectl port-forward --address 0.0.0.0 svc/app-fasttext 8090:8090
 ```
+
+### DVC
+
+Install DVC
+
+```
+brew install dvc
+```
+
+Init in repo
+
+```
+dvc init --subdir
+git status
+git commit -m "init DVC"
+```
+
+Move the data file into place, add it to DVC, and commit the DVC data config
+
+```
+dvc add ./data/data.csv
+git add data/.gitignore data/data.csv.dvc
+git commit -m "create data"
+```
+
+Add remote data storage and push the DVC remote config
+(ensure that the bucket already exists)
+
+```
+dvc remote add -d minio s3://ml-data
+dvc remote modify minio endpointurl http://10.0.0.6:9000
+
+git add .dvc/config
+git commit -m "configure remote"
+git push 
+```
+
+Upload data
+```
+export AWS_ACCESS_KEY_ID='...'
+export AWS_SECRET_ACCESS_KEY='...'
+dvc push
+```
+
+### Label studio
+
+```
+docker pull heartexlabs/label-studio:latest
+docker run -it -p 8080:8080 -v `pwd`/mydata:/label-studio/data heartexlabs/label-studio:latest
+```
+
+![Alt text](assets/labeling.png)
+
+
+### Minio setup
+Mac/Local
+```
+brew install minio/stable/minio
+
+minio server --console-address :9001 ~/minio # path to persistent local storage + run on custom port
+```
+
+Docker
+
+```
+docker run \
+   -p 9000:9000 -p 9002:9002 \
+   --name minio \
+   -v ~/minio:/data \
+   -e "MINIO_ROOT_USER=ROOTNAME" \
+   -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
+   quay.io/minio/minio server /data --console-address ":9002"
+```
+
+Kubernetes
+
+```
+kubectl create -f deployment/minio.yml
+```
+
+
diff --git a/app/requirements-dev.txt b/app/requirements-dev.txt
@@ -8,3 +8,4 @@ accelerate==0.25.0
 datasets==2.16.1
 wandb==0.16.1
 httpx==0.26.0
+ipykernel==6.28.0
diff --git a/assets/labeling.png b/assets/labeling.png
diff --git a/data/.gitignore b/data/.gitignore
@@ -0,0 +1 @@
+/data.csv
diff --git a/data/data.csv.dvc b/data/data.csv.dvc
@@ -0,0 +1,5 @@
+outs:
+- md5: 7ec83b215d1790bedaf458a1690370e3
+  size: 25144581
+  hash: md5
+  path: data.csv
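
The `.dvc` file above records the MD5 digest and byte size of the tracked file. A minimal sketch of checking a local `data/data.csv` against those values is shown below. It assumes that `hash: md5` means a plain MD5 over the raw file bytes and that the file keeps the flat single-output layout shown here; the helper names `read_dvc_out`, `file_md5`, and `matches_dvc` are hypothetical and not part of DVC.

```python
import hashlib
from pathlib import Path


def read_dvc_out(dvc_file):
    """Read the single `outs` entry of a flat .dvc file into a dict.

    Hand-rolled for the simple layout shown in this diff; a real YAML
    parser would be more robust for anything more complex.
    """
    entry = {}
    for line in Path(dvc_file).read_text().splitlines():
        # Drop indentation and the leading "- " of the list item.
        stripped = line.strip().lstrip("- ").strip()
        key, sep, value = stripped.partition(":")
        if not sep or not value.strip():
            continue  # skips blank lines and the bare "outs:" header
        entry[key.strip()] = value.strip()
    return entry


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_dvc(dvc_file):
    """Return True if the tracked file's size and MD5 match the .dvc entry."""
    entry = read_dvc_out(dvc_file)
    data_path = Path(dvc_file).parent / entry["path"]
    return (
        data_path.stat().st_size == int(entry["size"])
        and file_md5(data_path) == entry["md5"]
    )
```

Calling `matches_dvc("data/data.csv.dvc")` after `dvc pull` would be one way to confirm that the downloaded file is the version recorded in this commit. `dvc status` is the supported way to do the same check.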
diff --git a/deployment/minio.yml b/deployment/minio.yml
@@ -0,0 +1,38 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: minio-deployment
+spec:
+  selector:
+    matchLabels:
+      app: minio
+  strategy:
+    type: Recreate
+  template:
+    metadata:
+      labels:
+        # Label is used as selector in the service.
+        app: minio
+    spec:
+      volumes:
+      - name: storage
+        persistentVolumeClaim:
+          claimName: minio-pv-claim
+      containers:
+      - name: minio
+        image: quay.io/minio/minio:latest
+        args:
+        - server
+        - /storage
+        env:
+        # Minio access key and secret key
+        - name: MINIO_ACCESS_KEY
+          value: "minio"
+        - name: MINIO_SECRET_KEY
+          value: "minio123"
+        ports:
+        - containerPort: 9000  # MinIO's default API port (no --address is passed)
+          hostPort: 9000
+        volumeMounts:
+        - name: storage
+          mountPath: "/storage"
diff --git a/experiments/train.ipynb b/experiments/train.ipynb
diff --git a/modelcard.md b/modelcard.md
@@ -0,0 +1,63 @@
+---
+language: en
+tags:
+- bert
+license: apache-2.0
+datasets:
+- GonzaloA/fake_news
+---
+
+# BERT fake news classification model
+
+A pretrained English-language model based on the uncased version of BERT, fine-tuned for binary classification.
+
+
+### How to use
+
+You can load the tokenizer and model directly for sequence classification:
+
+```python
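+from transformers import BertForSequenceClassification, BertTokenizer
+tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)  # PATH: local model directory
+bert_model = BertForSequenceClassification.from_pretrained(PATH, local_files_only=True)
+
+# run inference
+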
+```
+With the transformers pipeline:
+
+```python
+from transformers import pipeline
+text_classification_pipeline = pipeline(
+    "text-classification",
+    model=PATH,
+    tokenizer=PATH,
+    return_all_scores=True
+)
+```
+
+
+## Training data
+
+The model starts from [bert-base-uncased](https://huggingface.co/bert-base-uncased) and was fine-tuned on the GonzaloA/fake_news dataset, which consists of ~25,000 news items labeled as fake or real.
+For training, 10k samples were randomly selected and split in an 80:20 ratio.
+
+## Training procedure
+
+### Preprocessing
+
+The texts are tokenized using the BERT tokenizer.
+
+### Training
+
+The model was trained on two T4 GPUs.
+
+## Evaluation results
+
+
+| Epoch | Training Loss | Validation Loss | Accuracy |
+|-------|---------------|-----------------|----------|
+| 1     | 0.074000      | 0.027787        | 0.986500 |
+| 2     | 0.032600      | 0.010920        | 0.995000 |
+| 3     | 0.010100      | 0.002739        | 0.999500 |
+
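
The model card leaves the inference step as a placeholder comment. A minimal sketch of what that step could look like follows, assuming `torch` and `transformers` are installed and that `path` points to a local directory holding the fine-tuned model. The helper names `softmax`, `top_class`, and `classify` are hypothetical. The mapping from class index to "fake" or "real" is not stated in the card, so the sketch returns only the index and its probability.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_class(probs):
    """Return (index, probability) of the most likely class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def classify(text, path):
    """Classify one text with a locally stored fine-tuned BERT model.

    Heavy imports live inside the function so the helpers above can be
    used without torch/transformers installed.
    """
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(path, local_files_only=True)
    model = BertForSequenceClassification.from_pretrained(path, local_files_only=True)
    model.eval()

    # BERT accepts at most 512 tokens, so longer articles are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_class(softmax(logits))
```

A call such as `classify("Some headline and article body", PATH)` would return a pair like `(class_index, probability)`. Which index corresponds to which label should be checked against the dataset before it is used.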