In this project, we pre-trained and fine-tuned BERT, RoBERTa, and GPT-2 from the Hugging Face library to classify medical datasets. The specifics are described in the experiment sections below. The corresponding Jupyter notebooks can be found in this repo.
The datasets used are allpatients, zreddit, smalldataset, Reddit All, and Twitter.
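As a minimal sketch of the fine-tuning setup (the file names, the `text`/`label` column names, and the hyperparameters are placeholders, not the exact values from the notebooks):

```python
# Minimal fine-tuning sketch using the Hugging Face Trainer API.
# "train.csv"/"test.csv" and the 2-label setup (patient vs. control)
# are assumptions for illustration, not the project's exact files.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "test": "test.csv"})

def tokenize(batch):
    # Expects a "text" column; labels must live in a "label" column.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"]).train()
```

The same pattern applies to RoBERTa and GPT-2 by swapping the checkpoint name; only the tokenizer/model pair changes.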
Out-of-domain experiments (trained on one dataset, evaluated on another; see the continued-pretraining sketch after the table):
Experiment | Model | Continued pretraining | Fine-tuning | Regression (severity) / Classification
---|---|---|---|---
Experiment 1 | BERT/RoBERTa/GPT-2 | Reddit dataset | - | dataset1/dataset2/Twitter
Experiment 2 | BERT/RoBERTa/GPT-2 | zReddit patients only (without labels) | - | dataset1/dataset2/Twitter
Experiment 3 | BERT/RoBERTa/GPT-2 | Reddit All patients (without labels) | - | dataset1/dataset2/Twitter
Experiment 4 | BERT/RoBERTa/GPT-2 | - | dataset1/dataset2 | 
Experiment 5 | BERT/RoBERTa/GPT-2 | - | zReddit dataset (sent via email) | dataset1/dataset2/Twitter
Experiment 6 | BERT/RoBERTa/GPT-2 | - | Reddit All patients and controls (with control text and labels) | dataset1/dataset2/Twitter
Experiment 7 | BERT/RoBERTa/GPT-2 | - | dataset1/dataset2 | 
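The "Continued pretraining" column refers to domain-adaptive pretraining on unlabeled Reddit text before any fine-tuning. A hedged sketch using masked-language modeling (the corpus file name and hyperparameters are assumptions; for GPT-2 one would instead load `AutoModelForCausalLM` and pass `mlm=False` to the collator):

```python
# Continued-pretraining sketch: masked-language modeling on unlabeled
# Reddit text. "reddit.txt" (one post per line) is a placeholder path.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

corpus = load_dataset("text", data_files={"train": "reddit.txt"})["train"]
corpus = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens and pads batches dynamically.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="pretrained", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=corpus,
        data_collator=collator).train()
model.save_pretrained("pretrained")  # reloaded later for fine-tuning
```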
In-domain experiments (fine-tuned and evaluated on the same dataset; see the GPT-2 padding note after the table):
Experiment | Model | Fine-tuning | Classification
---|---|---|---
Experiment 1 | BERT | zReddit (patients and controls) | zReddit
Experiment 2 | RoBERTa | zReddit (patients and controls) | zReddit
Experiment 3 | GPT-2 | zReddit (patients and controls) | zReddit
Experiment 4 | BERT | Reddit All (patients and controls) | Reddit All
Experiment 5 | RoBERTa | Reddit All (patients and controls) | Reddit All
Experiment 6 | GPT-2 | Reddit All (patients and controls) | Reddit All
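One GPT-2-specific detail for the classification experiments: GPT-2 ships without a padding token, so batched fine-tuning needs one assigned first. A common workaround (shown here as an assumption, not necessarily how the notebooks handle it) is to reuse the EOS token:

```python
# GPT-2 has no pad token by default; reuse EOS so batches can be padded.
from transformers import AutoTokenizer, GPT2ForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # keep model in sync
```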
Due to computational constraints, some models could not be trained on Google Colab; the results for those are left blank.