This repo was created as part of an NLP class during my master's program. The goal was to classify "possible delusion" in a corpus of YouTube comments in which knowledgeable informants (family, friends, caregivers) described an individual with dementia.
Read the paper, included in this repo as `BMIN521-Final-Report.pdf`.
- Initialize the conda environment using `environment.yml`: `conda env create -f environment.yml`.
- Install the spacy_experimental coreference model. Note, some of the steps below are already handled in step 1, but it won't hurt to rerun them:

  ```
  pip install spacy_experimental
  pip install chardet
  pip install "thinc[torch]"
  pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
  ```
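If the wheel installed correctly, a quick check like the following should load the model and print coreference clusters. The sample sentence is invented (in the spirit of the corpus); the `coref_clusters_*` keys follow the experimental model's naming:

```python
import spacy

# Load the experimental coreference pipeline installed above.
nlp = spacy.load("en_coreference_web_trf")

# Invented example of an informant describing a parent.
doc = nlp("My father repeats himself constantly. He thinks someone is stealing his mail.")

# The model writes clusters into doc.spans under keys like "coref_clusters_1".
for key, spans in doc.spans.items():
    if key.startswith("coref_clusters"):
        print(key, [span.text for span in spans])
```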
Note, this was an iterative research project; it's not possible to run it start to finish via the CLI. That said, an overview of the steps is provided below:
- Copy `.env_template` to `.env` and populate it with your own API key / parameters.
- Create a list of YouTube channels and videos to extract comments from and store them in `data`. The files were named `channels_en.txt` and `vids_to_search.txt`. Note, channel IDs have to be acquired via the YouTube Data API; you can't just use the @username (see the channel-ID sketch after this list).
- Run `youtube_extract.py`. This will create a SQLite database if it doesn't already exist (a sketch of the comment-paging loop also follows the list).
- Run SQL scripts 1-2 using your favorite SQLite interface. Note, anything annotation-related will fail, as you haven't created that table yet.
- Run through `notebooks/knowledgeable_informant_narratives.ipynb`.
- Run SQL script 3.
- Run SQL scripts 4-5, get some data, and start annotating! I kept it simple and did this in Excel, then loaded the data back into SQLite (see the load-back sketch after this list). The first batch of annotation informed the OpenAI prompting strategy and produced a rubric for annotating a test set.
- SQL script 6 controls which data to send to the OpenAI model.
- Run `classify.py` (a rough sketch of the kind of completion call involved also appears after this list).
- Run through `notebooks/compute_results.ipynb`.
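As noted in the steps above, a channel's ID can't be looked up directly from its @handle. A minimal sketch of the search-based workaround that worked at the time, using `google-api-python-client` (the handle and key here are placeholders):

```python
from googleapiclient.discovery import build

API_KEY = "..."  # your YouTube Data API key, i.e. the one stored in .env
youtube = build("youtube", "v3", developerKey=API_KEY)

# channels().list had no handle lookup at the time of writing, so one
# workaround is a channel-type search for the @handle.
resp = (
    youtube.search()
    .list(part="snippet", q="@SomeDementiaChannel", type="channel", maxResults=1)
    .execute()
)
print(resp["items"][0]["snippet"]["channelId"])
```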
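`youtube_extract.py` itself isn't reproduced here; as a rough sketch of the kind of paging loop comment extraction requires (reusing the `youtube` client built above; the video ID is one from the table further down):

```python
# Top-level comments arrive at most 100 per page; follow nextPageToken to the end.
comments, page_token = [], None
while True:
    resp = (
        youtube.commentThreads()
        .list(part="snippet", videoId="zpth3xzvbjU", maxResults=100, pageToken=page_token)
        .execute()
    )
    for item in resp["items"]:
        comments.append(item["snippet"]["topLevelComment"]["snippet"]["textOriginal"])
    page_token = resp.get("nextPageToken")
    if not page_token:
        break

print(len(comments), "comments retrieved")
```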
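Loading the Excel annotations back into SQLite can be as simple as the following. The file, database, and table names here are hypothetical; match them to the repo's actual schema:

```python
import sqlite3

import pandas as pd

# Hypothetical names: annotations_batch1.xlsx, youtube_comments.db, annotation.
df = pd.read_excel("annotations_batch1.xlsx")
with sqlite3.connect("youtube_comments.db") as conn:
    df.to_sql("annotation", conn, if_exists="append", index=False)
```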
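`classify.py` sends data to the OpenAI API; as a hedged sketch of what such a call looked like with the pre-1.0 `openai` library current in early 2023 (the prompt and model here are placeholders, not the repo's actual rubric):

```python
import openai  # pre-1.0 interface (openai<1.0)

openai.api_key = "..."  # from .env

# Placeholder prompt; the real prompting strategy came out of the annotation rubric.
prompt = (
    "Does this comment describe a possible delusion? Answer Yes or No.\n\n"
    "Comment: He keeps insisting the neighbors are stealing his tools.\nAnswer:"
)
resp = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=1,
    temperature=0,
)
print(resp["choices"][0]["text"].strip())
```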
See `BMIN521-Final-Report.pdf` for a detailed overview of methods.
- 500 response error from the API, 2/16/2023.
- No text in the response. I suspect this is due to the model predicting a stop sequence as the first token. In the playground I received: "The model predicted a completion that begins with a stop sequence, resulting in no output. Consider adjusting your prompt or stop sequences." (A hypothetical guard sketch follows the table below.)
- I relied on YouTube search to derive a corpus of dementia-related videos. People may have searched for the content, found a useful video, and begun internalizing its vocabulary, so the corpus may not simulate an individual who has yet to find helpful content.
- It's not clear why some data present via the browser is absent from the API; this may be the result of anti-spam filters. In other cases, channel owners moderate comment content and a comment stays under review, but the comment count has already been incremented in YouTube's database. Given the complexity of managing a system with billions of videos, you can't expect eventual consistency of aggregate stats; there's a Quora post that explains this. (A minimal probe sketch also follows the table below.)
| Video ID | Browser Count | API Count | API Collection Date |
|---|---|---|---|
| zx7gLoPMO-s | 105 | 64 | 2023-02-08 |
| zpth3xzvbjU | 40 | 40 | 2023-02-08 |
| -3V9eSYR6Cs | 51 | 49 | 2023-02-11 |
| 1EGhhZdQ_ts | 49 | 49 | 2023-02-11 |
| mJk02XI_sRA | 5245 | 3888 | 2023-02-11 |
| CWnILUjkgXg | 1824 | 1565 | 2023-02-08 |
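For context on the table above, `statistics.commentCount` is the aggregate counter the browser displays, and it can disagree with what a `commentThreads` crawl actually returns. A minimal probe of the aggregate side (reading the two count columns as aggregate vs. retrieved is my interpretation):

```python
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="...")  # key from .env

# Aggregate counter for one of the videos in the table; compare this against
# the number of comments a full commentThreads crawl returns.
resp = youtube.videos().list(part="statistics", id="mJk02XI_sRA").execute()
print(resp["items"][0]["statistics"]["commentCount"])
```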
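For the empty-output failure noted above, one hypothetical guard is to detect the empty completion and retry once without stop sequences (a sketch against the pre-1.0 `openai` library, not what `classify.py` actually does):

```python
import openai  # pre-1.0 interface (openai<1.0)


def complete_with_guard(prompt, **kwargs):
    """Retry once if the model predicts a stop sequence as its first token."""
    resp = openai.Completion.create(prompt=prompt, **kwargs)
    text = resp["choices"][0]["text"].strip()
    if not text:
        # Likely hit the "completion begins with a stop sequence" case;
        # drop the stop sequences and trim the output manually instead.
        kwargs.pop("stop", None)
        resp = openai.Completion.create(prompt=prompt, **kwargs)
        text = resp["choices"][0]["text"].strip()
    return text
```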
- https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- https://platform.openai.com/docs/api-reference/parameter-details
- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
- https://medium.com/@AlyssaSha/fine-tuning-gpt-3-using-python-for-keywords-classification-6c4970526c68