This repository includes a set of tools and models used in the NLP pipeline for automatic extraction and encoding of data from social media, to facilitate access and retrieval of information. Twitter and Reddit are used here as independent social media sources. The NLP pipeline for automated extraction and encoding of information involves Data Collection and Preprocessing, a Long Covid Filter (only for Twitter data), Entity Extraction, and Normalization components.
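As a rough illustration of how these stages chain together, here is a minimal sketch; the function names and module layout below are placeholders, not the actual interfaces in this repository.

```python
# Illustrative only: these function names are placeholders, not the repo's real API.
from typing import List, Dict

def collect_and_preprocess(raw_posts: List[str]) -> List[str]:
    # Data Collection and Preprocessing: clean whitespace, URLs, user handles, etc.
    return [" ".join(post.split()) for post in raw_posts]

def long_covid_filter(posts: List[str]) -> List[str]:
    # Long Covid filter (Twitter data only): keep posts that look like first-hand reports.
    return [p for p in posts if "long covid" in p.lower()]

def extract_entities(posts: List[str]) -> List[Dict]:
    # Entity Extraction: run a clinical NER model over each post (placeholder output).
    return [{"text": p, "entities": []} for p in posts]

def normalize(records: List[Dict]) -> List[Dict]:
    # Normalization: map each extracted entity to a UMLS concept (CUI).
    return records

if __name__ == "__main__":
    tweets = ["Six months after COVID I still have brain fog and fatigue #longcovid"]
    out = normalize(extract_entities(long_covid_filter(collect_and_preprocess(tweets))))
    print(out)
```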
The Vector Institute’s Industry Innovation and AI Engineering teams worked with Roche Canada, Deloitte and TELUS to explore this question. Their project: applying natural language processing (NLP) techniques to social media posts made by people with long COVID to see if any patterns arise. This involved creating a machine learning pipeline and testing the capabilities of various NLP models on first-hand testimony from social media. The hope is that these patterns, if identified, could reveal clues about when and how frequently symptoms arise and where clusters of the condition occur. Any insights could be shared with clinicians to hone their research questions, identify trends early, or inform treatment strategies.
To learn more about this work, please read our blog post. For details about our experimental setup, please read our paper titled Towards Providing Clinical Insights on Long Covid from Twitter Data.
A Kepler HTML visualization is available here.
We provide a suite of scripts to download and create Twitter and Reddit data, along with a list of hashtags, etc. The process consists of three steps.
- Navigate to `SocialMediaData_creation/`.
- There is an example config file used to highlight how to configure your run. Set up your config file with the various includes, excludes, hashtags, etc. (see the sketch below this list).
- Run `SocialMediaData_creation/main.py` and pass the necessary `key_file`, `credential_key`, `config_file`, and `output_file` parameters with `-k`, `-ck`, `-c`, and `-o` respectively, via the command line.
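As an illustration of step two, here is a minimal sketch of assembling such a config with Python's `configparser`; the section and option names (`hashtags`, `keywords`, `dates`, `include`, `exclude`) are assumptions, so defer to the example config under `sample/` for the actual schema.

```python
# Hypothetical sketch only: section and option names are assumptions, not the
# schema used by this repository. See the example config under sample/.
import configparser

config = configparser.ConfigParser()
config["hashtags"] = {"include": "#longcovid, #longhaulers"}
config["keywords"] = {"exclude": "giveaway, promo"}
config["dates"] = {"start": "2020-03-01", "end": "2021-12-31"}

# Write the config so it can be passed to main.py via the -c argument.
with open("my_config_run.ini", "w") as f:
    config.write(f)
```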
A sample command using the config under `./sample`:

`python main.py -k sample/api_keys.yaml -ck v2_key -c sample/config_run.ini`
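The `-k` and `-ck` arguments amount to selecting one credential block out of a YAML key file. Below is a hypothetical sketch of that lookup; the `bearer_token` field name is an assumption based on Twitter API v2 authentication, not the repository's actual key-file schema.

```python
# Hypothetical sketch: select the credential block named by -ck (e.g. "v2_key")
# from the YAML file passed with -k. The bearer_token field is an assumption.
import yaml  # requires pyyaml

with open("sample/api_keys.yaml") as f:
    keys = yaml.safe_load(f)

credentials = keys["v2_key"]                    # chosen via the -ck argument
bearer_token = credentials.get("bearer_token")  # used to authenticate API v2 requests
```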
Our UMLSBERT model, fine-tuned on the n2c2 (2010) challenge data for entity extraction, is available on Hugging Face.
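As a usage illustration, the checkpoint can be loaded with the Hugging Face `transformers` token-classification pipeline; the model identifier below is a placeholder, so substitute the actual UMLSBERT checkpoint linked above.

```python
# Hypothetical usage sketch; "your-org/umlsbert-n2c2-ner" is a placeholder id,
# not the name of the published checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/umlsbert-n2c2-ner",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

print(ner("Still dealing with brain fog, chest pain and fatigue months after covid."))
```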
The normalization steps are shown in the Colab notebooks here. The unique concepts for entity mapping and normalization can be found here. The list of extracted entities before mapping can be found here.
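To illustrate the mapping step, here is a small sketch that joins extracted entity mentions against a unique-concepts lookup table; the file name and column names are assumptions, so adapt them to the resources linked above.

```python
# Hypothetical sketch of normalizing extracted entity mentions to UMLS CUIs via
# a unique-concepts lookup. File and column names are assumptions.
import csv

cui_lookup = {}
with open("unique_concepts.csv", newline="") as f:
    for row in csv.DictReader(f):
        cui_lookup[row["entity_text"].lower()] = row["cui"]

extracted = ["brain fog", "Fatigue", "chest pain"]
normalized = [(mention, cui_lookup.get(mention.lower(), "UNMAPPED")) for mention in extracted]
print(normalized)
```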
MetaMapLite is used to obtain the CUI codes that label entities; it can be installed from here.
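For completeness, a minimal sketch of invoking a local MetaMapLite install from Python is shown below; the install path, wrapper-script invocation, and output handling are assumptions, so check the MetaMapLite documentation for the exact usage on your system.

```python
# Hypothetical sketch: run a local MetaMapLite install over a file of text to
# obtain candidate CUIs. Paths and invocation details are assumptions.
import subprocess

METAMAPLITE_DIR = "/opt/public_mm_lite"        # placeholder install location
INPUT_FILE = "/tmp/extracted_entities.txt"     # plain-text input, one mention per line

subprocess.run(
    ["./metamaplite.sh", INPUT_FILE],
    cwd=METAMAPLITE_DIR,
    check=True,
)
# Depending on the configured output format, the annotations (including CUIs)
# are written to an output file next to the input or to stdout; see the docs.
```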
Please consider citing our work if you found it useful in your research: