A natural language processing approach to evaluating the translational value of AI in biomedical research using NIH award data. This analysis applies K-means clustering to data from awarded NIH grant applications to identify categories of grant topics in an unsupervised manner and to highlight differences in estimated translational value between topics.
If using this code base, please cite: Eweje, F. R. et al. Translatability Analysis of National Institutes of Health–Funded Biomedical Research That Applies Artificial Intelligence. JAMA Netw. Open 5, e2144742 (2022).
This project requires Python 3.7. Install pipenv if it is not already installed: pip install pipenv

Perform the analysis by running the run.sh shell script from the project directory: sh run.sh
This performs the following steps:
- pipenv lock --clear - initializes the Python virtual environment
- pipenv install - initializes the Python virtual environment
- pipenv run python setup.py - sets up the directory structure and installs the necessary NLTK modules
- pipenv run python nih_reporter_query.py --search_terms "search_terms.txt" --operator "advanced" --start_year 1985 --end_year 2021 - queries the NIH RePORTER database for matching awards
- pipenv run python feature_extraction.py --max_df 0.1 --max_features 500 - performs feature extraction on the document corpus
- pipenv run python find_k.py --trials 5 --max_k 120 --num_features 500 - performs an empirical search for the optimal K
- pipenv run python analyze_clusters.py --k ### --trials ### - creates the clusters with K-means clustering and analyzes funding and citation data; k = number of clusters, trials = number of clustering trials to run (a minimal sketch of this step appears below)
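For orientation, the clustering step behaves roughly like the sketch below. This is a minimal illustration, not the exact analyze_clusters.py implementation; the TF-IDF matrix X, the feature name list, and the choice of k are placeholders for the outputs of feature_extraction.py and the K selected with find_k.py.

```python
# Minimal sketch: cluster TF-IDF vectors with MiniBatchKMeans over several
# trials, keep the best-fitting model, and list the top terms per centroid.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_and_describe(X, feature_names, k, trials=5, top_n=10, seed=0):
    """X: sparse TF-IDF matrix (n_awards x n_features); k: number of clusters."""
    best = None
    for t in range(trials):
        model = MiniBatchKMeans(n_clusters=k, random_state=seed + t)
        model.fit(X)
        if best is None or model.inertia_ < best.inertia_:
            best = model                      # keep the trial with the lowest SSE
    centroids = []
    for center in best.cluster_centers_:
        top_idx = np.argsort(center)[::-1][:top_n]   # highest-weight centroid terms
        centroids.append([feature_names[i] for i in top_idx])
    return best, centroids
```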
Awards and publications are collected from the NIH RePORTER database, while citation data are collected from the NIH iCite search tool. The query can be executed with "and", "or", or advanced (mixed "and"/"or") logic per the NIH RePORTER API. If the query is an "and"/"or" query, "search_terms.txt" should be formatted as a list of terms, one per line:
term 1
term 2
term 3
...
If the query is an advanced query, "search_terms.txt" should be a single-line query formatted per the RePORTER syntax:
( \"dna\" or \"rna\" ) and ( \"machine learning\" or \"artificial intelligence\" )
The feature extraction script extracts the desired number of TF-IDF features from the dataset and also summarizes NIH funding in the data by funding institute, by year, and by funding mechanism.
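A minimal sketch of this step, assuming a standard scikit-learn TfidfVectorizer over the award abstracts (the actual preprocessing in feature_extraction.py may differ):

```python
# Sketch of TF-IDF extraction mirroring the --max_df / --max_features flags.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_features(abstracts, max_df=0.1, max_features=500):
    """abstracts: list of award abstract strings."""
    vectorizer = TfidfVectorizer(
        max_df=max_df,              # drop terms appearing in more than 10% of documents
        max_features=max_features,  # keep the 500 highest-weight terms
        stop_words="english",
    )
    X = vectorizer.fit_transform(abstracts)   # sparse matrix: n_awards x max_features
    # use vectorizer.get_feature_names() on scikit-learn < 1.0
    return X, vectorizer.get_feature_names_out()
```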
The optimal number of clusters K (topics within the dataset) can be determined empirically using the find_k.py script, which monitors the silhouette score and the sum of squared errors as K varies.
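Conceptually, the sweep looks like the sketch below (illustrative only; the script's step size, sampling, and trial handling may differ):

```python
# Sketch of a K sweep: track SSE (inertia) and silhouette score while K varies.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def sweep_k(X, max_k=120, step=5, seed=0):
    scores = {}
    for k in range(2, max_k + 1, step):
        model = MiniBatchKMeans(n_clusters=k, random_state=seed).fit(X)
        scores[k] = {
            "sse": model.inertia_,                                  # sum of squared errors
            "silhouette": silhouette_score(X, model.labels_,
                                           sample_size=2000,        # subsample for speed
                                           random_state=seed),
        }
    return scores
```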
Results from each run are returned in the "results" directory:
- actual_vs_projected.png - linear regression comparing projected 2021 funding by cluster to actual 2021 funding. The quality of this analysis assumes that funding in your dataset has followed an approximately exponential year-to-year trend (see the projection sketch after this list)
- centroids.txt - text file with centroid words listed for each cluster
- clusters - csv files containing awards assigned to each cluster (1985-2020)
- clusters_test - csv files containing awards assigned to each cluster (2021)
- final_data.csv - summary table
- umap.png - UMAP visualization of clusters
- supp_info.docx - Microsoft Word document with tables containing 5 representative awards from each cluster, selected by maximum silhouette score.
- model_clustering.pkl - pickle containing dictionary with the following keys:
- "yr_avg_cost" - average award funding by cluster
- "yr_total_cost" - total award funding by cluster
- "size" - cluster size
- "data_by_cluster" - nested lists of dictionaries representing individual awards assigned to each cluster
- "centroids" - list of lists of centroids by cluster (first 10 elements)
- "score" - Silhouette score
- "model" - MiniBatchKMeans model
- "complete_centroids" - list of lists of centroids by cluster (all elements)
- "labels" - ordered list of cluster labels by award (same order as data loaded by from data.pkl)
- "mechanisms" - mechanisms # List of lists: [r01, u01, r44, u24, r21, u54]
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── README.md
├── analyze_clusters.py
├── data
│ ├── by_funder.csv
│ ├── by_mechanism.csv
│ ├── by_year.csv
│ ├── citations.csv
│ ├── data.pkl
│ ├── features
│ ├── nih_institutes.csv
│ ├── processed-data.pkl
│ ├── publications.csv
│ ├── raw_data.csv
│ ├── test-data.pkl
│ └── vectorizer.pkl
├── feature_extraction.py
├── figures
│ ├── ...
├── find_k.py
├── nih_reporter_query.py
├── results
│ ├── 12-15-2021--214341
│ │ ├── centroids
│ │ ├── clusters
│ │ │ ├── cluster-0.csv
│ │ │ ├── cluster-1.csv
│ │ │ ├── ...
│ │ │ ├── cluster-30.csv
│ │ ├── clusters_test
│ │ │ ├── cluster-0.csv
│ │ │ ├── cluster-1.csv
│ │ │ ├── ...
│ │ │ ├── cluster-30.csv
│ │ ├── final_data.csv
│ │ ├── model_clustering.pkl
│ │ ├── supp_info.docx
│ │ └── umap.png
├── run.sh
├── search_terms.txt
└── setup.py