This pipeline is designed to identify significant patterns and relationships within a set of abstracts, offering insights into potential genetic contributions within the biomedical domain.
The topic modeling and data retrieval pipeline is available as a Jupyter notebook (grpm_bertopic.ipynb) and is designed to uncover hidden topics in the PubMed genetic literature.
data: Collects the data produced during notebook execution.
utils: Contains accessory Python code.
grpm_bertopic.ipynb: The Jupyter notebook containing all the steps, code, and detailed information about the GRPM BERTopic analysis process.
bertopic_tutorial.ipynb: A Jupyter notebook created for educational purposes.
To perform the GRPM BERTopic analysis, follow the steps laid out in the grpm_bertopic.ipynb notebook. Following these steps, you'll be able to explore the connections between genetic variations and the MeSH terms provided.
The general workflow has been depicted below:
The GRPM BERTopic analysis uses a structured approach to extract and examine themes from a collection of scholarly abstracts related to human genetic polymorphisms. The key steps of the analysis are outlined below:
The analysis starts by retrieving a source dataset of scholarly abstracts focusing on human genetic polymorphisms. This dataset, termed the GRPM Dataset, integrates data from sources like LitVar and PubMed.
The preprocessing phase uses a user-defined set of Medical Subject Headings (MeSH) terms to curate the corpus of abstracts. An example set of MeSH terms is available in the data/ref-mesh.csv file. This step is crucial because it narrows the corpus to the abstracts relevant to the chosen terms, preparing it for effective topic modeling.
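As a rough illustration of this curation step, the sketch below keeps only the records whose MeSH annotations overlap a reference set of terms. It is not the notebook's actual code: the function names, the `mesh` column name, the record layout, and the toy records are all assumptions made for the example, and the real layout of data/ref-mesh.csv may differ.

```python
import csv
import io


def load_mesh_terms(csv_file, column="mesh"):
    """Read a set of lower-cased MeSH terms from an open CSV file.

    The column name "mesh" is an assumption made for this sketch.
    """
    reader = csv.DictReader(csv_file)
    return {row[column].strip().lower() for row in reader if row.get(column)}


def filter_abstracts(records, mesh_terms):
    """Keep the records annotated with at least one reference MeSH term."""
    kept = []
    for record in records:
        annotated = {term.strip().lower() for term in record["mesh"]}
        if annotated & mesh_terms:
            kept.append(record)
    return kept


# Toy stand-in for a reference file; terms containing commas must be quoted.
reference_csv = io.StringIO('mesh\nObesity\n"Diabetes Mellitus, Type 2"\n')
mesh_terms = load_mesh_terms(reference_csv)

# Hypothetical records; identifiers and texts are placeholders.
records = [
    {"pmid": "1", "mesh": ["Obesity", "Polymorphism, Single Nucleotide"],
     "abstract": "toy abstract A"},
    {"pmid": "2", "mesh": ["Breast Neoplasms"], "abstract": "toy abstract B"},
    {"pmid": "3", "mesh": ["diabetes mellitus, type 2"],
     "abstract": "toy abstract C"},
]

corpus = filter_abstracts(records, mesh_terms)
print([record["pmid"] for record in corpus])  # ['1', '3']
```

Matching is case-insensitive here so that small differences in capitalization between the reference file and the annotations do not drop relevant abstracts.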
The refined corpus undergoes topic modeling using the BERTopic architecture. This framework employs hierarchical clustering techniques to uncover the latent thematic structure of the abstracts, providing an overview of how the resulting topics are organized.
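A minimal sketch of fitting a topic model on the refined corpus might look like the following. The wrapper function `fit_topic_model` and its default `min_topic_size` are assumptions made for illustration; the notebook may configure BERTopic differently (for example with custom embedding or clustering components). Only the `BERTopic` constructor and `fit_transform` come from the library itself.

```python
def fit_topic_model(abstracts, min_topic_size=10):
    """Fit a BERTopic model on a list of abstract strings.

    Hypothetical helper for this sketch. The import lives inside the
    function so the module still loads where bertopic is not installed.
    """
    from bertopic import BERTopic  # third-party: pip install bertopic

    topic_model = BERTopic(min_topic_size=min_topic_size)
    # fit_transform returns one topic id per document (-1 marks outliers)
    # together with the assignment probabilities.
    topics, probabilities = topic_model.fit_transform(abstracts)
    return topic_model, topics
```

With a sufficiently large list of abstracts, `topic_model, topics = fit_topic_model(abstracts)` followed by `topic_model.get_topic_info()` returns a table of the discovered topics and their sizes. Very small corpora tend to fail or produce only outliers, so a few hundred documents or more is a sensible starting point.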
Finally, selected topics undergo post-processing, highlighting specific themes for in-depth exploration. This stage enhances the understanding of genetic influences pertinent to biomedical fields, as specified by the custom MeSH terms.
If you encounter any issues or have any questions, feel free to open an issue in this repository.
All required libraries and the specific versions used for this project are listed in the grpm_bertopic.ipynb notebook. Make sure to install these dependencies before going through the notebook.
For didactic purposes, a commented use case is available that walks through each BERTopic component.