This repository is the result of my bachelor project, a three-week internship in the Fellay Lab at EPFL under the supervision of Sina Rüeger. The project performs a genome-to-genome (G2G) study of HBV-infected individuals from different populations, i.e. it looks for associations between host single nucleotide polymorphisms (SNPs) and viral amino acid variants. The analysis eventually focuses on a genetically related Asian subpopulation.
- plink version 1.9
- plink version 2
- Pyhattan
- assocplots
- Standard data science packages in Python (pandas, numpy, scipy, matplotlib, seaborn)
- Consistent data storage according to the paths defined in `src/setup.py`
The analyses are performed directly inside notebooks, and some of them store processed data, so the order of the notebooks (see below) matters. The notebooks can be converted to PDF with `nbconvert`, optionally with `--execute` to re-run them.
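As a rough sketch of the conversion step, the snippet below wraps the `jupyter nbconvert` command line in two small Python helpers. The helper names `nbconvert_command` and `convert_notebooks` are hypothetical and not part of this repository, and the sketch assumes that `jupyter` is on the `PATH` and that a LaTeX installation is available, which `nbconvert` needs for PDF output.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook, execute=False):
    """Build the `jupyter nbconvert` command that converts one notebook to PDF."""
    cmd = ["jupyter", "nbconvert", "--to", "pdf"]
    if execute:
        # --execute re-runs every cell before the conversion
        cmd.append("--execute")
    cmd.append(str(Path(notebook)))
    return cmd


def convert_notebooks(notebooks, execute=False):
    """Convert notebooks one by one, in the given order.

    check=True stops at the first failing notebook, which matters when
    later notebooks depend on data stored by earlier ones.
    """
    for notebook in notebooks:
        subprocess.run(nbconvert_command(notebook, execute=execute), check=True)
```

Calling `convert_notebooks([...], execute=True)` with the notebook paths listed in run order would then re-run and convert them in sequence.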
- Clinical data: `.csv` file
- Viral sequencing data: `.csv` file
- Host exome sequencing data: `.ped` and `.map` files
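For readers unfamiliar with the plink text formats, here is a minimal sketch of how `.map` and `.ped` files can be read with pandas. It is not the loading code used in the notebooks. The function and column names are assumptions made for illustration, and the sketch assumes the standard four-column `.map` layout and a `.ped` file with six sample columns followed by two allele columns per variant.

```python
import io

import pandas as pd

# Illustrative column names; plink text files carry no header row.
MAP_COLUMNS = ["chrom", "snp_id", "genetic_dist", "bp_pos"]
PED_META = ["fid", "iid", "father", "mother", "sex", "phenotype"]


def read_map(source):
    """Read a .map file: one variant per row, four whitespace-separated columns."""
    return pd.read_csv(source, sep=r"\s+", header=None, names=MAP_COLUMNS,
                       dtype={"chrom": str, "snp_id": str})


def read_ped(source, snp_ids):
    """Read a .ped file: six sample columns, then two allele columns per variant."""
    allele_cols = [f"{snp}_{k}" for snp in snp_ids for k in ("a1", "a2")]
    return pd.read_csv(source, sep=r"\s+", header=None,
                       names=PED_META + allele_cols, dtype=str)


# Tiny in-memory example (made-up data) instead of real files.
map_text = "1 rs1 0 1000\n1 rs2 0 2000\n"
ped_text = "F1 I1 0 0 1 2 A A G T\nF2 I2 0 0 2 1 A C T T\n"

variants = read_map(io.StringIO(map_text))
samples = read_ped(io.StringIO(ped_text), variants["snp_id"])
print(samples)
```

The `.map` file is read first because the `.ped` file does not name its variants: the order of the allele columns is only defined by the row order of the `.map` file.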
All computations are performed in notebooks, which one has to run in the following order:
- Clinical data notebook: process the clinical data from the `csv` file. Stores a DataFrame binary object.
- Viral data notebook: process viral data from a `csv`. Stores a processed DataFrame in a binary file.
- Joint viral and clinical data notebook: combine the two datasets. PCA colored with genotypes
- Host genotype data preparation notebook: quality control, application of filters
- Host genotype data analysis notebook: PCA, association analyses, clustering
- G2G of Asian subpopulation: prepare new dataset, try univariate models, implement multivariate models
- G2G computer: multivariate models computation, analysis of results
- Interpretation of results: extract and analyse significant associations
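
To give an idea of what a single host-virus association test looks like, the sketch below tests one host SNP against one viral amino acid variant with Fisher's exact test from `scipy`. This is a stand-in chosen for brevity, not the models implemented in the notebooks. The function names and the toy data are hypothetical, and the Bonferroni threshold is only one possible way to account for the many SNP-variant pairs tested in a G2G study.

```python
import numpy as np
from scipy.stats import fisher_exact


def snp_variant_association(carrier, has_variant):
    """Fisher's exact test between SNP carrier status and a viral variant.

    Both arguments are boolean sequences with one entry per individual.
    Returns the sample odds ratio and the two-sided p-value.
    """
    carrier = np.asarray(carrier, dtype=bool)
    has_variant = np.asarray(has_variant, dtype=bool)
    # 2x2 contingency table: rows = carrier yes/no, columns = variant yes/no
    table = [
        [np.sum(carrier & has_variant), np.sum(carrier & ~has_variant)],
        [np.sum(~carrier & has_variant), np.sum(~carrier & ~has_variant)],
    ]
    odds_ratio, p_value = fisher_exact(table)
    return odds_ratio, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold when n_tests pairs are tested."""
    return alpha / n_tests


# Made-up toy data: 16 individuals, 8 of them carriers of the SNP.
carrier = [True] * 8 + [False] * 8
has_variant = [True] * 7 + [False] + [True] + [False] * 7

odds_ratio, p_value = snp_variant_association(carrier, has_variant)
print(odds_ratio, p_value, p_value < bonferroni_threshold(0.05, n_tests=10))
```

A per-pair test like this ignores covariates such as population structure, which is one reason the analysis notebooks go on to multivariate models.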
Useful resources and references:
- tutorial/Statistics notebook: gathers the relevant statistical theory behind the analyses
- tutorial/Plink introduction notebook: basic procedures and commands of plink. Mainly follows the official tutorial.
- tutorial/Plink and Python notebook: how to import plink files into Python
- tutorial/Scitas tutorial: run jobs and launch Jupyter on a remote server
- tutorial/HapMap notebook: processing example data (official tutorial)