HIPS: Hiding In Plain Site

Unannotated structural variants in public genomic data sets

Contributors:

Rebecca Torene, Alexandrea Stylianou, Jitong Cai, Ariel Gershman, Alexandra Weber, and Nancy Hansen

With thanks to former UMUC Bioinformatics master program students:

Mary Schramko
Upasana Pandey
Eric Keller
Mike Fox
Al Koroma
David Jalali

Premise:

Public data, such as gnomAD, makes VCFs of short nucleotide variants (SNVs) available. There is anecdotal evidence that some apparent “insertions” are actually part of larger structural variants

Goal:

Use VCFs of SNVs to call structural variants. In doing so, can we identify SVs that have otherwise escaped detection.

Description:

Large, aggregate, genomics data sets of short nucleotide variants (SNVs) such as gnomAD/ExAC are an invaluable resource to the research and clinical genetics communities. Such data sets are often used as a control to determine the frequencies of variants in unaffected populations and gnomAD has recently released a structural variant call set on a subset of their samples. There is evidence, however, that additional SVs exist in the data, but haven't yet been annotated. For example, this insertion is actually indicative of a known deletion. Such anecdotal findings of deletions, processed pseudogenes, inversions, tandem duplications, and mobile element insertions exist in publicly available data, hiding there in plain sight.

Without individual alignment files, the broader genomics community cannot evaluate structural variants in the complete data set. We, therefore, developed HIPS, a structural variant caller that uses VCF of SNVs as input and then outputs structural variant calls in bed format. HIPS starts from VCF, parses insertion sequences and converts to fastA, aligns by BLAST, and filters BLAST results for possible structural variants. Running the snakefile then merges similar calls and adds annotation. The Shiny app takes gnomAD data processed by HIPS and provides a web interface.

Suggested Future Improvements:

This project may be on GitHub and it may have a great name (HIPS), but we cannot deny the fact that this is the product of a 3-day hackathon and that it is imperfect. We welcome the GitHub community to make improvements and suggest the following as future improvements:

fastaToBed.py can be modified to split VCFs by position (already splits by chromosome) so that parallelization is more efficient
fastaToBed.py arguments could be refined (e.g., meiOnly could be boolean flag instead of string input)
stitch together fastaToBed.py and snakefile to have a seamless pipeline from VCF to annotated SV calls
Add more flexibility for file names, species, etc.
Option to output SV calls as VCF (currently outputs BED)
Add a low quality flag for SVs based on bitscore and based on overlap of breakpoints with simple repeats

Web Interface

hosted at https://hidinginplainsight.shinyapps.io/mergedapps/

Graph View

Table View

Results from applying to gnomAD SNV VCFs:

Getting Started

For front end development - download R and Rstudio - more info here

Once Rstudio is installed, import:

shiny
dplyr
shinyWidgets
devtools

To host the shiny app we will use:

https://www.shinyapps.io

For back end development - follow the steps below to create your environment and install an editor.

I prefer the free version of Pycharm, found here

Download miniconda and choose Python 3.7 64-bit installer.

After downloading it make sure to open a new terminal and test with `which conda`.

For example:

MDMBASTYLIANOU:~ astylianou900045$ which conda
/Users/astylianou900045/miniconda3/bin/conda

Add conda channels - do this to get all the packages needed for this project:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Then check that everything installed correctly with: conda config --show channels

For example:

MDMBASTYLIANOU:~ astylianou900045$ conda config --show channels
channels:
  - conda-forge
  - bioconda
  - defaults

Now that conda is installed, clone the repo

3. git clone https://github.com/NCBI-Hackathons/Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets.git

Cloning will make a new directory. After it is cloned, enter that directory with 'cd'

For example:

4. cd Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets

Time to create and activate the environment for this project:

conda env create -f environment.yml
conda activate sv_env

You will know that it was activated because sv_env will show in your command prompt, for example:

(sv_env) MDMBASTYLIANOU:~ astylianou900045$

Run test data in MEI-only mode. Should create a bed file with MEI calls.

python /path/to/repo/vcfToBed.py \
-inFile /path/to/repo/test/test.vcf.gz \
-outFile out.fa -chr 21 \
-dir /path/to/repo/resources -meiOnly True

Where:

-inFile is a gzip'ed VCF with SNV calls in hg19.
-outFile is the name of the output for the fastA
-chr is the chromosome to work on (without leading 'chr')
-dir is the directory where BLAST dictionaries are located. The MEI BLAST dictionary is located in the resources directory of this repo
-meiOnly is whether to detect only MEI structural variants.

Running on other SV types (deletions, inversions, and tandem duplications)

Download human reference sequences for each chromosome for hg19 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/README.txt
Make BLAST database for each chromosome

for f in *.fa.gz; do gunzip -k $f; done
for f in *.fa; do makeblastdb -in $f -dbtype nucl; done

Make sure your MEI BLAST database and chromosome BLAST databases are in the same directory
Run test data

python /path/to/repo/vcfToBed.py \
-inFile /path/to/repo/test/test.vcf.gz \
-outFile out.fa -chr 21 \
-dir /path/to/repo/resources -meiOnly ""

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
code		code
data		data
output		output
resources		resources
tests		tests
util		util
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
hackathon_2019.pptx		hackathon_2019.pptx
vcfToBed.py		vcfToBed.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HIPS: Hiding In Plain Site

Unannotated structural variants in public genomic data sets

Contributors:

Rebecca Torene, Alexandrea Stylianou, Jitong Cai, Ariel Gershman, Alexandra Weber, and Nancy Hansen

Premise:

Goal:

Description:

Suggested Future Improvements:

Web Interface

Graph View

Table View

Results from applying to gnomAD SNV VCFs:

Getting Started

For front end development - download R and Rstudio - more info here

For back end development - follow the steps below to create your environment and install an editor.

Running on other SV types (deletions, inversions, and tandem duplications)

About

Releases

Packages

Contributors 5

Languages

License

NCBI-Hackathons/Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets

Folders and files

Latest commit

History

Repository files navigation

HIPS: Hiding In Plain Site

Unannotated structural variants in public genomic data sets

Contributors:

Rebecca Torene, Alexandrea Stylianou, Jitong Cai, Ariel Gershman, Alexandra Weber, and Nancy Hansen

Premise:

Goal:

Description:

Suggested Future Improvements:

Web Interface

Graph View

Table View

Results from applying to gnomAD SNV VCFs:

Getting Started

For front end development - download R and Rstudio - more info here

For back end development - follow the steps below to create your environment and install an editor.

Running on other SV types (deletions, inversions, and tandem duplications)

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages