Skip to content

Latest commit

 

History

History
90 lines (69 loc) · 4.31 KB

README.md

File metadata and controls

90 lines (69 loc) · 4.31 KB

Build Status

Introduction

This is a simple script to generate fasta formatted sequences that include IUPAC symbols for the different SNPs in the region.

What happens under the hood is that the script will fetch all the SNPs in the given region using Ensembl's REST API. Then, different filters can be used to restrict the list to variants with an available MAF (available, in this case, means that it can be found using the Ensembl API) or to variants that have a MAF higher than a given threshold. This can be used to keep only fairly common variants.

Installation

This script has a hard dependency on gepyto. After installing this package, the script can be used like any other unix executable. It is compatible with both Python 2.7+ and Python 3.

To install gepyto, simply clone or download the package and use:

python setup.py install

Usage

The usage is fairly straightforward:

./variant_aware_fasta.py --help

# usage: variant_aware_fasta.py [-h] [--maf-over MAF_OVER] [--maf-exists] region
# 
# Generate a FASTA where the SNPs are represented as their IUPAC code.
# 
# positional arguments:
#   region               The region for sequence extraction (e.g.
#                        chr3:12345-12567)
# 
# optional arguments:
#   -h, --help           show this help message and exit
#   --maf-over MAF_OVER  Filter on the minor allele frequency.
#   --maf-exists         If this flag is used, only variants with available
#                        Ensembl MAFs will be considered.

You just call the script with a region using the chrXX:START-END format. Optional arguments are --maf-over to filter for common variants and --maf-exists to keep only variants that have a MAF defined by the Ensembl database.

The reverse complement of the sequence is also output.

Example

./variant_aware_fasta.py chr11:2549137-2549192
# > chr11:2549137-2549192 with IUPAC coded variants (maf > 0)
# TGCYRTGTCCCTGTYTTGCAGCTTCCTCMTCRTCCYGGTCTKCYTCATCTTYRGYR
# > reversed_compl_chr11:2549137-2549192 with IUPAC coded variants (maf > 0)
# YRCYRAAGATGARGMAGACCRGGAYGAKGAGGAAGCTGCAARACAGGGACAYRGCA

The previous call also generated a tab separated file named variants_in_11_2549137-2549192.txt containing the following information:

chrom pos rs ref alt evidence maf most_severe_consequence
11 2549140 rs377350869 C T ESP Intron variant
11 2549141 rs28730661 G A Multiple_observations,Frequency,1000Genomes,ESP 0.00826446 Intron variant
11 2549151 rs201682200 C T Multiple_observations,Frequency,ESP Splice region variant
11 2549165 rs199472684 A C Cited Missense variant
11 2549168 rs199473449 G A Cited Missense variant
11 2549172 rs199472685 T C Cited Missense variant
11 2549178 rs199472686 G T Cited Missense variant
11 2549180 rs199473450 C T Cited Missense variant
11 2549188 rs149143353 C T Frequency,ESP Synonymous variant
11 2549189 rs120074192 A G Multiple_observations,Cited Missense variant
11 2549191 rs377520734 C T ESP Synonymous variant
11 2549192 rs199472687 G A Cited Missense variant

Testing

Very basic testing is also available:

python -m unittest -q variant_aware_fasta