A pipeline for analyzing local enrichment of single nucleotide variants (SNVs) and indels near polymorphic transposable element (TE) sites.
The workflow builds on (1) TE annotation with RepeatMasker, (2) Illumina paired-end reads mapped with BWA, (3) polymorphic TE site identification with MELT, and (4) indel and SNV calls from FreeBayes. It uses a combination of shell scripts, Python, and R.
- TE annotation of reference genome with RepeatMasker and selection of desired TEs
- BWA read mapping and FreeBayes variant calling
- Polymorphic TE identification with MELT
- Polymorphic TE quality filtering and reformatting
- SNV/indel quality filtering and enrichment analysis
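
To illustrate the idea behind the final step, here is a minimal Python sketch of a local enrichment calculation. It is not taken from this repository's scripts. The function names, the input layout (a list of `(chrom, pos)` TE sites and a dict of sorted variant positions per chromosome), and the window size are all assumptions chosen for illustration. It also ignores overlapping windows and chromosome ends, which a real analysis would need to handle.

```python
from bisect import bisect_left, bisect_right


def count_variants_near_sites(te_sites, variants, window=1000):
    """Count variants within +/- `window` bp of each TE site.

    te_sites: list of (chrom, pos) tuples (hypothetical input layout)
    variants: dict mapping chrom -> sorted list of variant positions
    """
    counts = []
    for chrom, pos in te_sites:
        positions = variants.get(chrom, [])
        # Binary search for the first and last variant inside the window.
        lo = bisect_left(positions, pos - window)
        hi = bisect_right(positions, pos + window)
        counts.append(hi - lo)
    return counts


def fold_enrichment(counts, window, total_variants, genome_length):
    """Variant density near TE sites divided by the genome-wide density."""
    # Each window spans 2 * window + 1 positions, including the site itself.
    observed = sum(counts) / (len(counts) * (2 * window + 1))
    expected = total_variants / genome_length
    return observed / expected


# Toy data, invented for this example only.
variants = {"chr1": [100, 150, 900, 5000], "chr2": [40]}
te_sites = [("chr1", 120), ("chr2", 1000)]

counts = count_variants_near_sites(te_sites, variants, window=50)
print(counts)  # [2, 0]

enrichment = fold_enrichment(counts, window=50, total_variants=5, genome_length=10_000)
print(round(enrichment, 2))
```

A fold enrichment above 1 suggests more variants near TE sites than the genome-wide average, though any real comparison would need a statistical test and matched control regions.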
A workflow for running MELT on simulated TE insertion data is also available in the simulations folder.
- Python v2.7/v3.6.0 (https://www.python.org/)
- R v3.6.2 (https://www.r-project.org/)
- RepeatMasker v4.0.9 (http://repeatmasker.org/)
- Bedtools2 v2.26.0 (https://bedtools.readthedocs.io/en/latest/)
- BWA v0.7.17 (http://bio-bwa.sourceforge.net/)
- Picard toolkit v2.5.0 (http://broadinstitute.github.io/picard/)
- SAMtools v1.9 (http://www.htslib.org/)
- BCFtools v1.9 (http://www.htslib.org/)
- FreeBayes v1.2.0 (https://github.com/freebayes/freebayes) - uses Python v2.7 until v1.3.2
- GATK v3.8.0 (https://gatk.broadinstitute.org/hc/en-us)
- VCFtools v0.1.16 (https://vcftools.github.io/index.html)
- MELT v2.2.0 (https://melt.igs.umaryland.edu/) - requires Java and Bowtie2
NOTE: The program versions listed are those used in the original pipeline, but newer versions should be compatible.
Please note that many of these scripts contain hard-coded directory and queue variables, and the shell scripts are set up for a UGE cluster, so they will likely need editing before use in another environment.