This is a small set of scripts to identify integration sites and orientation of viral integrations into a host genome. The strategy is to augment the host genome with a contig representing the viral genome and look for chimeric alignments where one part of the alignment is in the host genome and the other part of the alignment is in the viral genome. We use those alignments to try to call integration sites. There are some previously existing tools to do this, such as http://bioinfo.mc.vanderbilt.edu/VirusFinder/ and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3582262/ and more recently http://www.nature.com/articles/srep11534. We had some trouble getting dependable results from some of the earlier methods so we wrote up these scripts which seem to give reasonable results.
Augment the human genome with the virus genome. Do that by catting on the virus genome as an extra chromosome to the human genome, and then index that augmented genome with bwa.
Samples must be demultiplexed. Once they are demultiplexed, align to the augmented human genome with bwa.
bwa mem -Y -t $THREADS $GENOME ${sample}_R1_001.fastq.gz ${sample}_R2_001.fastq.gz
mark duplicates with sambamba
sambamba markdup bwa-mem/$samplename.tmp.bam bwa-mem/$samplename.bam
This script keeps alignments that are chimeric for the human genome and the virus genome. In this example we used K03455 as the virus genome. This skips duplicates and identifies chimeric alignments in two ways. 1) if the alignment is marked as supplementary and is in the virus sequence and 2) if the alignment has a supplementary alignment associated with it and one of the alignments is in the virus sequence and the other is in the human sequence.
python chimeric.py K03455 bwa-mem/$samplename.bam
python orientation.py bam_file virus_contig > out.sites
Rscript combine.R out.sites out.combined.sites