Bnapus.Rmd

---
title: "B napus masking"
output: html_notebook
---

OK we are having so much trouble with read mapping in B. napus that I want to try to simulate reads and see how many mismap.  These regions could the be masked for SNP calling.

I will start with simulating RNAseq reads.

```{r}
library(polyester)
library(tidyverse)
library(Biostrings)
library(Rsamtools)
library(rtracklayer)
library(stringr)
# source("Rnaseq_sim_helper_functions.R") #not needed?
library(snowfall)
```

## Simulate reads

get cDNA (originally from ftp://bradata:zhl410ivf@brassicadb.org/Brassica_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz with a modification date of 4/15/15)

take a look...
```{r}
cdna <- readDNAStringSet("~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz")
cdna
head(names(cdna))
```

I can't get "meanmodel" to work, so specify # of transcripts by hand 
```{r}

totalreads <- 1e8
readproportions <- width(cdna) / sum(width(cdna))
reads_per_tx <- totalreads*readproportions
sum(reads_per_tx)
mean(reads_per_tx)
mean(width(cdna)) # so ~ 1 read initiating per base, on average...
```

Takes about 6.5 hours.  I guess I could have split this up onto multiprocessors and then combined...
```{r, eval=FALSE}

system.time(
  simulate_experiment(fasta="~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz",
                      outdir="Bnapus_sim",
                      reads_per_transcript = reads_per_tx,
                      num_reps=1,
                      readlen=100,
                      paired=FALSE,
                      error_model="illumina5",
                      bias="rnaf",
                      gzip=TRUE,
                      fold_changes = 1
  )
)
```
take a look
```{r, engine='bash'}
gunzip -cf Bnapus_sim/sample_01.fasta.gz | head -n 20
```
oops I generated paired end reads which I didn't really mean to.

We need to convert the "/" to "_" in the read names so that the transcript source goes with the read ID.

```{r, engine='bash', eval=FALSE}
gunzip -c sample_01.fasta.gz | sed "s*/*_*" | gzip -c > sample_01_reformat.fasta.gz
```

## Map using STAR

First create indexed reference

```{r, engine='bash', eval=FALSE}
STAR --runMode genomeGenerate \
     --runThreadN 3 \
     --genomeDir /Users/jmaloof/Sequences/ref_genomes/B_napus \
     --genomeFastaFiles /Users/jmaloof/Sequences/ref_genomes/B_napus/Brassica_napus_v4.1.chromosomes.fa \
     --sjdbGTFfile /Users/jmaloof/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3 \
     --sjdbGTFtagExonParentTranscript Parent \
     --sjdbGTFfeatureExon CDS #not perfect because UTR fields
```

Now map reads.  STAR can take fasta reads so will not convert to fastq

below doesn't work on my laptop (out of memory)...move to Whitney...

also had to change the parameters files so that STAR returned an unsorted bam file...it was running out of memory on sort even when I limited it.

```{r, engine='bash',eval=FALSE}
cd Bnapus_sim
STAR \
               --parametersFiles ../STAR.params.whitney.Bnapus.1 \
               --readFilesIn sample_01_reformat.fasta.gz \
               --readFilesCommand 'gunzip -c' \
               --outFileNamePrefix STAR_Bnapus_sim1 \
               --twopassMode Basic \
               --runThreadN 10
```

sorted and indexing reads
```{r, engine='bash', eval=FALSE}
samtools sort -m 15G --reference ~/Sequences/ref_genomes/B_napus/Brassica_napus_v4.1.chromosomes.fa --threads 6 STAR_Bnapus_sim1Aligned.out.bam > STAR_Bnapus_sim1Aligned.out.SORT.bam

samtools index STAR_Bnapus_sim1Aligned.out.SORT.bam
```

## Compare true and mapped position

For each read we want to know if it went to the correct location.

each read is tagged with tx name

So first we create a genomic ranges object from the GTF.
Then we step through the BAM file one gene at a time, extract the reads, and ask how many belong it.  At the same time we could keep track of where the mismapping genes came from...

### get gff
```{r}
gff <- import("~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3",feature.type="mRNA")
gff
```

### count hits

iterate over chromosomes and then genes in the gff, pulling all reads that map to that gene and then askng how many originated from that gene

```{r, eval=FALSE}
system.time(
  counts.list <- lapply(levels(seqnames(gff)), function(chr) {
    #go chromosome by chromosome to keep from having memory errors
    gff.small <- sort(gff[seqnames(gff)==chr])
    cat(chr,"started at",paste0(lubridate::now()),"\n")
    
    #get a list of all the reads that map to each gene on the chromosome
    #each list entry corresopnds to one gene
    read.list <- scanBam(file="Bnapus_sim/STAR_Bnapus_sim1Aligned.out.SORT.bam",
                         param = ScanBamParam(
                           what="qname", #grep("seq|qual|mate",scanBamWhat(),invert = TRUE,value = TRUE),
                           mapqFilter = 100,
                           #tag=c("NH","HI","AS","nM"),
                           which=gff.small))
    
    #now for each gene count how many reads that map to the gene
    #actually originated from that gene
    t(sapply(1:length(gff.small), function(i) { 
      read_count <- length(read.list[[i]]$qname)
      match_count <- sum(str_detect(
        read.list[[i]]$qname,
        unlist(gff.small[i]$Alias)))
      percent_correct <- match_count/read_count*100
      c(read_count=read_count, match_count=match_count, percent_correct=percent_correct)
      
    } # function(i)
    ) # sapply(1:length)
    ) # t
  } # function(chr)
  ) #sapply(gff.small)
) # system.time
names(counts.list) <- levels(seqnames(gff))
save(counts.list,file="Bnapus_sim/counts.list.Rdata")
```

### plot
```{r}
load("Bnapus_sim/counts.list.Rdata")
str(counts.list)
```

```{r}
sapply(counts.list,function(x) min(x[,"percent_correct"],na.rm=TRUE)) #so: worth plotting
```

```{r}
pdf("Bnapus_sim/count_plots.pdf",height = 4, width=8)
for (chr in names(counts.list)) {
  counts <- as.data.frame(get(chr,counts.list))
  counts$index <- 1:nrow(counts)
  pl <- ggplot(counts,aes(x=index,y=percent_correct,size=read_count)) +
    geom_point() + ylim(0,100)
    ggtitle(chr)
  print(pl)
}
dev.off()
```