-
Notifications
You must be signed in to change notification settings - Fork 1
/
Bnapus.Rmd
188 lines (151 loc) · 6.05 KB
/
Bnapus.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
title: "B napus masking"
output: html_notebook
---
OK we are having so much trouble with read mapping in B. napus that I want to try to simulate reads and see how many mismap. These regions could the be masked for SNP calling.
I will start with simulating RNAseq reads.
```{r}
library(polyester)
library(tidyverse)
library(Biostrings)
library(Rsamtools)
library(rtracklayer)
library(stringr)
# source("Rnaseq_sim_helper_functions.R") #not needed?
library(snowfall)
```
## Simulate reads
get cDNA (originally from ftp://bradata:[email protected]/Brassica_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz with a modification date of 4/15/15)
take a look...
```{r}
cdna <- readDNAStringSet("~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz")
cdna
head(names(cdna))
```
I can't get "meanmodel" to work, so specify # of transcripts by hand
```{r}
totalreads <- 1e8
readproportions <- width(cdna) / sum(width(cdna))
reads_per_tx <- totalreads*readproportions
sum(reads_per_tx)
mean(reads_per_tx)
mean(width(cdna)) # so ~ 1 read initiating per base, on average...
```
Takes about 6.5 hours. I guess I could have split this up onto multiprocessors and then combined...
```{r, eval=FALSE}
system.time(
simulate_experiment(fasta="~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3.cds.fa.gz",
outdir="Bnapus_sim",
reads_per_transcript = reads_per_tx,
num_reps=1,
readlen=100,
paired=FALSE,
error_model="illumina5",
bias="rnaf",
gzip=TRUE,
fold_changes = 1
)
)
```
take a look
```{r, engine='bash'}
gunzip -cf Bnapus_sim/sample_01.fasta.gz | head -n 20
```
oops I generated paired end reads which I didn't really mean to.
We need to convert the "/" to "_" in the read names so that the transcript source goes with the read ID.
```{r, engine='bash', eval=FALSE}
gunzip -c sample_01.fasta.gz | sed "s*/*_*" | gzip -c > sample_01_reformat.fasta.gz
```
## Map using STAR
First create indexed reference
```{r, engine='bash', eval=FALSE}
STAR --runMode genomeGenerate \
--runThreadN 3 \
--genomeDir /Users/jmaloof/Sequences/ref_genomes/B_napus \
--genomeFastaFiles /Users/jmaloof/Sequences/ref_genomes/B_napus/Brassica_napus_v4.1.chromosomes.fa \
--sjdbGTFfile /Users/jmaloof/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFfeatureExon CDS #not perfect because UTR fields
```
Now map reads. STAR can take fasta reads so will not convert to fastq
below doesn't work on my laptop (out of memory)...move to Whitney...
also had to change the parameters files so that STAR returned an unsorted bam file...it was running out of memory on sort even when I limited it.
```{r, engine='bash',eval=FALSE}
cd Bnapus_sim
STAR \
--parametersFiles ../STAR.params.whitney.Bnapus.1 \
--readFilesIn sample_01_reformat.fasta.gz \
--readFilesCommand 'gunzip -c' \
--outFileNamePrefix STAR_Bnapus_sim1 \
--twopassMode Basic \
--runThreadN 10
```
sorted and indexing reads
```{r, engine='bash', eval=FALSE}
samtools sort -m 15G --reference ~/Sequences/ref_genomes/B_napus/Brassica_napus_v4.1.chromosomes.fa --threads 6 STAR_Bnapus_sim1Aligned.out.bam > STAR_Bnapus_sim1Aligned.out.SORT.bam
samtools index STAR_Bnapus_sim1Aligned.out.SORT.bam
```
## Compare true and mapped position
For each read we want to know if it went to the correct location.
each read is tagged with tx name
So first we create a genomic ranges object from the GTF.
Then we step through the BAM file one gene at a time, extract the reads, and ask how many belong it. At the same time we could keep track of where the mismapping genes came from...
### get gff
```{r}
gff <- import("~/Sequences/ref_genomes/B_napus/Brassica_napus.annotation_v5.gff3",feature.type="mRNA")
gff
```
### count hits
iterate over chromosomes and then genes in the gff, pulling all reads that map to that gene and then askng how many originated from that gene
```{r, eval=FALSE}
system.time(
counts.list <- lapply(levels(seqnames(gff)), function(chr) {
#go chromosome by chromosome to keep from having memory errors
gff.small <- sort(gff[seqnames(gff)==chr])
cat(chr,"started at",paste0(lubridate::now()),"\n")
#get a list of all the reads that map to each gene on the chromosome
#each list entry corresopnds to one gene
read.list <- scanBam(file="Bnapus_sim/STAR_Bnapus_sim1Aligned.out.SORT.bam",
param = ScanBamParam(
what="qname", #grep("seq|qual|mate",scanBamWhat(),invert = TRUE,value = TRUE),
mapqFilter = 100,
#tag=c("NH","HI","AS","nM"),
which=gff.small))
#now for each gene count how many reads that map to the gene
#actually originated from that gene
t(sapply(1:length(gff.small), function(i) {
read_count <- length(read.list[[i]]$qname)
match_count <- sum(str_detect(
read.list[[i]]$qname,
unlist(gff.small[i]$Alias)))
percent_correct <- match_count/read_count*100
c(read_count=read_count, match_count=match_count, percent_correct=percent_correct)
} # function(i)
) # sapply(1:length)
) # t
} # function(chr)
) #sapply(gff.small)
) # system.time
names(counts.list) <- levels(seqnames(gff))
save(counts.list,file="Bnapus_sim/counts.list.Rdata")
```
### plot
```{r}
load("Bnapus_sim/counts.list.Rdata")
str(counts.list)
```
```{r}
sapply(counts.list,function(x) min(x[,"percent_correct"],na.rm=TRUE)) #so: worth plotting
```
```{r}
pdf("Bnapus_sim/count_plots.pdf",height = 4, width=8)
for (chr in names(counts.list)) {
counts <- as.data.frame(get(chr,counts.list))
counts$index <- 1:nrow(counts)
pl <- ggplot(counts,aes(x=index,y=percent_correct,size=read_count)) +
geom_point() + ylim(0,100)
ggtitle(chr)
print(pl)
}
dev.off()
```