-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
8966eb5
commit e3df00b
Showing
1 changed file
with
322 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,325 @@ | ||
--- | ||
title: "Examples" | ||
--- | ||
|
||
Below you will find various examples of how to use `dupsifter`. Example datasets can be found in the `example` directory | ||
of the `dupsifter` source code. | ||
|
||
## Basic Usage | ||
|
||
Input to `dupsifter` comes from either the terminal's standard input (i.e., streamed input) or with an input BAM file. | ||
The default output is to stream the output, but it can also be written straight to an output file (`-o`/`--output`). | ||
|
||
Tools such as `biscuit` and `bwa-meth` stream their output. Examples of reading streamed input are: | ||
```bash | ||
# Streamed input and streamed output with BISCUIT | ||
biscuit align -@ 8 -b 1 hg38.fa read1.fastq read2.fastq | \ | ||
dupsifter hg38.fa | \ | ||
samtools sort -@ 2 -o biscuit.sorted.markdup.bam - | ||
|
||
# Streamed input and output BAM with bwa-meth | ||
bwameth.py \ | ||
--threads 8 \ | ||
--reference hg38.fa \ | ||
read1.fastq \ | ||
read2.fastq | \ | ||
dupsifter -o bwameth.unsorted.markdup.bam hg38.fa | ||
``` | ||
|
||
On the other hand, tools such as `bismark` and `bsbolt` default to writing a BAM file as output from their alignment. | ||
Below is an example of using an input BAM and writing to an output BAM or streaming the output to another tool: | ||
```bash | ||
# Input BAM and output BAM with Bismark | ||
bismark \ | ||
--parallel 2 \ | ||
--genome hg38 \ | ||
-1 read1.fastq \ | ||
-2 read2.fastq | ||
|
||
dupsifter -o bismark.unsorted.markdup.bam hg38.fa read1_bismark_bt2_pe.bam | ||
|
||
# Input BAM and streamed output with BSBolt | ||
bsbolt Align \ | ||
-t 8 \ | ||
-DB hg38 \ | ||
-O bsbolt.unsorted \ | ||
-F1 read1.fastq \ | ||
-F2 read2.fastq | ||
|
||
dupsifter hg38.fa bsbolt.unsorted.bam | \ | ||
samtools sort -@ 2 -o bsbolt.sorted.markdup.bam | ||
``` | ||
|
||
It should be noted that `dupsifter` expects it's input to be read-name grouped (i.e., reads with the same name next to | ||
one another in the BAM). For `biscuit`, `bismark`, `bsbolt`, and `bwa-meth`, the reads are output in this way. However, | ||
`gemBS` performs position sorting during its alignment. Therefore, if you want to use `dupsifter` with BAMs from | ||
`gemBS`, then you will have to group the reads by name either with `samtools sort` or `samtools collate`. An example | ||
with `samtools sort` is shown below: | ||
```bash | ||
# Alignment with gemBS | ||
# gembs.hg38.conf and gembs.sample.conf are configuration files that are specified ahead of time | ||
# The output BAM from gemBS in this example is gembs.position_sorted.bam | ||
gemBS prepare -c gembs.hg38.conf -t gembs.sample.conf | ||
gemBS map | ||
|
||
samtools sort -n -o gembs.name_sorted.bam gembs_position_sorted.bam | ||
|
||
dupsifter -o gembs.name_sorted.markdup.bam hg38.fa gembs.name_sorted.bam | ||
``` | ||
|
||
Note, if you supply a position sorted BAM to `dupsifter`, it will produce an error message along the lines of: | ||
```default | ||
[dupsifter] ERROR: Can't find read 1 and/or read 2 in 1 reads with read ID: <name of read>. | ||
Are these reads coordinate sorted? | ||
``` | ||
|
||
## Additional Usage | ||
|
||
### Single-end vs Paired-end | ||
|
||
`dupsifter` defaults to running for paired-end reads. In cases where single-end reads were aligned, include the | ||
`-s`/`--single-end` option: | ||
```bash | ||
dupsifter --single-end -o ouput.bam hg38.fa input.bam | ||
``` | ||
|
||
### Adding Mate Tags | ||
|
||
For aligners that don't generate mate tags (`MC` and `MQ`), `dupsifter` includes an option (`-m`/`--add-mate-tags`) for | ||
adding mate tags during the duplicate marking stage: | ||
```bash | ||
dupsifter --add-mate-tags -o ouput.bam hg38.fa input.bam | ||
``` | ||
|
||
Note, if a read already has one or both of the mate tags and `--add-mate-tags` is included, the pre-existing tags will | ||
not be overwritten. | ||
|
||
### Duplicate Marking WGS Data | ||
|
||
While `dupsifter` was written for WGBS data, it can also be used to mark WGS data with the `-W`/`--wgs-only` option: | ||
```bash | ||
bwa mem hg38.fa read1.fq.gz read2.fq.gz | \ | ||
dupsifter --wgs-only -o output.bam hg38.fa | ||
``` | ||
|
||
### Adjusting the Maximum Read Length | ||
|
||
For long-read data, the default maximum read length may not be sufficient. In this case, you can increase the maximum | ||
length with the `-l`/`--max-read-length` option: | ||
```bash | ||
dupsifter --max-read-length 20000 -o output.bam hg38.fa input.bam | ||
``` | ||
|
||
Note, for current short-read technologies, the default value is more than sufficient. | ||
|
||
### Adjusting the Minimum Base Quality | ||
|
||
For cases where the bisulfite strand information is not included, `dupsifter` infers the strand based on the number of | ||
C→T and G→A substitutions. By default, all bases in a read are used for counting the number of each | ||
substitution. However, in the case of a user wanting to use bases with a higher base quality, you can adjust the minimum | ||
base quality with `-b`/`--min-base-qual`: | ||
```bash | ||
dupsifter --min-base-qual 20 -o output.bam hg38.fa input.bam | ||
``` | ||
|
||
Note, this option is only used when the bisulfite strand has to be inferred and is ignored otherwise. | ||
|
||
### Duplicate Marking for Reads with Cell Barcodes | ||
|
||
For experiments that include cell barcodes, `dupsifter` is able to include the cell barcode as part of its signature | ||
with the `-B`/`--has-barcode` option: | ||
```bash | ||
dupsifter --has-barcode -o output.bam hg38.fa input.bam | ||
``` | ||
|
||
For more details on how the cell barcode can be included in the read entry, see `dupsifter --help` or the | ||
[Methodology](methodology.qmd) page. | ||
|
||
### Removing Duplicates | ||
|
||
Typically, downstream analysis tools will ignore reads flagged as duplicates that are left in the BAM file, but in cases | ||
where duplicates reads must be removed, you can remove duplicates with the `-r`/`--remove-dups` option: | ||
```bash | ||
dupsifter --remove-dups -o output.bam hg38.fa input.bam | ||
``` | ||
|
||
## Test Cases | ||
|
||
BAM files aligned with `biscuit`, `bismark`, `bsbolt`, `bwa-meth`, and `gembs`, as well as the FASTQ files used to | ||
generate them, have been provided for practicing using `dupsifter` and can be found in the `example` directory of the | ||
`dupsifter` source code. Below is the results found in the output stats file for each run. All examples are created from | ||
hg38 without contigs. For test cases, any hg38 reference that conforms to the UCSC naming scheme (i.e., `chr1` not `1`) | ||
should work to recreate the results shown. | ||
|
||
The following tool versions were used for the test cases: | ||
|
||
| tool | version | | ||
|:-----------:|:--------:| | ||
| `biscuit` | `1.2.1` | | ||
| `bismark` | `0.24.0` | | ||
| `bsbolt` | `1.6.0` | | ||
| `bwa-meth` | `0.2.6` | | ||
| `gembs` | `4.0.4` | | ||
| `samtools` | `1.17` | | ||
| `dupsifter` | `1.2.0` | | ||
|
||
### Test FASTQ Creation | ||
|
||
FASTQ files were were generated with [Sherman](https://github.com/FelixKrueger/Sherman): | ||
```bash | ||
Sherman \ | ||
-l 150 \ | ||
-n 900 \ | ||
--CG_conversion 70 \ | ||
--CH_conversion 1 \ | ||
--genome_folder hg38_noContig \ | ||
--paired_end \ | ||
--bwa_ending | ||
``` | ||
|
||
100 duplicates were then randomly selected using Python (random seed set to 2023) and appended to the FASTQ files | ||
created by Sherman. This produced a FASTQ with a duplicate rate of 10% (100 read pairs out of 1000). | ||
|
||
### BISCUIT | ||
```default | ||
$ dupsifter \ | ||
$ -o biscuit.unsorted.markdup.bam \ | ||
$ -O biscuit.dupsifter.stats \ | ||
$ hg38_noContig.fa \ | ||
$ biscuit.unsorted.bam | ||
$ | ||
$ cat biscuit.dupsifter.stats | ||
[dupsifter] processing mode: paired-end | ||
[dupsifter] number of individual reads processed: 2000 | ||
[dupsifter] number of reads with both reads mapped: 1000 | ||
[dupsifter] number of reads with only one read mapped to the forward strand: 0 | ||
[dupsifter] number of reads with only one read mapped to the reverse strand: 0 | ||
[dupsifter] number of reads with both reads marked as duplicates: 97 | ||
[dupsifter] number of reads on the forward strand marked as duplicates: 0 | ||
[dupsifter] number of reads on the reverse strand marked as duplicates: 0 | ||
[dupsifter] number of individual primary-alignment reads: 2000 | ||
[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 | ||
[dupsifter] number of reads with no reads mapped: 0 | ||
[dupsifter] number of reads with no primary reads: 0 | ||
``` | ||
|
||
To briefly highlight one of the pitfalls of marking duplicates, the reason there are not the expected 100 duplicates | ||
from the `biscuit` data (similar in `bismark`, `bsbolt`, and `bwa-meth`) is due to the aligner choosing different | ||
mapping locations for reads with multiple, equally likely potential locations. An example is shown here: | ||
```default | ||
103_chr3:92733152-92733265 99 chr3 92815142 0 150M = 92815106 114 (remainder not shown) | ||
103_chr3:92733152-92733265 147 chr3 92815106 0 150M = 92815142 114 (remainder not shown) | ||
dup_103_chr3:92733152-92733265 99 chr3 93567084 0 150M = 93567048 114 (remainder not shown) | ||
dup_103_chr3:92733152-92733265 147 chr3 93567048 0 150M = 93567084 114 (remainder not shown) | ||
``` | ||
|
||
### Bismark | ||
```default | ||
$ dupsifter \ | ||
$ -o bismark.unsorted.markdup.bam \ | ||
$ -O bismark.dupsifter.stats \ | ||
$ hg38_noContig.fa \ | ||
$ bismark.unsorted.bam | ||
$ | ||
$ cat bismark.dupsifter.stats | ||
[dupsifter] processing mode: paired-end | ||
[dupsifter] number of individual reads processed: 1924 | ||
[dupsifter] number of reads with both reads mapped: 962 | ||
[dupsifter] number of reads with only one read mapped to the forward strand: 0 | ||
[dupsifter] number of reads with only one read mapped to the reverse strand: 0 | ||
[dupsifter] number of reads with both reads marked as duplicates: 96 | ||
[dupsifter] number of reads on the forward strand marked as duplicates: 0 | ||
[dupsifter] number of reads on the reverse strand marked as duplicates: 0 | ||
[dupsifter] number of individual primary-alignment reads: 1924 | ||
[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 | ||
[dupsifter] number of reads with no reads mapped: 0 | ||
[dupsifter] number of reads with no primary reads: 0 | ||
``` | ||
|
||
During alignment, `bismark` splits the FASTQ into chunks when aligning with more than one thread. The aligned reads are | ||
then merged into one output BAM; however, the read order is not necessarily the same as in the FASTQ. This provides an | ||
opportunity to highlight `dupsifter` choosing the first read found with a given signature as the "non-duplicate." In | ||
`read1.fastq`, all reads that were selected as duplicates of the simulated reads had `dup_` prepended to the name and | ||
were placed at the end of the FASTQ. However, during alignment, some of the reads ended up before the original read in | ||
the BAM. Below is an example of the original read being marked as a duplicate. | ||
```default | ||
dup_80_chr16:52422066-52422154/1 99 chr16 52422066 42 150M = 52422005 211 (remainder not shown) | ||
dup_80_chr16:52422066-52422154/1 147 chr16 52422005 42 150M = 52422066 -211 (remainder not shown) | ||
80_chr16:52422066-52422154/1 1123 chr16 52422066 42 150M = 52422005 211 (remainder not shown) | ||
80_chr16:52422066-52422154/1 1171 chr16 52422005 42 150M = 52422066 -211 (remainder not shown) | ||
``` | ||
|
||
Note, while in this simulated example, there is a known order to the reads, in real experiments the original template | ||
strand that PCR duplicates were created from cannot be distinguished, so choosing the first appearance of a given | ||
signature is a reasonable choice to make for the "non-duplicate." | ||
|
||
### BSBolt | ||
```default | ||
$ dupsifter \ | ||
$ -o bsbolt.unsorted.markdup.bam \ | ||
$ -O bsbolt.dupsifter.stats \ | ||
$ hg38_noContig.fa \ | ||
$ bsbolt.unsorted.bam | ||
$ | ||
$ cat bsbolt.dupsifter.stats | ||
[dupsifter] processing mode: paired-end | ||
[dupsifter] number of individual reads processed: 2000 | ||
[dupsifter] number of reads with both reads mapped: 977 | ||
[dupsifter] number of reads with only one read mapped to the forward strand: 0 | ||
[dupsifter] number of reads with only one read mapped to the reverse strand: 0 | ||
[dupsifter] number of reads with both reads marked as duplicates: 97 | ||
[dupsifter] number of reads on the forward strand marked as duplicates: 0 | ||
[dupsifter] number of reads on the reverse strand marked as duplicates: 0 | ||
[dupsifter] number of individual primary-alignment reads: 2000 | ||
[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 | ||
[dupsifter] number of reads with no reads mapped: 23 | ||
[dupsifter] number of reads with no primary reads: 0 | ||
``` | ||
|
||
### bwa-meth | ||
```default | ||
$ dupsifter \ | ||
$ -o bwameth.unsorted.markdup.bam \ | ||
$ -O bwameth.dupsifter.stats \ | ||
$ hg38_noContig.fa \ | ||
$ bwameth.unsorted.bam | ||
$ | ||
$ cat bwameth.dupsifter.stats | ||
[dupsifter] processing mode: paired-end | ||
[dupsifter] number of individual reads processed: 2000 | ||
[dupsifter] number of reads with both reads mapped: 1000 | ||
[dupsifter] number of reads with only one read mapped to the forward strand: 0 | ||
[dupsifter] number of reads with only one read mapped to the reverse strand: 0 | ||
[dupsifter] number of reads with both reads marked as duplicates: 97 | ||
[dupsifter] number of reads on the forward strand marked as duplicates: 0 | ||
[dupsifter] number of reads on the reverse strand marked as duplicates: 0 | ||
[dupsifter] number of individual primary-alignment reads: 2000 | ||
[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 | ||
[dupsifter] number of reads with no reads mapped: 0 | ||
[dupsifter] number of reads with no primary reads: 0 | ||
``` | ||
|
||
### gemBS | ||
```default | ||
$ samtools sort -n -o gembs.name_sorted.bam gembs.position_sorted.bam | ||
$ | ||
$ dupsifter \ | ||
$ -o gembs.name_sorted.markdup.bam \ | ||
$ -O gembs.dupsifter.stats \ | ||
$ hg38_noContig.fa \ | ||
$ gembs.name_sorted.bam | ||
$ | ||
$ cat gembs.dupsifter.stats | ||
[dupsifter] processing mode: paired-end | ||
[dupsifter] number of individual reads processed: 2000 | ||
[dupsifter] number of reads with both reads mapped: 1000 | ||
[dupsifter] number of reads with only one read mapped to the forward strand: 0 | ||
[dupsifter] number of reads with only one read mapped to the reverse strand: 0 | ||
[dupsifter] number of reads with both reads marked as duplicates: 100 | ||
[dupsifter] number of reads on the forward strand marked as duplicates: 0 | ||
[dupsifter] number of reads on the reverse strand marked as duplicates: 0 | ||
[dupsifter] number of individual primary-alignment reads: 2000 | ||
[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 | ||
[dupsifter] number of reads with no reads mapped: 0 | ||
[dupsifter] number of reads with no primary reads: 0 | ||
``` |