diff --git a/docs/examples.qmd b/docs/examples.qmd index 96a35d9..0c7855d 100644 --- a/docs/examples.qmd +++ b/docs/examples.qmd @@ -1,3 +1,325 @@ --- title: "Examples" --- + +Below you will find various examples of how to use `dupsifter`. Example datasets can be found in the `example` directory +of the `dupsifter` source code. + +## Basic Usage + +Input to `dupsifter` comes from either the terminal's standard input (i.e., streamed input) or with an input BAM file. +The default output is to stream the output, but it can also be written straight to an output file (`-o`/`--output`). + +Tools such as `biscuit` and `bwa-meth` stream their output. Examples of reading streamed input are: +```bash +# Streamed input and streamed output with BISCUIT +biscuit align -@ 8 -b 1 hg38.fa read1.fastq read2.fastq | \ +dupsifter hg38.fa | \ +samtools sort -@ 2 -o biscuit.sorted.markdup.bam - + +# Streamed input and output BAM with bwa-meth +bwameth.py \ + --threads 8 \ + --reference hg38.fa \ + read1.fastq \ + read2.fastq | \ +dupsifter -o bwameth.unsorted.markdup.bam hg38.fa +``` + +On the other hand, tools such as `bismark` and `bsbolt` default to writing a BAM file as output from their alignment. +Below is an example of using an input BAM and writing to an output BAM or streaming the output to another tool: +```bash +# Input BAM and output BAM with Bismark +bismark \ + --parallel 2 \ + --genome hg38 \ + -1 read1.fastq \ + -2 read2.fastq + +dupsifter -o bismark.unsorted.markdup.bam hg38.fa read1_bismark_bt2_pe.bam + +# Input BAM and streamed output with BSBolt +bsbolt Align \ + -t 8 \ + -DB hg38 \ + -O bsbolt.unsorted \ + -F1 read1.fastq \ + -F2 read2.fastq + +dupsifter hg38.fa bsbolt.unsorted.bam | \ +samtools sort -@ 2 -o bsbolt.sorted.markdup.bam +``` + +It should be noted that `dupsifter` expects it's input to be read-name grouped (i.e., reads with the same name next to +one another in the BAM). For `biscuit`, `bismark`, `bsbolt`, and `bwa-meth`, the reads are output in this way. However, +`gemBS` performs position sorting during its alignment. Therefore, if you want to use `dupsifter` with BAMs from +`gemBS`, then you will have to group the reads by name either with `samtools sort` or `samtools collate`. An example +with `samtools sort` is shown below: +```bash +# Alignment with gemBS +# gembs.hg38.conf and gembs.sample.conf are configuration files that are specified ahead of time +# The output BAM from gemBS in this example is gembs.position_sorted.bam +gemBS prepare -c gembs.hg38.conf -t gembs.sample.conf +gemBS map + +samtools sort -n -o gembs.name_sorted.bam gembs_position_sorted.bam + +dupsifter -o gembs.name_sorted.markdup.bam hg38.fa gembs.name_sorted.bam +``` + +Note, if you supply a position sorted BAM to `dupsifter`, it will produce an error message along the lines of: +```default +[dupsifter] ERROR: Can't find read 1 and/or read 2 in 1 reads with read ID: . +Are these reads coordinate sorted? +``` + +## Additional Usage + +### Single-end vs Paired-end + +`dupsifter` defaults to running for paired-end reads. In cases where single-end reads were aligned, include the +`-s`/`--single-end` option: +```bash +dupsifter --single-end -o ouput.bam hg38.fa input.bam +``` + +### Adding Mate Tags + +For aligners that don't generate mate tags (`MC` and `MQ`), `dupsifter` includes an option (`-m`/`--add-mate-tags`) for +adding mate tags during the duplicate marking stage: +```bash +dupsifter --add-mate-tags -o ouput.bam hg38.fa input.bam +``` + +Note, if a read already has one or both of the mate tags and `--add-mate-tags` is included, the pre-existing tags will +not be overwritten. + +### Duplicate Marking WGS Data + +While `dupsifter` was written for WGBS data, it can also be used to mark WGS data with the `-W`/`--wgs-only` option: +```bash +bwa mem hg38.fa read1.fq.gz read2.fq.gz | \ +dupsifter --wgs-only -o output.bam hg38.fa +``` + +### Adjusting the Maximum Read Length + +For long-read data, the default maximum read length may not be sufficient. In this case, you can increase the maximum +length with the `-l`/`--max-read-length` option: +```bash +dupsifter --max-read-length 20000 -o output.bam hg38.fa input.bam +``` + +Note, for current short-read technologies, the default value is more than sufficient. + +### Adjusting the Minimum Base Quality + +For cases where the bisulfite strand information is not included, `dupsifter` infers the strand based on the number of +C→T and G→A substitutions. By default, all bases in a read are used for counting the number of each +substitution. However, in the case of a user wanting to use bases with a higher base quality, you can adjust the minimum +base quality with `-b`/`--min-base-qual`: +```bash +dupsifter --min-base-qual 20 -o output.bam hg38.fa input.bam +``` + +Note, this option is only used when the bisulfite strand has to be inferred and is ignored otherwise. + +### Duplicate Marking for Reads with Cell Barcodes + +For experiments that include cell barcodes, `dupsifter` is able to include the cell barcode as part of its signature +with the `-B`/`--has-barcode` option: +```bash +dupsifter --has-barcode -o output.bam hg38.fa input.bam +``` + +For more details on how the cell barcode can be included in the read entry, see `dupsifter --help` or the +[Methodology](methodology.qmd) page. + +### Removing Duplicates + +Typically, downstream analysis tools will ignore reads flagged as duplicates that are left in the BAM file, but in cases +where duplicates reads must be removed, you can remove duplicates with the `-r`/`--remove-dups` option: +```bash +dupsifter --remove-dups -o output.bam hg38.fa input.bam +``` + +## Test Cases + +BAM files aligned with `biscuit`, `bismark`, `bsbolt`, `bwa-meth`, and `gembs`, as well as the FASTQ files used to +generate them, have been provided for practicing using `dupsifter` and can be found in the `example` directory of the +`dupsifter` source code. Below is the results found in the output stats file for each run. All examples are created from +hg38 without contigs. For test cases, any hg38 reference that conforms to the UCSC naming scheme (i.e., `chr1` not `1`) +should work to recreate the results shown. + +The following tool versions were used for the test cases: + +| tool | version | +|:-----------:|:--------:| +| `biscuit` | `1.2.1` | +| `bismark` | `0.24.0` | +| `bsbolt` | `1.6.0` | +| `bwa-meth` | `0.2.6` | +| `gembs` | `4.0.4` | +| `samtools` | `1.17` | +| `dupsifter` | `1.2.0` | + +### Test FASTQ Creation + +FASTQ files were were generated with [Sherman](https://github.com/FelixKrueger/Sherman): +```bash +Sherman \ + -l 150 \ + -n 900 \ + --CG_conversion 70 \ + --CH_conversion 1 \ + --genome_folder hg38_noContig \ + --paired_end \ + --bwa_ending +``` + +100 duplicates were then randomly selected using Python (random seed set to 2023) and appended to the FASTQ files +created by Sherman. This produced a FASTQ with a duplicate rate of 10% (100 read pairs out of 1000). + +### BISCUIT +```default +$ dupsifter \ +$ -o biscuit.unsorted.markdup.bam \ +$ -O biscuit.dupsifter.stats \ +$ hg38_noContig.fa \ +$ biscuit.unsorted.bam +$ +$ cat biscuit.dupsifter.stats +[dupsifter] processing mode: paired-end +[dupsifter] number of individual reads processed: 2000 +[dupsifter] number of reads with both reads mapped: 1000 +[dupsifter] number of reads with only one read mapped to the forward strand: 0 +[dupsifter] number of reads with only one read mapped to the reverse strand: 0 +[dupsifter] number of reads with both reads marked as duplicates: 97 +[dupsifter] number of reads on the forward strand marked as duplicates: 0 +[dupsifter] number of reads on the reverse strand marked as duplicates: 0 +[dupsifter] number of individual primary-alignment reads: 2000 +[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 +[dupsifter] number of reads with no reads mapped: 0 +[dupsifter] number of reads with no primary reads: 0 +``` + +To briefly highlight one of the pitfalls of marking duplicates, the reason there are not the expected 100 duplicates +from the `biscuit` data (similar in `bismark`, `bsbolt`, and `bwa-meth`) is due to the aligner choosing different +mapping locations for reads with multiple, equally likely potential locations. An example is shown here: +```default +103_chr3:92733152-92733265 99 chr3 92815142 0 150M = 92815106 114 (remainder not shown) +103_chr3:92733152-92733265 147 chr3 92815106 0 150M = 92815142 114 (remainder not shown) +dup_103_chr3:92733152-92733265 99 chr3 93567084 0 150M = 93567048 114 (remainder not shown) +dup_103_chr3:92733152-92733265 147 chr3 93567048 0 150M = 93567084 114 (remainder not shown) +``` + +### Bismark +```default +$ dupsifter \ +$ -o bismark.unsorted.markdup.bam \ +$ -O bismark.dupsifter.stats \ +$ hg38_noContig.fa \ +$ bismark.unsorted.bam +$ +$ cat bismark.dupsifter.stats +[dupsifter] processing mode: paired-end +[dupsifter] number of individual reads processed: 1924 +[dupsifter] number of reads with both reads mapped: 962 +[dupsifter] number of reads with only one read mapped to the forward strand: 0 +[dupsifter] number of reads with only one read mapped to the reverse strand: 0 +[dupsifter] number of reads with both reads marked as duplicates: 96 +[dupsifter] number of reads on the forward strand marked as duplicates: 0 +[dupsifter] number of reads on the reverse strand marked as duplicates: 0 +[dupsifter] number of individual primary-alignment reads: 1924 +[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 +[dupsifter] number of reads with no reads mapped: 0 +[dupsifter] number of reads with no primary reads: 0 +``` + +During alignment, `bismark` splits the FASTQ into chunks when aligning with more than one thread. The aligned reads are +then merged into one output BAM; however, the read order is not necessarily the same as in the FASTQ. This provides an +opportunity to highlight `dupsifter` choosing the first read found with a given signature as the "non-duplicate." In +`read1.fastq`, all reads that were selected as duplicates of the simulated reads had `dup_` prepended to the name and +were placed at the end of the FASTQ. However, during alignment, some of the reads ended up before the original read in +the BAM. Below is an example of the original read being marked as a duplicate. +```default +dup_80_chr16:52422066-52422154/1 99 chr16 52422066 42 150M = 52422005 211 (remainder not shown) +dup_80_chr16:52422066-52422154/1 147 chr16 52422005 42 150M = 52422066 -211 (remainder not shown) +80_chr16:52422066-52422154/1 1123 chr16 52422066 42 150M = 52422005 211 (remainder not shown) +80_chr16:52422066-52422154/1 1171 chr16 52422005 42 150M = 52422066 -211 (remainder not shown) +``` + +Note, while in this simulated example, there is a known order to the reads, in real experiments the original template +strand that PCR duplicates were created from cannot be distinguished, so choosing the first appearance of a given +signature is a reasonable choice to make for the "non-duplicate." + +### BSBolt +```default +$ dupsifter \ +$ -o bsbolt.unsorted.markdup.bam \ +$ -O bsbolt.dupsifter.stats \ +$ hg38_noContig.fa \ +$ bsbolt.unsorted.bam +$ +$ cat bsbolt.dupsifter.stats +[dupsifter] processing mode: paired-end +[dupsifter] number of individual reads processed: 2000 +[dupsifter] number of reads with both reads mapped: 977 +[dupsifter] number of reads with only one read mapped to the forward strand: 0 +[dupsifter] number of reads with only one read mapped to the reverse strand: 0 +[dupsifter] number of reads with both reads marked as duplicates: 97 +[dupsifter] number of reads on the forward strand marked as duplicates: 0 +[dupsifter] number of reads on the reverse strand marked as duplicates: 0 +[dupsifter] number of individual primary-alignment reads: 2000 +[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 +[dupsifter] number of reads with no reads mapped: 23 +[dupsifter] number of reads with no primary reads: 0 +``` + +### bwa-meth +```default +$ dupsifter \ +$ -o bwameth.unsorted.markdup.bam \ +$ -O bwameth.dupsifter.stats \ +$ hg38_noContig.fa \ +$ bwameth.unsorted.bam +$ +$ cat bwameth.dupsifter.stats +[dupsifter] processing mode: paired-end +[dupsifter] number of individual reads processed: 2000 +[dupsifter] number of reads with both reads mapped: 1000 +[dupsifter] number of reads with only one read mapped to the forward strand: 0 +[dupsifter] number of reads with only one read mapped to the reverse strand: 0 +[dupsifter] number of reads with both reads marked as duplicates: 97 +[dupsifter] number of reads on the forward strand marked as duplicates: 0 +[dupsifter] number of reads on the reverse strand marked as duplicates: 0 +[dupsifter] number of individual primary-alignment reads: 2000 +[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 +[dupsifter] number of reads with no reads mapped: 0 +[dupsifter] number of reads with no primary reads: 0 +``` + +### gemBS +```default +$ samtools sort -n -o gembs.name_sorted.bam gembs.position_sorted.bam +$ +$ dupsifter \ +$ -o gembs.name_sorted.markdup.bam \ +$ -O gembs.dupsifter.stats \ +$ hg38_noContig.fa \ +$ gembs.name_sorted.bam +$ +$ cat gembs.dupsifter.stats +[dupsifter] processing mode: paired-end +[dupsifter] number of individual reads processed: 2000 +[dupsifter] number of reads with both reads mapped: 1000 +[dupsifter] number of reads with only one read mapped to the forward strand: 0 +[dupsifter] number of reads with only one read mapped to the reverse strand: 0 +[dupsifter] number of reads with both reads marked as duplicates: 100 +[dupsifter] number of reads on the forward strand marked as duplicates: 0 +[dupsifter] number of reads on the reverse strand marked as duplicates: 0 +[dupsifter] number of individual primary-alignment reads: 2000 +[dupsifter] number of individual secondary- and supplementary-alignment reads: 0 +[dupsifter] number of reads with no reads mapped: 0 +[dupsifter] number of reads with no primary reads: 0 +```