Pipes #91

Merged
merged 16 commits into from
Jun 11, 2024
19 changes: 18 additions & 1 deletion CHANGELOG.md
@@ -3,7 +3,24 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[1.2.2](https://github.com/sanger-tol/readmapping/releases/tag/1.2.2)] - Norwegian Ridgeback (patch 2) -[2024-05-23]
## [[1.3.0](https://github.com/sanger-tol/readmapping/releases/tag/1.3.0)] - Antipodean Opaleye - [2024-06-XX]

### Enhancements & fixes

- Combined steps to improve the efficiency of the pipeline, especially on large genomes
- "crumble" is now run on _every_ data type, not just PacBio

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.

| Dependency | Old version | New version |
| ---------- | ------------- | ------------- |
| `samtools` | 1.14 and 1.17 | 1.17 and 1.18 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.

## [[1.2.2](https://github.com/sanger-tol/readmapping/releases/tag/1.2.2)] - Norwegian Ridgeback (patch 2) - [2024-05-23]

### Enhancements & fixes

41 changes: 17 additions & 24 deletions conf/base.config
@@ -20,35 +20,28 @@ process {
memory = { check_max( 50.MB * task.attempt, 'memory' ) }
time = { check_max( 30.min * task.attempt, 'time' ) }

withName: 'SAMTOOLS_(CONVERT|FILTER)' {
withName: 'SAMTOOLS_(CONVERT)' {
time = { check_max( 1.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(FASTA)' {
time = { check_max( 2.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(STATS)' {
// Actually less than 1 hour for PacBio HiFi data, but confirmed 3 hours for Hi-C
time = { check_max( 4.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(COLLATE|FASTQ|FIXMATE|FLAGSTAT|MARKDUP|MERGE|SORT|VIEW)' {
withName: 'SAMTOOLS_(COLLATETOFASTA|FILTERTOFASTQ|FIXMATE|FLAGSTAT|MARKDUP|MERGE|VIEW)' {
time = { check_max( 8.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(FLAGSTAT|IDXSTATS)' {
memory = { check_max( 250.MB * task.attempt, 'memory' ) }
}

withName: '.*:ALIGN_(HIFI|HIC|ILLUMINA):.*:SAMTOOLS_(STATS|VIEW)' {
memory = { check_max( 1.GB * task.attempt, 'memory' ) }
}
withName: '.*:ALIGN_(CLR|ONT):.*:SAMTOOLS_(STATS|VIEW)' {
memory = { check_max( 2.GB * task.attempt, 'memory' ) }
withName: 'SAMTOOLS_(STATS|VIEW)' {
memory = { check_max( ((meta.datatype == "pacbio_clr" || meta.datatype == "ont") ? 2.GB : 1.GB) * task.attempt, 'memory' ) }
}

withName: '.*:FILTER_PACBIO:SAMTOOLS_COLLATE' {
withName: 'SAMTOOLS_COLLATETOFASTA' {
cpus = { log_increase_cpus(4, 2*task.attempt, 1, 2) }
memory = { check_max( 1.GB * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'memory' ) }
}
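
To make the consolidated `SAMTOOLS_(STATS|VIEW)` memory rule concrete, the following is a minimal Python sketch of the same arithmetic. It is illustrative only: the function name `stats_view_memory_gb` is hypothetical, and the `check_max` cap applied by the config is deliberately not modelled.

```python
def stats_view_memory_gb(datatype: str, attempt: int) -> int:
    """Mirror of the ternary in the config rule.

    CLR and ONT alignments get a 2 GB base, every other datatype 1 GB,
    and the base is multiplied by the retry attempt number.
    The check_max() ceiling from the pipeline is NOT applied here.
    """
    base_gb = 2 if datatype in ("pacbio_clr", "ont") else 1
    return base_gb * attempt


if __name__ == "__main__":
    # Print the request for the first three attempts of a few datatypes.
    for datatype in ("pacbio", "hic", "ont"):
        print(datatype, [stats_view_memory_gb(datatype, a) for a in (1, 2, 3)])
```

Keying the rule on `meta.datatype` rather than on the subworkflow path is what lets one `withName` selector replace the two path-based selectors.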
@@ -59,13 +52,6 @@ process {
time = { check_max( 2.h * Math.ceil( meta.read_count / 100000000 ) * task.attempt / log_increase_cpus(2, 6*task.attempt, 1, 2), 'time' ) }
}

withName: SAMTOOLS_SORT {
cpus = { log_increase_cpus(4, 2*task.attempt, 1, 2) }
// Memory increases by 768M for each thread
memory = { check_max( 1.GB + 800.MB * log_increase_cpus(4, 2*task.attempt, 1, 2), 'memory' ) }
time = { check_max( 8.hour * Math.ceil( meta.read_count / 1000000000 ) * task.attempt, 'time' ) }
}

withName: BLAST_BLASTN {
time = { check_max( 2.hour * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
memory = { check_max( 100.MB + 20.MB * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'memory' ) }
@@ -84,17 +70,24 @@ process {
// Runtime for 1 billion reads on 12 threads is a function of the logarithm of the genome size
// Runtime is considered proportional to the number of reads and inversely to number of threads
time = { check_max( 3.h * task.attempt * Math.ceil(positive_log(meta2.genome_size/100000, 10)) * Math.ceil(meta.read_count/1000000000) * 12 / log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2), 'time' ) }
// Base RAM usage is about 6 times the genome size. Each thread takes an additional 800 MB RAM
// Memory usage of SAMTOOLS_VIEW is negligible.
memory = { check_max( 6.GB * Math.ceil(meta2.genome_size / 1000000000) + 800.MB * task.attempt * log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2), 'memory' ) }
// Base RAM usage is about 6 times the genome size.
// Each thread takes an additional 800 MB RAM for bwa-mem2 and 800 MB for samtools sort
memory = { check_max( 8.GB + 6.GB * Math.ceil(meta2.genome_size / 1000000000) + 1600.MB * task.attempt * log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2), 'memory' ) }
}

withName: MINIMAP2_ALIGN {
withName: '.*:ALIGN_HIFI:MINIMAP2_ALIGN' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( (6.GB * Math.ceil( reference.size() / 1000000000 ) + 4.GB * Math.ceil( meta.read_count / 1000000 )) * task.attempt, 'memory' ) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 3.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

// Extrapolated from the HIFI settings on the basis of 1 ONT alignment. CLR assumed to behave the same way as ONT
withName: '.*:ALIGN_(CLR|ONT):MINIMAP2_ALIGN' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 30.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 1.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

withName: CRUMBLE {
// No correlation between memory usage and the number of reads or the genome size.
// Most genomes seem happy with 1 GB, then some with 2 GB, then some with 5 GB.
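
As a worked example of the new `MINIMAP2_ALIGN` memory formulas in this file, here is a hedged Python sketch. The function name `minimap2_memory_mb` is hypothetical. The thread count is passed in directly because `log_increase_cpus` is defined elsewhere in the pipeline and is not shown in this diff, and the `check_max` cap is omitted.

```python
import math

MB_PER_GB = 1024  # Nextflow memory units are binary: 1 GB = 1024 MB


def minimap2_memory_mb(genome_size: int, cpus: int, attempt: int, base_gb: int) -> int:
    """Per-thread overhead plus a genome-size term that grows as size^0.6.

    base_gb is 14 for the ALIGN_HIFI rule and 30 for the ALIGN_(CLR|ONT) rule.
    As in the config expression, only the genome-size term is multiplied by
    the attempt number; the 800 MB per-thread term is not.
    """
    per_thread_mb = 800 * cpus
    genome_units = math.ceil(math.pow(genome_size / 1_000_000_000, 0.6))
    return per_thread_mb + base_gb * MB_PER_GB * genome_units * attempt


if __name__ == "__main__":
    # A 3 Gbp genome on 4 threads, first attempt.
    print("HiFi   :", minimap2_memory_mb(3_000_000_000, 4, 1, base_gb=14), "MB")
    print("CLR/ONT:", minimap2_memory_mb(3_000_000_000, 4, 1, base_gb=30), "MB")
```

For a 3 Gbp genome the exponent gives 3^0.6 ≈ 1.93, which rounds up to 2 units, so the HiFi request is 4 × 800 MB + 2 × 14 GB before any cap.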
64 changes: 8 additions & 56 deletions conf/modules.config
@@ -23,18 +23,13 @@ process {
ext.args = { "-R ${meta.read_group}" }
}

withName: SAMTOOLS_SORT {
ext.prefix = { "${meta.id}.sort" }
}

withName: SAMTOOLS_MERGE {
ext.args = { "-c -p" }
ext.prefix = { "${meta.id}.merge" }
}

withName: SAMTOOLS_COLLATE {
withName: SAMTOOLS_COLLATETOFASTA {
ext.args = { (params.use_work_dir_as_temp ? "-T." : "") }
ext.prefix = { "${meta.id}.collate" }
}

withName: BLAST_BLASTN {
@@ -45,27 +40,21 @@ process {
ext.args = "-be '[rq]>=0.99' -x fi -x fp -x ri -x rp --write-index"
}

withName: SAMTOOLS_FILTER {
ext.prefix = { "${meta.id}.filter" }
}

withName: '.*:.*:ALIGN_HIFI:MINIMAP2_ALIGN' {
// minimap2 2.24 can only work with genomes up to 4 Gbp. For larger genomes, add the -I option with the genome size in Gbp.
// In fact, we can also use -I to *decrease* the memory requirements for smaller genomes
// NOTE: minimap2 uses the decimal system ! 1G = 1,000,000,000 bp
// NOTE: Math.ceil returns a double, but fortunately minimap2 accepts floating point values.
// NOTE: minimap2 2.25 raises the default to 8G, which means higher memory savings on smaller genomes
// NOTE: Use `reference.size()` for now, and switch to `meta2.genome_size` once we update the modules.
// ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group} -I" + Math.ceil(meta.genome_size/1e9) + 'G' }
ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group} -I" + Math.ceil(reference.size()/1e9) + 'G' }
withName: '.*:.*:ALIGN_HIFI:MINIMAP2_ALIGN' {
ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group} -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: '.*:.*:ALIGN_CLR:MINIMAP2_ALIGN' {
ext.args = { "-ax map-pb -R ${meta.read_group}" }
ext.args = { "-ax map-pb -R ${meta.read_group} -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: '.*:.*:ALIGN_ONT:MINIMAP2_ALIGN' {
ext.args = { "-ax map-ont -R ${meta.read_group}" }
ext.args = { "-ax map-ont -R ${meta.read_group} -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}
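
The `-I` value that all three `MINIMAP2_ALIGN` rules now append can be sketched in Python as below. This is an illustration under assumptions, not pipeline code: the helper name `minimap2_index_batch_arg` is hypothetical, and `float()` is used only to reproduce the `3.0`-style rendering that the diff's own comment attributes to Groovy's `Math.ceil` returning a double.

```python
import math


def minimap2_index_batch_arg(genome_size: int) -> str:
    """Build the -I option the way the Groovy closure concatenates it.

    The genome size is rounded up to whole decimal Gbp (1G = 1,000,000,000 bp).
    Python's math.ceil returns an int, so float() restores the trailing ".0"
    that the Groovy string concatenation would produce.
    """
    gbp = float(math.ceil(genome_size / 1e9))
    return f"-I{gbp}G"


if __name__ == "__main__":
    for size in (900_000_000, 2_500_000_000, 4_000_000_001):
        print(size, "->", minimap2_index_batch_arg(size))
```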

withName: '.*:CONVERT_STATS:SAMTOOLS_VIEW' {
@@ -87,12 +76,7 @@ process {

withName: CRUMBLE {
ext.prefix = { "${input.baseName}.crumble" }
ext.args = '-y pbccs -O cram'
publishDir = [
path: { "${params.outdir}/read_mapping/pacbio" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
ext.args = { (meta.datatype == "pacbio" ? "-y pbccs " : "") + "-O bam" }
}

withName: SAMPLESHEET_CHECK {
@@ -103,41 +87,9 @@ process {
]
}

withName: '.*:ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:.*' {
publishDir = [
path: { "${params.outdir}/read_mapping/hic" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:ALIGN_ILLUMINA:MARKDUP_STATS:CONVERT_STATS:.*' {
publishDir = [
path: { "${params.outdir}/read_mapping/illumina" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:ALIGN_HIFI:CONVERT_STATS:.*' {
publishDir = [
path: { "${params.outdir}/read_mapping/pacbio" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:ALIGN_CLR:CONVERT_STATS:.*' {
publishDir = [
path: { "${params.outdir}/read_mapping/pacbio" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*:ALIGN_ONT:CONVERT_STATS:.*' {
withName: '.*:CONVERT_STATS:SAMTOOLS_.*' {
publishDir = [
path: { "${params.outdir}/read_mapping/ont" },
path: { "${params.outdir}/read_mapping/${meta.datatype}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
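
The datatype-driven settings in this file (the `CRUMBLE` arguments and the single `publishDir` rule that replaces the five per-technology rules) can be illustrated with a short Python sketch. All three function names are hypothetical stand-ins for the Groovy closures, and `results` is just an example value for `params.outdir`.

```python
from typing import Optional


def crumble_args(datatype: str) -> str:
    """Add the pbccs preset only for the "pacbio" datatype; always request BAM."""
    return ("-y pbccs " if datatype == "pacbio" else "") + "-O bam"


def publish_path(outdir: str, datatype: str) -> str:
    """One rule for every technology: the datatype names the subdirectory."""
    return f"{outdir}/read_mapping/{datatype}"


def save_as(filename: str) -> Optional[str]:
    """Skip versions.yml; publish every other file under its own name."""
    return None if filename == "versions.yml" else filename


if __name__ == "__main__":
    for datatype in ("pacbio", "hic", "ont"):
        print(datatype, "|", crumble_args(datatype), "|", publish_path("results", datatype))
```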
23 changes: 4 additions & 19 deletions modules.json
@@ -23,8 +23,7 @@
"crumble": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"],
"patch": "modules/nf-core/crumble/crumble.diff"
"installed_by": ["modules"]
},
"custom/dumpsoftwareversions": {
"branch": "master",
@@ -38,24 +37,14 @@
},
"minimap2/align": {
"branch": "master",
"git_sha": "603ecbd9f45300c9788f197d2a15a005685b4220",
"installed_by": ["modules"]
},
"samtools/collate": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"git_sha": "efbf86bb487f288ac30660282709d9620dd6048e",
"installed_by": ["modules"]
},
"samtools/faidx": {
"branch": "master",
"git_sha": "fd742419940e01ba1c5ecb172c3e32ec840662fe",
"installed_by": ["modules"]
},
"samtools/fasta": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"samtools/fastq": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
@@ -74,12 +63,8 @@
"samtools/merge": {
"branch": "master",
"git_sha": "0460d316170f75f323111b4a2c0a2989f0c32013",
"installed_by": ["modules"]
},
"samtools/sort": {
"branch": "master",
"git_sha": "a0f7be95788366c1923171e358da7d049eb440f9",
"installed_by": ["modules"]
"installed_by": ["modules"],
"patch": "modules/nf-core/samtools/merge/samtools-merge.diff"
},
"samtools/stats": {
"branch": "master",
45 changes: 45 additions & 0 deletions modules/local/samtools_collatetofasta.nf
@@ -0,0 +1,45 @@
process SAMTOOLS_COLLATETOFASTA {
tag "$meta.id"
label 'process_medium'

conda "bioconda::samtools=1.17"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.17--h00cdaf9_0' :
'biocontainers/samtools:1.17--h00cdaf9_0' }"

input:
tuple val(meta), path(input)

output:
tuple val(meta), path("*.fasta"), emit: fasta
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''

def prefix = task.ext.prefix ?: "${meta.id}"
"""
samtools collate \\
$args \\
-O \\
-u \\
-T ${prefix}.collate \\
--threads $task.cpus \\
${input} \\
| \\
samtools fasta \\
$args2 \\
--threads $task.cpus \\
-0 ${prefix}.fasta \\
> /dev/null

cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
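
To show how the two former steps are combined into one pipe, here is a small Python sketch that only assembles the command line used in the script block above; it does not run samtools. The helper name `collate_to_fasta_command` and the example file names are hypothetical.

```python
import shlex


def collate_to_fasta_command(input_path: str, prefix: str, cpus: int,
                             args: str = "", args2: str = "") -> str:
    """Assemble the piped command from the process script as a single string.

    collate streams uncompressed BAM to stdout (-O -u), and fasta reads it
    from stdin and writes the reads to <prefix>.fasta via -0, so no
    intermediate collated file is kept.
    """
    collate = ["samtools", "collate", *shlex.split(args), "-O", "-u",
               "-T", f"{prefix}.collate", "--threads", str(cpus), input_path]
    fasta = ["samtools", "fasta", *shlex.split(args2),
             "--threads", str(cpus), "-0", f"{prefix}.fasta"]
    return f"{shlex.join(collate)} | {shlex.join(fasta)} > /dev/null"


if __name__ == "__main__":
    print(collate_to_fasta_command("reads.bam", "sample1", 4))
```

Keeping both tools in one process is what removes the intermediate file and the separate `SAMTOOLS_COLLATE` and `SAMTOOLS_FASTA` entries from the configuration.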
@@ -1,34 +1,42 @@
process SAMTOOLS_SORT {
process SAMTOOLS_FILTERTOFASTQ {
tag "$meta.id"
label 'process_medium'
label 'process_low'

conda "bioconda::samtools=1.17"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.17--h00cdaf9_0' :
'biocontainers/samtools:1.17--h00cdaf9_0' }"

input:
tuple val(meta), path(bam)
tuple val(meta), path(input), path(index)
path qname

output:
tuple val(meta), path("*.bam"), emit: bam
tuple val(meta), path("*.csi"), emit: csi, optional: true
path "versions.yml" , emit: versions
tuple val(meta), path("*.fastq.gz") , emit: fastq
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
if ("$bam" == "${prefix}.bam") error "Input and output names are the same, use \"task.ext.prefix\" to disambiguate!"
"""
samtools sort \\
samtools view \\
--threads $task.cpus \\
--qname-file ${qname} \\
--unoutput - \\
$args \\
-@ $task.cpus \\
-o ${prefix}.bam \\
-T $prefix \\
$bam
-o /dev/null \\
$input \\
| \\
samtools fastq \\
$args2 \\
--threads $task.cpus \\
-0 ${prefix}.fastq.gz \\
- \\
> /dev/null

cat <<-END_VERSIONS > versions.yml
"${task.process}":
@@ -39,7 +47,7 @@ process SAMTOOLS_SORT {
stub:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
touch ${prefix}.bam
echo | gzip > ${prefix}.fastq.gz

cat <<-END_VERSIONS > versions.yml
"${task.process}":