Resource optimisation #82

Merged
merged 50 commits into from
Dec 18, 2023
2109bed
Collect the read counts from the input files
muffato Oct 31, 2023
e2e8ead
Collect the size of the genome (file size is a good proxy)
muffato Oct 31, 2023
11678fc
Simply use the whole base name to name the channel
muffato Oct 31, 2023
84caa10
Deal with genomes > 4 Gbp
muffato Oct 31, 2023
8dca0f3
Update the read count for SAMTOOLS_MERGE
muffato Nov 2, 2023
8c67daf
Some values are large and need 64 bits
muffato Nov 2, 2023
e432d89
Wrong information
muffato Nov 2, 2023
ae6942b
Updated the error codes to the latest template versions. Covers all L…
muffato Nov 2, 2023
46e9910
Estimate the resource requirements based on the size of the inputs
muffato Nov 2, 2023
5f2d786
bugfix: minimap2 uses the decimal system and understands floating poi…
muffato Nov 6, 2023
35a646d
The output of SAMTOOLS_MERGE is sorted
muffato Nov 6, 2023
04b2660
Logically, SAMTOOLS_MERGE should happen in the calling sub-workflow
muffato Nov 6, 2023
f9b2763
Skip SAMTOOLS_MERGE if there is a single file
muffato Nov 6, 2023
fb1cd6d
Explain why there is no SAMTOOLS_SORT
muffato Nov 6, 2023
c00eee6
Replaced the markduplicate workflow by a single module / bash pipeline
muffato Nov 6, 2023
5c39de5
The runtime of samtools sort depends on the number of reads
muffato Nov 6, 2023
ff9ff03
Updated requirements for SAMTOOLS_SORMADUP
muffato Nov 6, 2023
d54d46f
The MINIMAP2_ALIGN includes SAMTOOLS_SORT. Need some extra memory rel…
muffato Nov 6, 2023
582bedd
In my latest tests, it seems BWAMEM2_MEM memory usage is correlated w…
muffato Nov 6, 2023
d884d47
I don't need samtools cat
muffato Nov 6, 2023
a581ae1
Alignment is nice
muffato Nov 6, 2023
d141493
Also use -I to decrease the memory requirement of MINIMAP2_ALIGN
muffato Nov 8, 2023
613c4cb
20 minutes seems a bit too close to the real usage. We may want a lar…
muffato Nov 8, 2023
bc844bc
Don't increase the number of CPUs too high as there are diminishing r…
muffato Nov 8, 2023
ec695e6
Some samtools commands have a fixed memory usage per thread, so inclu…
muffato Nov 8, 2023
d8043a5
There was supposed to be a +1 there, to provide the "* task.attempt" …
muffato Nov 9, 2023
647a748
Introduced a helper method that clearly shows how the logarithm is mo…
muffato Nov 9, 2023
d0b305a
Increasing the number of attempts to give more resilience
muffato Nov 9, 2023
2465d74
New formula that is less greedy
muffato Nov 9, 2023
a9ec952
Adjusted resource requirements for SAMTOOLS_SORMADUP
muffato Nov 9, 2023
a249a76
Added a note about minimap2
muffato Nov 10, 2023
99d289d
Added some credit
muffato Nov 10, 2023
9afa763
More consistent comment
muffato Nov 10, 2023
eae3261
Tell it's a meta map
muffato Nov 10, 2023
f211fb4
Indentation should be a multiple of 4
muffato Nov 10, 2023
5b70604
SAMTOOLS_FLAGSTAT may take more time
muffato Nov 10, 2023
0c5e4f9
typo
muffato Nov 16, 2023
c27dd30
Increased runtime, just in case
muffato Nov 16, 2023
a2924b0
Updated runtime and memory requirements
muffato Nov 21, 2023
b2f22a2
Need these fields in the debug output
muffato Nov 27, 2023
cfac7c8
Like in the genome note pipeline, use the work directory instead of /tmp
muffato Nov 27, 2023
9da9f3f
Updated the settings for SORMADUP
muffato Nov 29, 2023
5e15589
Updated the BWA_MEM memory requirement
muffato Nov 29, 2023
bdbcbb4
Increased the number of retries
muffato Nov 29, 2023
3905219
The default memory settings work just fine and make things easier to …
muffato Dec 5, 2023
c15d923
Usage is very close to the trend line. Smaller bins work fine
muffato Dec 5, 2023
4094724
quay.io/ is now the default
muffato Dec 7, 2023
3ef0477
Added optimised settings for crumble
muffato Dec 8, 2023
f0c1ba4
Need to fake REF_PATH to force crumble to use the Fasta file defined …
muffato Dec 8, 2023
79bcbbc
There is a difference for ONT, which I assume would be there for CLR too
muffato Dec 9, 2023
157 changes: 114 additions & 43 deletions conf/base.config
@@ -2,64 +2,135 @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sanger-tol/readmapping Nextflow base config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A 'blank slate' config file, appropriate for general use on most high performance
compute environments. Assumes that all software is installed and available on
the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

process {
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Increasing the number of CPUs often gives diminishing returns, so we increase it
following a logarithm curve. Example:
- 0 < value <= 1: start + step
- 1 < value <= 2: start + 2*step
- 2 < value <= 4: start + 3*step
- 4 < value <= 8: start + 4*step
In order to support re-runs, the step increase may be multiplied by the attempt
number prior to calling this function.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
// Modified logarithm function that doesn't return negative numbers
def positive_log(value, base) {
if (value <= 1) {
return 0
} else {
return Math.log(value)/Math.log(base)
}
}

def log_increase_cpus(start, step, value, base) {
return check_max(start + step * (1 + Math.ceil(positive_log(value, base))), 'cpus')
}
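
The table in the comment block above can be checked outside Nextflow with a small standalone Groovy script. This is only a sketch: `check_max` here is a hypothetical stand-in that caps at an arbitrary 64 CPUs (the pipeline's real `check_max` is defined elsewhere and is not shown in this diff), and the input values are picked from inside each interval purely for illustration.

```groovy
// Hypothetical stand-in for the pipeline's check_max(): caps the value at `limit`.
// ASSUMPTION: the real function is defined elsewhere and uses the pipeline's own limits.
def check_max(value, int limit = 64) {
    return Math.min(value as double, limit as double) as int
}

// Same logic as positive_log() in conf/base.config
def positive_log(value, base) {
    return value <= 1 ? 0 : Math.log(value) / Math.log(base)
}

// Same logic as log_increase_cpus() in conf/base.config (resource name dropped in the stub)
def log_increase_cpus(start, step, value, base) {
    return check_max(start + step * (1 + Math.ceil(positive_log(value, base))))
}

// Intervals from the comment block, with start = 4, step = 2, base = 2
assert log_increase_cpus(4, 2, 0.5, 2) == 6     // 0 < value <= 1: start + step
assert log_increase_cpus(4, 2, 1.5, 2) == 8     // 1 < value <= 2: start + 2*step
assert log_increase_cpus(4, 2, 3,   2) == 10    // 2 < value <= 4: start + 3*step
assert log_increase_cpus(4, 2, 7,   2) == 12    // 4 < value <= 8: start + 4*step

// BWAMEM2_MEM settings on the first attempt (start = 6, step = 6)
assert log_increase_cpus(6, 6, 0.2, 2) == 12    // minimum: 12 threads
assert log_increase_cpus(6, 6, 3,   2) == 24    // 3 billion reads: 24 threads
```

Because the step is multiplied by `task.attempt` before the call, each retry steepens the curve rather than shifting it.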

errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries = 1

process {

errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 5
maxErrors = '-1'

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
// In this configuration file, we give few resources by default and
// explicitly bump them up for some processes.
// All rules should still increase resources every attempt to allow the
// pipeline to self-heal from MEMLIMIT/RUNLIMIT.

// Default
cpus = 1
memory = { check_max( 50.MB * task.attempt, 'memory' ) }
time = { check_max( 30.min * task.attempt, 'time' ) }

withName: 'SAMTOOLS_(CONVERT|FILTER)' {
time = { check_max( 1.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(FASTA)' {
time = { check_max( 2.hour * task.attempt, 'time' ) }
}

withName: 'SAMTOOLS_(STATS)' {
// Actually less than 1 hour for PacBio HiFi data, but confirmed 3 hours for Hi-C
time = { check_max( 4.hour * task.attempt, 'time' ) }
}
withLabel:process_low {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 12.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }

withName: 'SAMTOOLS_(COLLATE|FASTQ|FIXMATE|FLAGSTAT|MARKDUP|MERGE|SORT|VIEW)' {
time = { check_max( 8.hour * task.attempt, 'time' ) }
}
withLabel:process_medium {
cpus = { check_max( 6 * task.attempt, 'cpus' ) }
memory = { check_max( 36.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }

withName: 'SAMTOOLS_(FLAGSTAT|IDXSTATS)' {
memory = { check_max( 250.MB * task.attempt, 'memory' ) }
}
withLabel:process_high {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 72.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }

withName: '.*:ALIGN_(HIFI|HIC|ILLUMINA):.*:SAMTOOLS_(STATS|VIEW)' {
memory = { check_max( 1.GB * task.attempt, 'memory' ) }
}
withLabel:process_long {
time = { check_max( 20.h * task.attempt, 'time' ) }
withName: '.*:ALIGN_(CLR|ONT):.*:SAMTOOLS_(STATS|VIEW)' {
memory = { check_max( 2.GB * task.attempt, 'memory' ) }
}
withLabel:process_high_memory {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }

withName: '.*:FILTER_PACBIO:SAMTOOLS_COLLATE' {
cpus = { log_increase_cpus(4, 2*task.attempt, 1, 2) }
memory = { check_max( 1.GB * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'memory' ) }
}
withLabel:error_ignore {
errorStrategy = 'ignore'

withName: 'SAMTOOLS_SORMADUP' {
cpus = { log_increase_cpus(2, 6*task.attempt, 1, 2) }
memory = { check_max( 10.GB + 0.6.GB * Math.ceil( meta.read_count / 100000000 ) * task.attempt, 'memory' ) }
time = { check_max( 2.h * Math.ceil( meta.read_count / 100000000 ) * task.attempt / log_increase_cpus(2, 6*task.attempt, 1, 2), 'time' ) }
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2

withName: SAMTOOLS_SORT {
cpus = { log_increase_cpus(4, 2*task.attempt, 1, 2) }
// samtools sort uses up to 768 MiB per thread by default; budget 800 MB per thread
memory = { check_max( 1.GB + 800.MB * log_increase_cpus(4, 2*task.attempt, 1, 2), 'memory' ) }
time = { check_max( 8.hour * Math.ceil( meta.read_count / 1000000000 ) * task.attempt, 'time' ) }
}
withName:BWAMEM2_INDEX {
memory = { check_max( 1.GB * Math.ceil( 28 * fasta.size() / 1000000000 ) * task.attempt, 'memory' ) }

withName: BLAST_BLASTN {
time = { check_max( 2.hour * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
memory = { check_max( 100.MB + 20.MB * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'memory' ) }
// The tool never seems to use more than 1 core even when given multiple. Sticking to 1 (the default)
}

withName: BWAMEM2_INDEX {
memory = { check_max( 24.GB * Math.ceil( meta.genome_size / 1000000000 ) * task.attempt, 'memory' ) }
time = { check_max( 30.min * Math.ceil( meta.genome_size / 1000000000 ) * task.attempt, 'time' ) }
// Not multithreaded
}

withName: BWAMEM2_MEM {
// Corresponds to 12 threads as the minimum, 24 threads if 3 billion reads
cpus = { log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2) }
// Runtime for 1 billion reads on 12 threads is a function of the logarithm of the genome size
// Runtime is considered proportional to the number of reads and inversely to number of threads
time = { check_max( 3.h * task.attempt * Math.ceil(positive_log(meta2.genome_size/100000, 10)) * Math.ceil(meta.read_count/1000000000) * 12 / log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2), 'time' ) }
// Base RAM usage is about 6 times the genome size. Each thread takes an additional 800 MB RAM
// Memory usage of SAMTOOLS_VIEW is negligible.
memory = { check_max( 6.GB * Math.ceil(meta2.genome_size / 1000000000) + 800.MB * log_increase_cpus(6, 6*task.attempt, meta.read_count/1000000000, 2), 'memory' ) }
}

withName: MINIMAP2_ALIGN {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( (6.GB * Math.ceil( reference.size() / 1000000000 ) + 4.GB * Math.ceil( meta.read_count / 1000000 )) * task.attempt, 'memory' ) }
time = { check_max( 3.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

withName: CRUMBLE {
// No correlation between memory usage and the number of reads or the genome size.
// Most genomes seem happy with 1 GB, then some with 2 GB, then some with 5 GB.
// The formula below tries to mimic that growth and relies on job retries being allowed.
memory = { check_max( task.attempt * (task.attempt + 1) * 512.MB, 'memory' ) }
// Slightly better correlation between runtime and the number of reads.
time = { check_max( 1.5.h + 1.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}
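
As a rough illustration of how the CRUMBLE memory request grows across retries, the formula can be evaluated in plain Groovy. This is a sketch under assumptions: sizes are plain integers in MB rather than Nextflow memory units, `crumbleMemoryMb` is a made-up helper name, and the `check_max` cap is ignored.

```groovy
// Memory (in MB) requested by the CRUMBLE rule for a given attempt number,
// mirroring: task.attempt * (task.attempt + 1) * 512.MB
// ASSUMPTION: crumbleMemoryMb is a hypothetical name used only for this sketch.
def crumbleMemoryMb(int attempt) {
    return attempt * (attempt + 1) * 512
}

// maxRetries = 5 allows up to six attempts in total
def perAttempt = (1..6).collect { crumbleMemoryMb(it) }

// 1 GB, 3 GB, 6 GB, 10 GB, 15 GB, 21 GB
assert perAttempt == [1024, 3072, 6144, 10240, 15360, 21504]
```

The first three attempts request 1 GB, 3 GB and 6 GB, which covers the 1 GB, 2 GB and 5 GB groups mentioned in the comment with a little headroom.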

withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
19 changes: 9 additions & 10 deletions conf/modules.config
@@ -33,18 +33,10 @@ process {
}

withName: SAMTOOLS_COLLATE {
ext.args = { (params.use_work_dir_as_temp ? "-T." : "") }
ext.prefix = { "${meta.id}.collate" }
}

withName: SAMTOOLS_FIXMATE {
ext.args = '-m'
ext.prefix = { "${meta.id}.fixmate" }
}

withName: SAMTOOLS_MARKDUP {
ext.prefix = { "${meta.id}.markdup" }
}

withName: BLAST_BLASTN {
ext.args = '-task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue .01 -searchsp 1750000000000 -outfmt 6'
}
@@ -58,7 +50,14 @@ process {
}

withName: '.*:.*:ALIGN_HIFI:MINIMAP2_ALIGN' {
ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group}" }
// minimap2 2.24 can only work with genomes up to 4 Gbp. For larger genomes, add the -I option with the genome size in Gbp.
// In fact, we can also use -I to *decrease* the memory requirements for smaller genomes
// NOTE: minimap2 uses the decimal system: 1G = 1,000,000,000 bp
// NOTE: Math.ceil returns a double, but fortunately minimap2 accepts floating point values.
// NOTE: minimap2 2.25 raises the default to 8G, which means higher memory savings on smaller genomes
// NOTE: Use `reference.size()` for now, and switch to `meta2.genome_size` once we update the modules.
// ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group} -I" + Math.ceil(meta.genome_size/1e9) + 'G' }
ext.args = { "-ax map-hifi --cs=short -R ${meta.read_group} -I" + Math.ceil(reference.size()/1e9) + 'G' }
}
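
To make the `-I` argument concrete, here is a minimal Groovy sketch of the string that the closure above produces. The helper name `indexBatchOption` and the byte counts are assumptions for illustration only; `referenceBytes` stands in for `reference.size()`.

```groovy
// Builds the -I option the same way as the ext.args closure above.
// ASSUMPTION: indexBatchOption is a hypothetical helper; the sizes below are made up.
def indexBatchOption(long referenceBytes) {
    // Math.ceil returns a double, so the option carries a trailing ".0"
    return "-I" + Math.ceil(referenceBytes / 1e9) + "G"
}

assert indexBatchOption(900_000_000L)   == "-I1.0G"   // small genome: lower than the default
assert indexBatchOption(2_400_000_000L) == "-I3.0G"
assert indexBatchOption(4_100_000_000L) == "-I5.0G"   // above the 4 Gbp default of minimap2 2.24
```

The `.0` suffix is harmless here because, as the note above says, minimap2 accepts floating point values for this option.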

withName: '.*:.*:ALIGN_CLR:MINIMAP2_ALIGN' {
2 changes: 1 addition & 1 deletion docs/output.md
@@ -34,7 +34,7 @@ PacBio reads generated using both CLR and CCS technology are filtered using `BLA

### Short reads

Short read data from HiC and Illumina technologies is aligned with `BWAMEM2_MEM`. The sorted and merged alignment files are processed using the `SAMTOOLS` markduplicate workflow. The mark duplicate alignments is output in the CRAM format, along with the index.
Short read data from HiC and Illumina technologies is aligned with `BWAMEM2_MEM`. The sorted and merged alignment files are processed using the `SAMTOOLS` [mark-duplicate workflow](https://www.htslib.org/algorithms/duplicate.html#workflow). The duplicate-marked alignments are output in the CRAM format, along with the index.

<details markdown="1">
<summary>Output files</summary>
10 changes: 0 additions & 10 deletions modules.json
@@ -61,11 +61,6 @@
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"samtools/fixmate": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"samtools/flagstat": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
@@ -76,11 +71,6 @@
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"samtools/markdup": {
"branch": "master",
"git_sha": "9e51255c4f8ec69fb6ccf68593392835f14fecb8",
"installed_by": ["modules"]
},
"samtools/merge": {
"branch": "master",
"git_sha": "0460d316170f75f323111b4a2c0a2989f0c32013",
2 changes: 1 addition & 1 deletion modules/local/pacbio_filter.nf
@@ -5,7 +5,7 @@ process PACBIO_FILTER {
conda "conda-forge::gawk=5.1.0"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/gawk:5.1.0' :
'quay.io/biocontainers/gawk:5.1.0' }"
'biocontainers/gawk:5.1.0' }"

input:
tuple val(meta), path(txt)
2 changes: 1 addition & 1 deletion modules/local/samplesheet_check.nf
@@ -5,7 +5,7 @@ process SAMPLESHEET_CHECK {
conda "conda-forge::python=3.8.3"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/python:3.8.3' :
'quay.io/biocontainers/python:3.8.3' }"
'biocontainers/python:3.8.3' }"

input:
path samplesheet
77 changes: 77 additions & 0 deletions modules/local/samtools_sormadup.nf
@@ -0,0 +1,77 @@
// Copied from https://github.com/nf-core/modules/pull/3310
// Author: Matthias De Smet, https://github.com/matthdsm
process SAMTOOLS_SORMADUP {
tag "$meta.id"
label 'process_medium'

conda "bioconda::samtools=1.17"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.17--h00cdaf9_0' :
'biocontainers/samtools:1.17--h00cdaf9_0' }"

input:
tuple val(meta), path(input)
tuple val(meta2), path(fasta)

output:
tuple val(meta), path("*.{bam,cram}") , emit: bam
tuple val(meta), path("*.{bai,crai}") , optional:true, emit: bam_index
tuple val(meta), path("*.metrics") , emit: metrics
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''
def args3 = task.ext.args3 ?: ''
def args4 = task.ext.args4 ?: ''

def prefix = task.ext.prefix ?: "${meta.id}"
def extension = args.contains("--output-fmt sam") ? "sam" :
args.contains("--output-fmt bam") ? "bam" :
args.contains("--output-fmt cram") ? "cram" :
"bam"
def reference = fasta ? "--reference ${fasta}" : ""

"""
samtools collate \\
$args \\
-O \\
-u \\
-T ${prefix}.collate \\
--threads $task.cpus \\
${reference} \\
${input} \\
- \\
| \\
samtools fixmate \\
$args2 \\
-m \\
-u \\
--threads $task.cpus \\
- \\
- \\
| \\
samtools sort \\
$args3 \\
-u \\
-T ${prefix}.sort \\
--threads $task.cpus \\
- \\
| \\
samtools markdup \\
-T ${prefix}.markdup \\
-f ${prefix}.metrics \\
--threads $task.cpus \\
$args4 \\
- \\
${prefix}.${extension}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
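
The output extension in the module above is chosen by plain substring matching on `args`. The following Groovy sketch isolates that logic so its behaviour can be checked; `pickExtension` is a hypothetical name, and the sample argument strings are assumptions.

```groovy
// Same nested ternary as in SAMTOOLS_SORMADUP, wrapped in a function for testing.
// ASSUMPTION: pickExtension is a hypothetical name used only in this sketch.
def pickExtension(String args) {
    return args.contains("--output-fmt sam")  ? "sam"  :
           args.contains("--output-fmt bam")  ? "bam"  :
           args.contains("--output-fmt cram") ? "cram" :
                                                "bam"
}

assert pickExtension("") == "bam"                       // default when nothing is specified
assert pickExtension("--output-fmt cram") == "cram"
assert pickExtension("-l 0 --output-fmt sam") == "sam"
assert pickExtension("--output-fmt=cram") == "bam"      // the '=' form is not matched by the substring test
```

Note that the check inspects `args`, which the script passes to `samtools collate`, while the final file is written by `samtools markdup` with `args4`, so the two may need to be kept consistent when overriding the output format.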
2 changes: 1 addition & 1 deletion modules/local/unmask.nf
@@ -5,7 +5,7 @@ process UNMASK {
conda "conda-forge::gawk=5.1.0"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/gawk:5.1.0' :
'quay.io/biocontainers/gawk:5.1.0' }"
'biocontainers/gawk:5.1.0' }"

input:
tuple val(meta), path(fasta)
9 changes: 8 additions & 1 deletion modules/nf-core/crumble/crumble.diff

Some generated files are not rendered by default.

3 changes: 3 additions & 0 deletions modules/nf-core/crumble/main.nf

Some generated files are not rendered by default.