Resource optimisation #92
Conversation
…ne [not tested !]
The default multiplication rule is too greedy
Now that I've generated all the charts, I realise that some resource requirements are actually too low! I should be asking for 150 MB for MultiQC, not 50 MB. I guess it worked because the jobs finish too quickly for MEMLIMIT to have time to kill them. Some COOLER_ZOOMIFY processes also take more than the 12 GB I'm requesting; those processes run for 10 min, so I would have expected MEMLIMIT to kick in? Anyway, I'll sort all of that out in another commit.
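For illustration, the fix could be a one-line bump in the resource configuration (a sketch only; the values come from the comment above, but the selector and file location are assumed, not taken from the actual commit):

```groovy
// Sketch: raise the flat memory request for MultiQC from 50 MB to 150 MB,
// e.g. in conf/base.config (exact location assumed).
process {
    withName: 'MULTIQC' {
        memory = 150.MB
    }
}
```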
All changes look good to me; the Groovy code that replaces the `GrabFiles` process is a nice improvement.
I've made the few changes I mentioned in #92: just adjusted some requirements up / down. I reran the pipeline on all species and it worked fine. I also merged the …

@BethYates : this PR just needs an approval and then I can merge it.
I merged #90 by accident, so I'm opening a new PR.
Closes #16, #18, #20, #91
Like in sanger-tol/readmapping#82, the goal is to stop using the `process_*` labels and instead optimise the resource requests of every process. I'm using the same dataset: 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each.
I found much less correlation than in the read-mapping pipeline. The only input size that I found useful was the genome size, now collected at the beginning of the pipeline and added to the `meta` map. There is some correlation between the number of Hi-C reads and some process runtimes, but not memory usage. Since runtime estimates don't need to be very accurate (really, only the normal/long/week queue choice matters), I don't even pull that input size.

I am using helper functions to grow values (like the number of CPUs) in a logarithmic fashion. In effect, this limits the increase in the number of CPUs, especially as the advantage of multi-threading tends to decrease with a higher number of threads.
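A minimal sketch of such a helper (illustrative names and constants, not the pipeline's actual code), assuming the genome size is expressed in megabases:

```groovy
// Grow a request by `step` for every doubling of `size`, so a 4x larger
// genome adds only 2*step instead of quadrupling the request.
def log_increase = { double base, double step, double size ->
    (int) Math.ceil(base + step * (Math.log(size) / Math.log(2d)))
}

// e.g. CPU counts for genomes of 1 Gbp and 4 Gbp (sizes in Mbp):
println log_increase(2, 0.5, 1000)   // -> 7
println log_increase(2, 0.5, 4000)   // -> 8

// In the pipeline, `size` would come from the genome size stored in the
// meta map at the start of the run, e.g. meta.genome_size (assumed name).
```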
Also:

- Replaced the `GrabFiles` process with some Groovy magic, as per https://community.seqera.io/t/is-it-bad-practice-to-try-and-pluck-files-from-an-element-in-a-channel-that-is-a-directory-with-channel-manipulation/224/2 . This saves 1 LSF job (see the sketch after this list).
- Tweaked the `GNU_SORT` parameters to fix "Leftover files in /tmp from the `sort` commands" #91.

In this PR, the new resource requirements make every process succeed at the first attempt. The formulas are the lowest reasonably-legible fits I could find.
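For reference, the channel-manipulation trick from the linked thread looks roughly like this (a sketch; the channel and file names are assumed):

```groovy
// Instead of a GrabFiles process whose only job is to copy files out of a
// directory, pluck them from the directory element with plain operators.
// No extra process means no extra LSF job.
ch_results
    .map { meta, dir -> [ meta, file("${dir}/*.pretext") ] }
    .set { ch_grabbed }
```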
Detailed charts showing the memory/CPU/time used/requested for every process: before (PDF), after (PDF)
If we want to tolerate processes failing at the first attempt and being resubmitted once or twice before completing, I'm sure some requirements could be lowered even more. We would have to make sure that the resources wasted on those first attempts don't outweigh the savings made on other processes. Something to investigate later...
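If we went down that route, the standard Nextflow retry-escalation pattern would apply (a generic sketch, not part of this PR; values are illustrative):

```groovy
// Start with a lower request and grow it on each resubmission after a
// MEMLIMIT kill.
process {
    withName: 'COOLER_ZOOMIFY' {
        errorStrategy = 'retry'
        maxRetries    = 2
        memory        = { 6.GB * task.attempt }   // 6 GB, 12 GB, then 18 GB
    }
}
```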
PR checklist

- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] `docs/usage.md` is updated.
- [ ] `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).