-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set save_intermediate_files
to false by default
#52
Set save_intermediate_files
to false by default
#52
Conversation
@@ -14,7 +14,7 @@ params { | |||
max_memory = SysHelper.getAvailMemory() | |||
|
|||
cache_intermediate_pipeline_steps = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need this argument - it goes to
process.cache = params.cache_intermediate_pipeline_steps |
But we also have L37 cache=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point Taka!
As I understand from NextFlow documentation on process cache, this allows users to resume
the pipeline in case a process fails and the process needs to continue after fixing the failure. By default cache = true
.
But since we are also specifying the process.cache
in methods.config
by using cache_intermediate_pipeline_steps = false
from default.config
, shouldn't this technically introduce conflict as we are having two boolean values for the same process cache argument? One from L37 cache_intermediate_pipeline_steps=false
and another from L37 cache=true
? Unless the cache_intermediate_pipeline_steps=false
over-rides the cache=true
argument and we are setting the pipeline to not cache any process files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far the pipeline runs smoothly and when the pipeline fails because of an error, I haven't seen in the logs Nextflow's recommendation to resume the pipeline after fixing the error. So this means, process.cache
is taken as false
. I can comment cache=true
and try if the pipeline runs and delivers results as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, for common variables like process.cache
, there's a hierarchy of values, where for example the process-specific settings override the general process
-level settings. In this case though, since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.
Also, as a note, the resume
option wouldn't work with the submission script since the working directory gets deleted upon node shutdown. It should work if testing interactively though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood.
since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.
@yashpatel6 So cache_intermediate_pipeline_steps=false
overwrites cache=true
? Then should we just leave these arguments as such or would you recommend to drop cache=true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yashpatel6 Does the retry function require
cache = true
? I thought that was the case but I'm not 100% sure. The cache is enabled by default (as in NF default) so I think we kept it in the template for development purposes.The retry function doesn't require
cache = true
; the cache is used for when an entire pipeline is re-run. Nextflow basically checks if previously run processes completed successfully and uses those results rather than re-running processes that were identical in the previous run and finished successfully; it basically lets the pipeline pick up from the process that failed.With retry, the pipeline doesn't stop running and just the specific process is re-launched, with updated configs/allocations if necessary. Technically, the results of previous processes are still available with retry because the pipeline hasn't actually stopped running.
This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false
as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much @yashpatel6, very helpful!
@tyamaguchi-ucla So can we hold what to do with cache settings for now? and perhaps make changes after call-sSV 3.0.0 release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, let's discuss this at the NF WG. We can finish up #46 and then release a new version. We generally want to discuss changes that require a template update. Also, it might be helpful for you to understand how we developed the current config structure. @yashpatel6 maybe we can create a GH discussion pointing to some key PRs (template and call-FusionTranscript) for new developers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do
cache = false
as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?
We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.
I'm going through some old PRs so we can make a thread with some of the key decisions/discussions we had regarding pipelines, we can review some at the NF WG next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Yash.
We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.
Yeah, it might be more dependent on the number of jobs. In either case, it's worth checking.
Description
save_intermediate_files
is set to false by default so that intermediate files will not be saved when samples are analyzed in production.Recommended in PR #44 in this issue comment.
Closes #50
Testing Results
Checklist
I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the [github standards](https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=GitHub+Standards#GitHubStanda
rds-Branchprotectionrule) before opening this pull request.
I have added my name to the contributors listings in the
manifest
block in thenextflow.config
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the
CHANGELOG.md
under the next release version or unreleased, and updated the date.I have updated the version number in the
metadata.yaml
andmanifest
block of thenextflow.config
file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)I have tested the pipeline on at least one DNA A-mini sample and one A-full sample. The paths to the test config files and output directories were attached above.