Set `save_intermediate_files` to false by default #52

Faizal-Eeman · 2022-07-20T02:53:36Z

Description

save_intermediate_files is set to false by default so that intermediate files will not be saved when samples are analyzed in production.

Recommended in PR #44 in this issue comment.

Closes #50

Testing Results

DNA A-mini
- sample: A-mini S2.T-0
- input csv: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/input/tumor_control_pair_0.csv
- config: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-save_intermediate_files-default-to-false/config/A-mini-hg38.config
- output: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-save_intermediate_files-default-to-false/A-mini/call-sSV-3.0.0/S2.T-0/DELLY-1.0.3/

Checklist

I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the [github standards](https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=GitHub+Standards#GitHubStanda
rds-Branchprotectionrule) before opening this pull request.
I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
I have tested the pipeline on at least one DNA A-mini sample and one A-full sample. The paths to the test config files and output directories were attached above.

config/default.config

tyamaguchi-ucla · 2022-07-20T18:26:47Z

config/default.config

@@ -14,7 +14,7 @@ params {
    max_memory = SysHelper.getAvailMemory()

    cache_intermediate_pipeline_steps = false


Not sure if we need this argument - it goes to

pipeline-call-sSV/pipeline/config/methods.config

Line 187 in 4da6d8b

process.cache = params.cache_intermediate_pipeline_steps

But we also have L37 cache=true

Interesting point Taka!

As I understand from NextFlow documentation on process cache, this allows users to resume the pipeline in case a process fails and the process needs to continue after fixing the failure. By default cache = true.

But since we are also specifying the process.cache in methods.config by using cache_intermediate_pipeline_steps = false from default.config, shouldn't this technically introduce conflict as we are having two boolean values for the same process cache argument? One from L37 cache_intermediate_pipeline_steps=false and another from L37 cache=true? Unless the cache_intermediate_pipeline_steps=false over-rides the cache=true argument and we are setting the pipeline to not cache any process files.

So far the pipeline runs smoothly and when the pipeline fails because of an error, I haven't seen in the logs Nextflow's recommendation to resume the pipeline after fixing the error. So this means, process.cache is taken as false. I can comment cache=true and try if the pipeline runs and delivers results as expected.

Generally, for common variables like process.cache, there's a hierarchy of values, where for example the process-specific settings override the general process-level settings. In this case though, since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

Also, as a note, the resume option wouldn't work with the submission script since the working directory gets deleted upon node shutdown. It should work if testing interactively though.

Understood.

since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

@yashpatel6 So cache_intermediate_pipeline_steps=false overwrites cache=true? Then should we just leave these arguments as such or would you recommend to drop cache=true?

@yashpatel6 Does the retry function require cache = true? I thought that was the case but I'm not 100% sure. The cache is enabled by default (as in NF default) so I think we kept it in the template for development purposes.

The retry function doesn't require cache = true; the cache is used for when an entire pipeline is re-run. Nextflow basically checks if previously run processes completed successfully and uses those results rather than re-running processes that were identical in the previous run and finished successfully; it basically lets the pipeline pick up from the process that failed.

With retry, the pipeline doesn't stop running and just the specific process is re-launched, with updated configs/allocations if necessary. Technically, the results of previous processes are still available with retry because the pipeline hasn't actually stopped running.

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

Thanks so much @yashpatel6, very helpful!

@tyamaguchi-ucla So can we hold what to do with cache settings for now? and perhaps make changes after call-sSV 3.0.0 release?

Yup, let's discuss this at the NF WG. We can finish up #46 and then release a new version. We generally want to discuss changes that require a template update. Also, it might be helpful for you to understand how we developed the current config structure. @yashpatel6 maybe we can create a GH discussion pointing to some key PRs (template and call-FusionTranscript) for new developers?

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

I'm going through some old PRs so we can make a thread with some of the key decisions/discussions we had regarding pipelines, we can review some at the NF WG next week.

Thanks, Yash.

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

Yeah, it might be more dependent on the number of jobs. In either case, it's worth checking.

Closes #50 - save_intermediate_files set to false by default

4c6da77

Faizal-Eeman requested a review from tyamaguchi-ucla July 20, 2022 02:53

tyamaguchi-ucla reviewed Jul 20, 2022

View reviewed changes

tyamaguchi-ucla assigned tyamaguchi-ucla and yashpatel6 Jul 20, 2022

remove save_intermediate_files argument from default.config

928f696

Faizal-Eeman mentioned this pull request Jul 20, 2022

Remove cache=true in default.config #53

Open

Faizal-Eeman requested a review from yashpatel6 July 20, 2022 22:12

Update CHANGELOG.md

28d9f83

tyamaguchi-ucla approved these changes Jul 21, 2022

View reviewed changes

Faizal-Eeman merged commit b488e8c into main Jul 21, 2022

Faizal-Eeman deleted the mmootor-save_intermediate_files-default-to-false branch July 21, 2022 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set `save_intermediate_files` to false by default #52

Set `save_intermediate_files` to false by default #52

Faizal-Eeman commented Jul 20, 2022 •

edited

Loading

tyamaguchi-ucla Jul 20, 2022

Faizal-Eeman Jul 20, 2022 •

edited

Loading

Faizal-Eeman Jul 20, 2022

yashpatel6 Jul 20, 2022

Faizal-Eeman Jul 20, 2022

tyamaguchi-ucla Jul 21, 2022

Faizal-Eeman Jul 21, 2022

tyamaguchi-ucla Jul 21, 2022

yashpatel6 Jul 21, 2022

tyamaguchi-ucla Jul 21, 2022

		@@ -14,7 +14,7 @@ params {
		max_memory = SysHelper.getAvailMemory()

		cache_intermediate_pipeline_steps = false

Set save_intermediate_files to false by default #52

Set save_intermediate_files to false by default #52

Conversation

Faizal-Eeman commented Jul 20, 2022 • edited Loading

Description

Closes #50

Testing Results

Checklist

Choose a reason for hiding this comment

Faizal-Eeman Jul 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Set `save_intermediate_files` to false by default #52

Set `save_intermediate_files` to false by default #52

Faizal-Eeman commented Jul 20, 2022 •

edited

Loading

Faizal-Eeman Jul 20, 2022 •

edited

Loading