Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set save_intermediate_files to false by default #52

Merged
merged 3 commits into from
Jul 21, 2022

Conversation

Faizal-Eeman
Copy link
Contributor

@Faizal-Eeman Faizal-Eeman commented Jul 20, 2022

Description

save_intermediate_files is set to false by default so that intermediate files will not be saved when samples are analyzed in production.

Recommended in PR #44 in this issue comment.

Closes #50

Testing Results

  • DNA A-mini
    • sample: A-mini S2.T-0
    • input csv: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/input/tumor_control_pair_0.csv
    • config: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-save_intermediate_files-default-to-false/config/A-mini-hg38.config
    • output: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-save_intermediate_files-default-to-false/A-mini/call-sSV-3.0.0/S2.T-0/DELLY-1.0.3/

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the [github standards](https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=GitHub+Standards#GitHubStanda
    rds-Branchprotectionrule) before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one DNA A-mini sample and one A-full sample. The paths to the test config files and output directories were attached above.

config/default.config Outdated Show resolved Hide resolved
@@ -14,7 +14,7 @@ params {
max_memory = SysHelper.getAvailMemory()

cache_intermediate_pipeline_steps = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need this argument - it goes to

process.cache = params.cache_intermediate_pipeline_steps

But we also have L37 cache=true

Copy link
Contributor Author

@Faizal-Eeman Faizal-Eeman Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point Taka!

As I understand from NextFlow documentation on process cache, this allows users to resume the pipeline in case a process fails and the process needs to continue after fixing the failure. By default cache = true.

But since we are also specifying the process.cache in methods.config by using cache_intermediate_pipeline_steps = false from default.config, shouldn't this technically introduce conflict as we are having two boolean values for the same process cache argument? One from L37 cache_intermediate_pipeline_steps=false and another from L37 cache=true? Unless the cache_intermediate_pipeline_steps=false over-rides the cache=true argument and we are setting the pipeline to not cache any process files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the pipeline runs smoothly and when the pipeline fails because of an error, I haven't seen in the logs Nextflow's recommendation to resume the pipeline after fixing the error. So this means, process.cache is taken as false. I can comment cache=true and try if the pipeline runs and delivers results as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, for common variables like process.cache, there's a hierarchy of values, where for example the process-specific settings override the general process-level settings. In this case though, since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

Also, as a note, the resume option wouldn't work with the submission script since the working directory gets deleted upon node shutdown. It should work if testing interactively though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.

since it's the same directive in the same scope, the value just overwrite and the latest setting is the one that's kept.

@yashpatel6 So cache_intermediate_pipeline_steps=false overwrites cache=true? Then should we just leave these arguments as such or would you recommend to drop cache=true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashpatel6 Does the retry function require cache = true? I thought that was the case but I'm not 100% sure. The cache is enabled by default (as in NF default) so I think we kept it in the template for development purposes.

The retry function doesn't require cache = true; the cache is used for when an entire pipeline is re-run. Nextflow basically checks if previously run processes completed successfully and uses those results rather than re-running processes that were identical in the previous run and finished successfully; it basically lets the pipeline pick up from the process that failed.

With retry, the pipeline doesn't stop running and just the specific process is re-launched, with updated configs/allocations if necessary. Technically, the results of previous processes are still available with retry because the pipeline hasn't actually stopped running.

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @yashpatel6, very helpful!

@tyamaguchi-ucla So can we hold what to do with cache settings for now? and perhaps make changes after call-sSV 3.0.0 release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, let's discuss this at the NF WG. We can finish up #46 and then release a new version. We generally want to discuss changes that require a template update. Also, it might be helpful for you to understand how we developed the current config structure. @yashpatel6 maybe we can create a GH discussion pointing to some key PRs (template and call-FusionTranscript) for new developers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. We should post this to the NF WG repo GH discussions with some links (maybe I didn't google enough). So, that means we can just do cache = false as default for most cases, which might improve run time, etc for large samples like call-gSNP + large WGS N/T samples?

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

I'm going through some old PRs so we can make a thread with some of the key decisions/discussions we had regarding pipelines, we can review some at the NF WG next week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Yash.

We could; I'm not sure how big of a difference it would make since the caching is basically just saving some index keys but it's worth testing with a pipeline like call-gSNP for sure.

Yeah, it might be more dependent on the number of jobs. In either case, it's worth checking.

@Faizal-Eeman Faizal-Eeman merged commit b488e8c into main Jul 21, 2022
@Faizal-Eeman Faizal-Eeman deleted the mmootor-save_intermediate_files-default-to-false branch July 21, 2022 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

save_intermediate_files should be false as default and remove unnecessary variables from config files
3 participants