Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate BUSCO to nf-core module #730

Open
wants to merge 22 commits into
base: dev
Choose a base branch
from

Conversation

dialvarezs
Copy link
Contributor

@dialvarezs dialvarezs commented Dec 12, 2024

This PR migrates from the local BUSCO module to the nf-core one.

Main changes

  • Updates to latest BUSCO version (v5.8.2)
  • Uses csvtk/concat to merge BUSCO and UNTAR to prepare database instead of the custom modules
  • BUSCO now runs in batch mode, so it runs one process per sample instead of one per bin
  • In the current version, BUSCO only generates a single file, not one per domain and one specifiic, so the GTDBTk part that uses QC metrics to filter bins got simplified.
  • Removed the logic related to collect "failed bins", they were not being used.
  • Handle --save_busco_db with publishDir directly.
  • Replace COMBINE_TSV local module by nf-core csvtk/concat

Breaking changes

  • --busco_clean is not supported in this new setup, so the param should be deprecated

PR checklist

Closes #484.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@dialvarezs dialvarezs changed the title Migrate to nf-core BUSCO module (WIP) Migrate BUSCO to nf-core module Dec 12, 2024
@nf-core-bot
Copy link
Member

nf-core-bot commented Dec 12, 2024

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.1.0.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@dialvarezs
Copy link
Contributor Author

dialvarezs commented Dec 12, 2024

@jfy133 It's green now 🎉

Some questions:

  • How should I deprecate a param (busco_clean)?

  • About --save_busco_db, the current module doesn't have the db downloads as output, so I'm adding it here: Add busco downloads as output modules#7210. I will try with publishDir. I'll try using publishDir. Previously, the issue was that multiple BUSCO processes were publishing to the same directory, but now BUSCO runs per sample rather than per bin, which should result in far fewer simultaneous publish operations.

EDIT:

Regarding second point, I managed to get a pretty simple and elegant solution to store all the downloaded lineages across all the samples, basically puting each of the lineage directories as ouputs instead of the whole busco_downloads directory. So, the custom module for that is no longer necessary.

The PR should be ready merging nf-core/modules#7210.

@dialvarezs dialvarezs marked this pull request as ready for review December 12, 2024 19:55
@prototaxites
Copy link
Contributor

How should I deprecate a param (busco_clean)?

Is there scope to add this as an optional input to the BUSCO module itself? For some context, the problem I encountered with BUSCO was that the temporary working files take up a lot of storage space on disk, so running BUSCO on (in my case) 1000s of bins meant that my available scratch quota was quickly filling up and leading to mag runs that couldn't finish. So it would be good to try and keep this as an option if possible as I imagine this is a plausible issue for other mag users also!

@dialvarezs
Copy link
Contributor Author

Hi @prototaxites,
I don’t think there should be any issue with porting the "clean" part of the script to the nf-core module. I’ll go ahead and open a PR for it.
By the way, the temporary storage usage should be significantly reduced now, as it only runs one BUSCO instance per sample.

@prototaxites
Copy link
Contributor

By the way, the temporary storage usage should be significantly reduced now, as it only runs one BUSCO instance per sample.

If that's the case, maybe we don't need it?

@dialvarezs
Copy link
Contributor Author

The tests I ran for this PR look like this:

auto
image

auto_prok
image

How do these numbers look to you?
For 20 samples, I estimate we could trim ~150 GB in auto (and about half of that in auto_prok).

@dialvarezs
Copy link
Contributor Author

dialvarezs commented Jan 2, 2025

I made a last minor addition to this PR, I replaced the module used to combine bin qc summaries by a nf-core one (csvtk/concat).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants