Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ability to auto-create samplesheet for nf-core/phageannotator #543

Draft
wants to merge 22 commits into
base: dev
Choose a base branch
from

Conversation

CarsonJM
Copy link
Contributor

@CarsonJM CarsonJM commented Nov 20, 2023

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Nov 20, 2023

nf-core lint overall result: Failed ❌

Posted for pipeline commit 5b37df5

+| ✅ 201 tests passed       |+
#| ❔   2 tests were ignored |#
!| ❗   4 tests had warnings |!
-| ❌  10 tests failed       |-

❌ Test failures:

  • files_exist - File must be removed: lib/Utils.groovy
  • files_exist - File must be removed: lib/WorkflowMain.groovy
  • files_exist - File must be removed: lib/NfcoreTemplate.groovy
  • files_exist - File must be removed: lib/WorkflowMag.groovy
  • files_unchanged - .github/CONTRIBUTING.md does not match the template
  • files_unchanged - .github/PULL_REQUEST_TEMPLATE.md does not match the template
  • files_unchanged - .github/workflows/branch.yml does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - pyproject.toml does not match the template

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file [TODO: try and test using for --host_fasta and --host_genome]
  • pipeline_todos - TODO string in WorkflowMag.groovy: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • schema_lint - Input mimetype is missing or empty

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-04-30 20:21:42

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran out of time before baby came back so couldn't test, but first pass: very clean code thank you very much @CarsonJM !

Couple of things missing (docs atm):

  • Missing changelog update
  • Missing update to README
  • Missing input/output docs

nextflow_schema.json Outdated Show resolved Hide resolved
@CarsonJM CarsonJM self-assigned this Nov 30, 2023
@CarsonJM CarsonJM added the enhancement New feature or request label Nov 30, 2023
Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more lucid feedback, sorry didn't think of these before

nextflow.config Outdated Show resolved Hide resolved
subworkflows/local/samplesheet_creation.nf Outdated Show resolved Hide resolved
subworkflows/local/samplesheet_creation.nf Outdated Show resolved Hide resolved
subworkflows/local/samplesheet_creation.nf Outdated Show resolved Hide resolved
subworkflows/local/samplesheet_creation.nf Outdated Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
modules/local/mag_merge_samplesheet.nf Outdated Show resolved Hide resolved
@CarsonJM
Copy link
Contributor Author

@nf-core-bot fix linting

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking thoughts:

  • Q: I wonder if we should put this into an umbrella workflow rather than the if/else statement
  • Q: The test profile should ideally be test_generatesamplesheet rather than just test_samplesheet (someone might think it's to use an input samplesheet)
  • Q: Are you definitely sure people would always want to collapse all assemblies per sample (no cases they would want to keep it separate)

subworkflows/local/create_phageannotator_samplesheet.nf Outdated Show resolved Hide resolved
subworkflows/local/create_phageannotator_samplesheet.nf Outdated Show resolved Hide resolved
@CarsonJM
Copy link
Contributor Author

CarsonJM commented Feb 9, 2024

Regarding your questions:

  1. I think it would be a good idea to create an umbrella workflow for generating samplesheets for any downstream nf-core pipeline
  2. Great suggestion, I will make that change
  3. It is possible that they would want to keep them separate, so I can make that an option!

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK overall this looks good to me now - the only thing I don't like is the paths to work directories for the files - I strongly feel that it should be the relevant subdirectory of params.outdir + filename.

README.md Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
nextflow_schema.json Outdated Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
Co-authored-by: James A. Fellows Yates <[email protected]>
Comment on lines 106 to 112
// Create samplesheet for each sample using meta information
ch_mag_id_samplesheets = ch_mag_metadata.collectFile() { meta ->
[ "${meta.id}_phageannotator_samplesheet.csv", "sample,group,fastq_1,fastq_2,fasta" + '\n' + "${meta.id},${meta.group},${meta.fastq_1},${meta.fastq_2},${meta.fasta}" + '\n' ]
}

// Merge samplesheet across all samples for the pipeline
ch_mag_id_samplesheets.collectFile(name: "phageannotator_samplesheet.csv", keepHeader:true, skip:1, storeDir:"${params.outdir}/downstream_samplesheets/")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfy133 Just to clarify, are you thinking I should add the storeDir:"${params.outdir}/downstream_samplsheets/" to the first collectFile()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, more don't bind just fasta to the string on the line above.

params.outdir + <path I don't remember> + fasta.name or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the FastQ files, it's possible that the users don't specify to save them to outdir, so should part of the generate_downstream_samplesheet be to copy the files to outdir even if not specified elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even just validate that if you turn on samplesheet creation, then you must also save the fastqs.

But yes, one way or another - running from explicitly saved fsstqs in outdir is preferable.

If you have cleanup = true in a config, something many do but the user never sees, they won't have any fastq files in the work dir to point to in the sample sheet 😬. This is harder to control for rather than directing the user to make sure to save the fsstqs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me!

Comment on lines +110 to +111
file(meta.fastq_1.toUriString(), checkIfExists: true).copyTo("${params.outdir}/downstream_samplesheets/fastq/${meta.fastq_1.name}")
file(meta.fasta, checkIfExists: true).copyTo("${params.outdir}/downstream_samplesheets/fasta/${meta.fasta.name}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking for some advise about this, this is indeed very clean, but given it's outside a process/publishdir I'm a bit unsure. E.g. some people use symlinks from work to results dir rather than copy, so this would violate this.

Secondly this woudl potentially result in two copies of the same read files if one of the --save_* parameters are given.

It maybe we have to come up with some complicated logic that instead picks the right 'final' directory for the proceessing reads, and it's that that you append to the beginning, and then add.

I think this would come with two steps:

  1. Make sure one of --save_clipped_reads, --save_hostremoved_reads, --save_phixremoved_reads, --save_bbnorm_reads are selected if params.generate_samplesheet is true

    So some input validation code like:

    if ( params.generate_samplesheet && ![params.saved_clipped, save.hostremoved <...>].any() ) { Nextflow.error('[nf-core/mag] ERROR: must at least save one XYX if --generate_samplesheet <...>') }
    
  2. Then have to have some complicated logic to select which directory gets appended to the beginning of the file name (probably best in a case when but w/e), e.g.

    if (params.save_clipped_reads && !params.save_phix && !params.save_host_removed && !params._save_bbnorm) { 
    
samplehseet_reads_dir = /subdir/clipedreads/saved/in
   } else if ( params.save_phix && !params.save_host_removed && !params._save_bbnorm){
   samplesheet_reads_dir = /subdir/phixreads/saved/in
   }
 ```

Copy link
Contributor Author

@CarsonJM CarsonJM Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally see what you're saying and was curious what thoughts would be about doing this outside a process. I am more than happy change it to what you're suggesting!

The reason I was giving this approach a try was I was hoping to find a generalizable approach that would work across pipelines/updates even if the save input logic you're talking about gets modified (also, I selfishly wanted to find a quicker approach haha) Also, I was hoping to force a copy because in the past I've symlinked to publishDir thinking I copied and had my reads disappear because my workDir is in a scratch directory 😅.

Still, I definitely have though about what you're saying and will make that change if you think it's best!

@jfy133 jfy133 marked this pull request as draft December 11, 2024 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ready to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants