[Retrieve_SRR_Metadata] New wf to retrieve SRR after Terra2NCBI wf #668

Merged
merged 28 commits into from
Nov 26, 2024

fraser-combe
Contributor

@fraser-combe fraser-combe commented Nov 4, 2024

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR introduces a new workflow, Retrieve_SRR_Metadata, designed to retrieve the SRA accession (SRR) associated with a given sample accession (such as a BioSample ID or SRA Experiment ID). Documentation has been added to guide users on accepted input types and workflow usage.

⚡ Impacted Workflows/Tasks

Adds a new Retrieve_SRR_Metadata workflow and task, with a clarified input description for sample_accession that specifies the accepted accession types (BioSample ID or SRA Experiment ID). The documentation has been updated to reflect these changes.

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

  • Added Retrieve_SRR_Metadata workflow.
  • Introduced fetch_srr_metadata task within this workflow to retrieve SRR metadata.
  • Enhanced documentation for clear input requirements.

⚙️ Algorithm

No algorithm or processing changes; this is a new workflow. It uses the fastq-dl Docker image.

➡️ Inputs

sample_accession: New input added, accepting BioSample ID or SRA Experiment ID for metadata retrieval.

⬅️ Outputs

srr_accession: New output that stores the retrieved SRR accession.
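
As a rough illustration of the two accepted input types, the sketch below classifies an accession string by its prefix. The patterns are an assumption based on the usual INSDC accession prefixes (SAMN/SAMEA/SAMD for BioSamples, SRX/ERX/DRX for SRA experiments) and are not taken from the task's code; `classify_accession` is a hypothetical helper name, and the example IDs are made up.

```python
import re
from typing import Optional

# Assumed, approximate accession patterns (not taken from the PR's code):
#   BioSample:      SAMN..., SAMEA..., SAMD...
#   SRA experiment: SRX..., ERX..., DRX...
_PATTERNS = {
    "biosample": re.compile(r"^SAM[NED][A-Z]?\d+$"),
    "sra_experiment": re.compile(r"^[SED]RX\d+$"),
}


def classify_accession(sample_accession: str) -> Optional[str]:
    """Return the accession type, or None if the string matches neither pattern."""
    candidate = sample_accession.strip().upper()
    for kind, pattern in _PATTERNS.items():
        if pattern.match(candidate):
            return kind
    return None


# Illustrative, made-up accession strings.
for example in ("SAMN12345678", "SRX1234567", "not_an_id"):
    print(example, "->", classify_accession(example))
```

A format check like this only says whether the string looks like an accession; it does not confirm that the record exists at NCBI.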

🧪 Testing

Verified with BioSample accession numbers.
Successful with a list of BioSample accessions: the workflow retrieved the correct SRR, checked against previous manual curation of SRRs (column SRR) after a Terra2NCBI run.
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/933438ad-65ee-413a-ab87-101b08751297
see table output - c_auris_mv (10)

Testing incorrect ID fails
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/04b443bb-10ee-48bc-b3c2-4006cbbbcf5b

Suggested Scenarios for Reviewer to Test

Valid BioSample ID: Test the workflow with a BioSample ID to confirm SRR metadata retrieval.
Valid SRA Experiment ID (optional): Test with an SRA Experiment ID and ensure correct SRR retrieval.
Invalid Accession Input: Provide an invalid ID format to verify that the workflow fails gracefully.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@fraser-combe fraser-combe marked this pull request as ready for review November 5, 2024 20:20
@fraser-combe fraser-combe requested a review from a team as a code owner November 5, 2024 20:20
@Michal-Babins Michal-Babins self-assigned this Nov 6, 2024
@sage-wright sage-wright closed this Nov 7, 2024
@sage-wright sage-wright deleted the fc-return-srr-wf-dev branch November 7, 2024 20:18
@fraser-combe fraser-combe restored the fc-return-srr-wf-dev branch November 7, 2024 20:27
@fraser-combe fraser-combe reopened this Nov 7, 2024
@cimendes
Member

cimendes commented Nov 8, 2024

In the documentation, the new page needs to be added to:

  • workflows_alphabetically.md
  • workflows_kingdom.md
  • workflows_type.md
    The full guide is here!

@fraser-combe fraser-combe requested a review from cimendes November 8, 2024 21:55
@sage-wright
Member

Can you standardize the name of this workflow? It's referred to as many things throughout the code and documentation.

Member

@sage-wright sage-wright left a comment

Some clean up required

workflows/utilities/data_import/wf_update_srr_metadata.wdl (outdated, resolved)
tasks/utilities/data_handling/task_fetch_srr_metadata.wdl (outdated, resolved)
tasks/utilities/data_handling/task_fetch_srr_metadata.wdl (outdated, resolved)
tasks/utilities/data_handling/task_fetch_srr_metadata.wdl (outdated, resolved)
tasks/utilities/data_handling/task_fetch_srr_metadata.wdl (outdated, resolved)
docs/workflows_overview/workflows_kingdom.md (outdated, resolved)
@Michal-Babins
Contributor

I tested with c_auris_nv using the data table IDs, and all runs succeeded:
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/d4c83b91-b38d-4904-a7c0-8ac68ccdf0de

The expected behavior for this was to fail.

@Michal-Babins
Contributor

I tested with c_auris_nv using the data table IDs, and all runs succeeded: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/d4c83b91-b38d-4904-a7c0-8ac68ccdf0de

The expected behavior for this was to fail.

The error is captured in the stderr of the execution directory. It might be worth adding a pipefail setting such as set -eou pipefail, so that the exit status is that of the last command that failed rather than just the last command in the pipeline.

@sage-wright
Member

definitely agree with michal to add a set -euo pipefail to prevent silent errors
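
For readers unfamiliar with the setting, here is a minimal sketch of the silent-failure behavior being discussed. It uses Python's `subprocess` to run two bash one-liners; the pipeline `false | cat` is only an illustrative stand-in for a failing command piped into a succeeding one, not the task's actual command, and it assumes `bash` is available on the PATH.

```python
import subprocess


def run_bash(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script]).returncode


# Without pipefail, the pipeline's status is that of its last command (cat),
# so the failure of `false` goes unnoticed.
status_without = run_bash("false | cat")

# With pipefail, the pipeline fails if any command in it fails, and -e makes
# the script exit at that point.
status_with = run_bash("set -euo pipefail; false | cat")

print("without pipefail:", status_without)  # 0
print("with pipefail:", status_with)        # 1
```

In a WDL task, the equivalent would be placing the `set` line at the top of the command block so that a failed step marks the task as failed.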

@Michal-Babins
Contributor

I re-ran the same data that had succeeded when the expected behavior was failure. After the update that sets pipefail, I am now getting the correct failures:
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/8539b680-e9c1-49bd-9f14-16c94c8cd264

@sage-wright
Member

sage-wright commented Nov 20, 2024

wait a second, i think i misunderstood. i think we do want it to succeed, as the column in the table ends up saying no srr accession identified?

not sure what i'd prefer, an actual failure or a success with a message saying it doesn't work. @kapsakcj or @theiadeb, what would be preferred by our users?

@fraser-combe
Contributor Author

fraser-combe commented Nov 20, 2024

The way I designed the workflow, it should fail if the input ID/string is invalid (not a valid BioSample or SRA ID), as this indicates an error that needs user attention. However, if the input is valid (a proper BioSample or SRA ID), the workflow should succeed and output 'no SRR accession found' if no SRR is identified. I'm open to suggestions if people feel it would work better a different way.
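
The intended behavior can be summarized in a short sketch. This is not the task's actual code: `resolve_srr_output` is a hypothetical helper, the message string is a placeholder for whatever the task writes, and joining multiple SRRs with commas is an assumption about how multiple hits might be reported.

```python
from typing import Sequence

# Placeholder wording; the exact string written by the task may differ.
NO_SRR_MESSAGE = "No SRR accession found"


def resolve_srr_output(accession_is_valid: bool, srr_hits: Sequence[str]) -> str:
    """Map a lookup result to the workflow outcome described above."""
    if not accession_is_valid:
        # Invalid input: fail loudly so the user notices the bad ID.
        raise ValueError("sample_accession is not a valid BioSample or SRA ID")
    if not srr_hits:
        # Valid input with no associated run: succeed with a message.
        return NO_SRR_MESSAGE
    # Assumption: multiple runs are reported as a comma-separated list,
    # de-duplicated while keeping their original order.
    return ",".join(dict.fromkeys(srr_hits))


print(resolve_srr_output(True, ["SRR0000001"]))
print(resolve_srr_output(True, []))
```

Raising an exception here stands in for the task exiting with a non-zero status, which is what makes the workflow fail.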

@fraser-combe
Contributor Author

Ran on 2 additional samples with no SRR; both produced the expected result string "no SRR accession found": https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/063ea71c-7001-4482-b715-db05ff26de2e

Member

@sage-wright sage-wright left a comment

The changes to the logic regarding valid and invalid SRR accessions are not correct; please fix that, and then also address the other small non-functional changes I've requested.

@fraser-combe
Contributor Author

@Michal-Babins
Contributor

Valid BioSample IDs succeed: 10 samples with an SRR, 2 samples with "no SRR accession", and 1 sample with multiple SRRs handled. https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/6a892fef-efd8-4044-9503-083f9f8e5b44
An invalid ID fails the workflow as expected. https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/9f486b7f-36db-4454-9838-5718b37a7042

I am able to confirm this behavior.
All failures found here:
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/6340cb04-d5c0-4d83-b8fb-5242859bfad8

All successes are found here, with 'no SRR accession' for 2 samples and 1 sample with multiple SRRs:

https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/5abe7c72-95ac-4ee6-a3e3-18c8b2a947ab

@sage-wright
Member

Code changes are working as we want. Once the docs are updated, I'll merge! ⭐

Member

@sage-wright sage-wright left a comment

⭐ this will be so useful, well done!

@sage-wright sage-wright merged commit b4aad55 into main Nov 26, 2024
5 checks passed
@fraser-combe fraser-combe deleted the fc-return-srr-wf-dev branch November 26, 2024 15:47
4 participants