Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheiaCoV/TheiaProk/TheiaMeta/TheiaEuk/Freyja_FASTQ] fastq-scan updates & improvements. Adding JSON as wf output file #662

Merged
merged 19 commits into from
Nov 8, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Oct 28, 2024

Will update this message later

This PR closes #471 and closes #571

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR updates 2 frequently used tasks (PE and SE versions of the task) tasks/quality_control/basic_statistics/task_fastq_scan.wdl and all workflows that use this task to run fastq-scan

I mainly did this PR so that the fastq-scan JSON output files are output at the workflow level, but improved other aspects along the way

⚡ Impacted Workflows/Tasks

  • tasks/quality_control/basic_statistics/task_fastq_scan.wdl
    • added set -euo pipefail in case an error is thrown
    • reduced cpu to 1 as fastq-scan likely cannot utilize multiple threads (AFAIK)
    • reduced disk_size to 50 (don't need much space unless FASTQs are huge)
    • enabled preemptible VMs to reduce cost since it doesn't take very long to run (usually less than 4 minutes)
    • upgraded to fastq-scan v1.0.1 from v0.4.4
    • copied biocontainer docker image to the Theiagen GAR so we don't rely upon quay-hosted biocontainer docker image
    • removed DATE variable and pipeline_date output string as it's unnecessary
    • improved and cleaned up logging and what is printed to STDOUT so it's more clear to user and support team
    • renamed output variable name from read1_fastq_scan_report to read1_fastq_scan_json
  • workflows with 2 or 4 new JSON output files from fastq-scan, depending on paired end or single end data
    • workflows/freyja/wf_freyja_fastq.wdl
    • workflows/theiacov/wf_theiacov_clearlabs.wdl
    • workflows/theiacov/wf_theiacov_illumina_pe.wdl
    • workflows/theiacov/wf_theiacov_illumina_se.wdl
    • workflows/theiaeuk/wf_theiaeuk_illumina_pe.wdl
    • workflows/theiameta/wf_theiameta_illumina_pe.wdl
    • workflows/theiaprok/wf_theiaprok_illumina_pe.wdl
    • workflows/theiaprok/wf_theiaprok_illumina_se.wdl
    • workflows/utilities/wf_read_QC_trim_pe.wdl
    • workflows/utilities/wf_read_QC_trim_se.wdl
  • Other minor and somewhat unrelated changes:
    • updated CI conda environment to not include defaults channel due to Anaconda changing their ToS so you have to pay if you use that channel when your organization includes 200+ employees. AFAIK the CI environment does not use any packages from that channel, but better to be safe than sorry. CI runs totally fine without the channel.
    • workflows/utilities/wf_read_QC_trim_ont.wdl (no new output files, just corrected the meta: description section)'
    • fixed a bug in export_taxon_tables task that was causing silent error, described here: Prevent Silent Errors #666 (comment)

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

upgraded to v1.0.1 of fastq-scan which has support for "large" FASTQs. Not entirely sure what that means other than a slight change in the cpp code, but it's more robust

➡️ Inputs

listed above. Inputs that changed are all runtime related

⬅️ Outputs

new JSON outputs for read1 and read2, both raw FASTQs and "cleaned" FASTQs to all subworkflows (read_QC subwfs) and workflows that use this task

🧪 Testing

  • Freyja_FASTQ
    • terra wf test
    • ✅ WF ran successfully, 4 new JSON outputs appear in data table.
  • theiacov_clearlabs
    • terra wf test
    • ✅ WF ran successfully, 2 new JSON outputs appear in data table.
  • theiacov_illumina_pe
    • terra wf test
    • ✅ WF ran successfully, 4 new JSON outputs appear in data table.
  • theiacov_illumina_se
    • terra wf test
    • ✅ WF ran successfully, 2 new JSON outputs appear in data table.
  • theiameta_illumina_pe
    • terra wf test
    • ✅ WF ran successfully, 4 new JSON outputs appear in data table.
  • theiaprok_illumina_pe with export_taxon_tables enabled
    • terra wf test
    • ✅ WF ran successfully, 4 new JSON outputs appear in data table. Additionally, these 2 outputs showed up in the exported taxon table (ecoli_specimen and salmonella_specimen data tables)
  • theiaprok_illumina_se with export_taxon_tables enabled
    • terra wf test
    • ✅ WF ran successfully, 2 new JSON outputs appear in data table. Additionally, these 2 outputs showed up in the exported taxon table (ecoli_specimen_se and salmonella_specimen_se data tables)
  • TheiaEuk_illumina_pe
    • terra wf test
    • ✅ WF ran successfully, 4 new JSON outputs appear in data table.

Suggested Scenarios for Reviewer to Test

Would be good to test any of the impacted workflows. No need to test export_taxon_table functionality as I've already done so for both TheiaProk wfs.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

…educe disk to 50gb and cpu to 1; added set -euo pipefail; removed capture of date; added debug statements to cleanup STDOUT/logs; removed unnecessary cat commands with parsing output JSON; renamed 2 output files; enabled preemptible VM usage
…; reduced disk to 50gb and cpu to 1; added set -euo pipefail; added DEBUG statements and cleaned up STDOUT for clear log files; renamed outputs to mention json; removed collection and output of DATE; enabled preemptible VMs
…l/setup CI env. hopefully that doesn't break everything
@@ -2,7 +2,6 @@ name: pytest-env-CI
channels:
- conda-forge
- bioconda
- defaults
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have more than 200 people in our organization, but most folks are moving away from using the defaults channel due to Anaconda updating their ToS: https://www.anaconda.com/blog/is-conda-free

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking up on this & making the quick fix!

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Oct 29, 2024

reminder to add the new outputs to theiameta wf as well

…tq-scan JSON outputs to theiameta_illumina_pe wf; updated read_qc_trim_ont subworkflow description since it was inaccurate
@kapsakcj kapsakcj changed the title [TheiaCoV & TheiaProk] fastq-scan updates & improvements. Adding JSON as wf output file [TheiaCoV/TheiaProk/TheiaMeta/Freyja_FASTQ] fastq-scan updates & improvements. Adding JSON as wf output file Oct 30, 2024
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Oct 30, 2024

reminder to add fastq-scan to export_taxon_tables in theiaprok SE wf

@kapsakcj kapsakcj changed the title [TheiaCoV/TheiaProk/TheiaMeta/Freyja_FASTQ] fastq-scan updates & improvements. Adding JSON as wf output file [TheiaCoV/TheiaProk/TheiaMeta/TheiaEuk/Freyja_FASTQ] fastq-scan updates & improvements. Adding JSON as wf output file Nov 4, 2024
@kapsakcj kapsakcj mentioned this pull request Nov 4, 2024
10 tasks
@kapsakcj kapsakcj marked this pull request as ready for review November 4, 2024 21:04
@kapsakcj kapsakcj requested a review from a team as a code owner November 4, 2024 21:04
@AndrewLangvt
Copy link
Contributor

@kapsakcj I know this is still in draft form, but what do you think about outputting the fastq-scan JSONs as an array? That way they only add a single column to Terra table (instead of up to 4), but also allows you to easily access them for support troubleshooting?

@kapsakcj kapsakcj marked this pull request as draft November 4, 2024 21:12
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Nov 4, 2024

moving back to to draft state since CI is failing for some unknown reason and since we're considering changing output to an array instead

@AndrewLangvt
Copy link
Contributor

After some consideration with the team, we will leave these as separate outputs, not an Array. We will engage the "output consolidation" aspect when we implement TheiaCoV/TheiaProk/etc. "light" versions.

@kapsakcj kapsakcj marked this pull request as ready for review November 6, 2024 16:41
@sage-wright sage-wright closed this Nov 7, 2024
@sage-wright sage-wright deleted the cjk-fastq-scan-json-outputs branch November 7, 2024 20:18
@kapsakcj kapsakcj restored the cjk-fastq-scan-json-outputs branch November 7, 2024 20:26
@kapsakcj kapsakcj reopened this Nov 7, 2024
Copy link
Contributor

@AndrewLangvt AndrewLangvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -2,7 +2,6 @@ name: pytest-env-CI
channels:
- conda-forge
- bioconda
- defaults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking up on this & making the quick fix!

@AndrewLangvt AndrewLangvt merged commit 2fd9f75 into main Nov 8, 2024
25 checks passed
@AndrewLangvt AndrewLangvt deleted the cjk-fastq-scan-json-outputs branch November 8, 2024 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants