TheiaCoV_FASTA_Batch: TheiaCoV_FASTA, for many samples at once #238

Merged
merged 23 commits into from
Nov 22, 2023

@sage-wright sage-wright commented Nov 2, 2023

🛠️ Changes Being Made

This PR introduces a new workflow: TheiaCoV_FASTA_Batch.

Impacted Workflows/Tasks

No currently existing workflows and/or tasks will be impacted by this PR. The following new workflows/tasks have been added:

  • TheiaCoV_FASTA_Batch task: sm_theiacov_fasta_wrangling
  • TheiaCoV_FASTA_Batch workflow: theiacov_fasta_batch

The dockstore.yml file has also been edited to make this workflow available on Dockstore.

🧠 Context and Rationale

Occasionally, TheiaCoV_FASTA needs to be run on thousands of samples at once, which creates thousands of VMs that each run very short tasks. Because Pangolin4 and NextClade can both process multiple samples when given a concatenated assembly file, this workflow significantly reduces the number of VMs created and therefore the runtime cost.

Note: VADR is not yet implemented in this iteration.

📋 Workflow/Task Steps

The TheiaCoV_FASTA_Batch workflow runs the following steps:

  1. version capture
  2. concatenation of all of the assembly fastas
  3. if the organism is sars-cov-2, run Pangolin4 on the concatenated assemblies
  4. if the organism is either MPXV or sars-cov-2, run NextClade on the concatenated assemblies
  5. splitting the combined outputs into sample-level files and parsing them to extract the values for each sample. This step also uploads the results and file links to the sample-level table for every sample analyzed.
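
As a rough illustration of steps 2 and 5, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the function names `concatenate_fastas` and `split_report` are hypothetical, the inputs are in-memory strings rather than files, and it assumes each FASTA header is rewritten to the sample name so that rows in the combined report can be matched back through the `taxon` column.

```python
import csv
import io


def concatenate_fastas(samplenames, fasta_texts):
    """Join single-record FASTA texts into one multi-FASTA string.

    Assumption: each header is replaced by the sample name so rows in
    the combined reports can be matched back to samples.
    """
    records = []
    for name, text in zip(samplenames, fasta_texts):
        # Keep only sequence lines; drop the original header and blank lines.
        sequence = "".join(
            line.strip()
            for line in text.splitlines()
            if line.strip() and not line.startswith(">")
        )
        records.append(f">{name}\n{sequence}\n")
    return "".join(records)


def split_report(report_csv, key_column="taxon"):
    """Split a combined CSV report into one single-row CSV per sample."""
    reader = csv.DictReader(io.StringIO(report_csv))
    per_sample = {}
    for row in reader:
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=reader.fieldnames, lineterminator="\n"
        )
        writer.writeheader()
        writer.writerow(row)
        # Key each single-row report by the sample identifier.
        per_sample[row[key_column]] = buffer.getvalue()
    return per_sample
```

Each value returned by `split_report` keeps the original header row, so a per-sample file written from it can be read on its own.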

Inputs

Mandatory

  • Array[String] samplenames
  • Array[File] assembly_fastas
  • String seq_method (to mimic TheiaCoV_FASTA behavior)
  • String input_assembly_method (to mimic TheiaCoV_FASTA behavior)
  • String table_name (to enable upload of individual results)
  • String workspace_name (to enable upload of individual results)
  • String project_name (to enable upload of individual results)
  • String bucket_name (to enable upload of individual files so they can be accessed via a link in the Terra table)

Optional

  • String organism = "sars-cov-2" (future organism options planned)
  • String nextclade_dataset_reference = "MN908947"
  • String nextclade_dataset_tag = "2023-09-21T12:00:00Z"
  • String? nextclade_dataset_name

Outputs

These outputs are written to the set-level table:

  • String theiacov_fasta_batch_version
  • String theiacov_fasta_batch_analysis_date
  • File? pango_lineage_report
  • File? nextclade_json
  • File? nextclade_tsv
  • File datatable

The following outputs are written to the sample-level table via upload through the import_large_tsv.py script:

  • String seq_platform
  • String assembly_method
  • String theiacov_fasta_analysis_date
  • String theiacov_fasta_version
  • String nextclade_version
  • String nextclade_docker
  • String nextclade_ds_tag
  • String nextclade_clade
  • String nextclade_aa_subs
  • String nextclade_aa_dels
  • String nextclade_lineage
  • File nextclade_json
  • String pangolin_docker
  • String pangolin_version
  • String pango_lineage
  • String pango_lineage_expanded
  • String pangolin_conflicts
  • String pangolin_notes
  • File pangolin_lineage_report

Impacted Outputs

Although none of the original tasks have been adjusted, it is essential to check that the values TheiaCoV_FASTA_Batch writes to the sample-level table match the results of TheiaCoV_FASTA, to confirm that parsing and upload work correctly.
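
One way a reviewer might run this comparison is sketched below. It assumes both sample-level tables have already been loaded into dictionaries keyed by sample name; the function name `compare_sample_tables` and the example values are made up for illustration and are not part of this PR.

```python
def compare_sample_tables(single, batch, columns):
    """Return (sample, column, single_value, batch_value) for every mismatch.

    `single` and `batch` map sample name -> {column: value}. A sample or
    column missing from one table is reported with None on that side.
    """
    mismatches = []
    for sample in sorted(set(single) | set(batch)):
        for column in columns:
            left = single.get(sample, {}).get(column)
            right = batch.get(sample, {}).get(column)
            if left != right:
                mismatches.append((sample, column, left, right))
    return mismatches


# Made-up values: one clade disagreement and one sample present only in the batch table.
single_run = {"s1": {"pango_lineage": "BA.2", "nextclade_clade": "21L"}}
batch_run = {
    "s1": {"pango_lineage": "BA.2", "nextclade_clade": "21K"},
    "s2": {"pango_lineage": "XBB"},
}
differences = compare_sample_tables(
    single_run, batch_run, ["pango_lineage", "nextclade_clade"]
)
```

An empty list means the two workflows agree on every checked column for every sample.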

🧪 Testing

Locally

Local testing was not possible because of Terra permissions.

Terra

Please find a successful test here: https://app.terra.bio/#workspaces/cdc-terra-resources/Theiagen_Wright_SC2_Sandbox/job_history/82aa49d2-0dc7-46ec-86cb-8ec5278abaf8

Scenarios for Reviewer to Test

  • several samples with weird names (all numbers, many underscores, etc.)
  • a lot of samples
  • samples that do not produce Pangolin or NextClade results
  • organisms other than SARS-CoV-2

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

cimendes commented Nov 7, 2023

Documentation has been updated to include TheiaCoV_FASTA_Batch under the TheiaCoV workflow series: https://www.notion.so/theiagen/TheiaCoV-Genomic-Characterization-03b08e336b4645b6bb70746655d928f8

cimendes commented Nov 7, 2023

This workflow is compatible with monkeypox virus, and possibly other organisms, so it is important to add it to the suite of tests to be performed.

@frankambrosio3 frankambrosio3 left a comment

panglin_version typo fix


pangolin_version = pango_lineage_report.loc[pango_lineage_report["taxon"] == assembly_name]["pangolin_version"].item()
version = pango_lineage_report.loc[pango_lineage_report["taxon"] == assembly_name]["version"].item()
upload_table["panglin_version"] = "pangolin {}; {}".format(pangolin_version, version)

panglin_version -> pangolin_version
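
For reference, a small self-contained sketch of the lookup with the corrected key. The report rows are made up, `upload_table` is assumed here to be a plain dictionary, and the helper `format_pangolin_version` is hypothetical; only the column names and the format string come from the snippet under review.

```python
import pandas as pd

# Made-up report rows; column names follow the snippet under review.
pango_lineage_report = pd.DataFrame(
    {
        "taxon": ["sample_a", "sample_b"],
        "pangolin_version": ["4.3.1", "4.3.1"],
        "version": ["PUSHER-v1.23", "PUSHER-v1.23"],
    }
)


def format_pangolin_version(report, assembly_name):
    """Build the combined version string for one sample's report row."""
    row = report.loc[report["taxon"] == assembly_name]
    # Series.item() raises ValueError unless exactly one row matched.
    pangolin_version = row["pangolin_version"].item()
    version = row["version"].item()
    return "pangolin {}; {}".format(pangolin_version, version)


upload_table = {}  # assumption: a dict stands in for the real upload table
upload_table["pangolin_version"] = format_pangolin_version(
    pango_lineage_report, "sample_a"
)
```

Because `.item()` fails when no row matches, a sample that produced no Pangolin result would need separate handling, which is relevant to the reviewer scenarios listed in the PR description.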

@frankambrosio3 frankambrosio3 left a comment

@frankambrosio3 frankambrosio3 merged commit 358941a into main Nov 22, 2023
5 checks passed
@sage-wright sage-wright deleted the smw-theiacov-fasta-batch-dev branch November 28, 2023 19:49