TheiaCoV_FASTA_Batch: TheiaCoV_FASTA, for many samples at once #238
…e splitting and upload to google bucket
Documentation has been updated to include TheiaCoV_FASTA_Batch under the TheiaCoV workflow series: https://www.notion.so/theiagen/TheiaCoV-Genomic-Characterization-03b08e336b4645b6bb70746655d928f8
This workflow is compatible with monkeypox virus, if not more organisms, so it is important to add it to the suite of tests to be performed.
Testing on all viral species here: https://app.terra.bio/#workspaces/theiagen-validations/ambrosio_validation_sandbox/job_history/7a330f52-ca1a-4011-8018-0355fd7f638a
panglin_version typo fix
pangolin_version = pango_lineage_report.loc[pango_lineage_report["taxon"] == assembly_name]["pangolin_version"].item()
version = pango_lineage_report.loc[pango_lineage_report["taxon"] == assembly_name]["version"].item()
upload_table["panglin_version"] = "pangolin {}; {}".format(pangolin_version, version)
panglin_version -> pangolin_version
⭐
🛠️ Changes Being Made
This PR introduces a new workflow: TheiaCoV_FASTA_Batch.
Impacted Workflows/Tasks
No currently existing workflows and/or tasks will be impacted by this PR. The following new workflows/tasks have been added:
sm_theiacov_fasta_wrangling
theiacov_fasta_batch
The dockstore.yml file has also been edited to enable this workflow's access on Dockstore.
🧠 Context and Rationale
Occasionally, there is a need to run TheiaCoV_FASTA on thousands of samples at once. This creates thousands of VMs, each running very short tasks. This workflow significantly reduces the number of VMs because Pangolin4 and NextClade can both process multiple samples when given a concatenated assembly file, which reduces runtime costs.
Note: VADR is not yet implemented in this iteration.
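As a rough illustration of the concatenation idea described above, the sketch below merges per-sample FASTA files into one multi-FASTA and renames each header to its sample name so results can be matched back to samples later. The function name `concatenate_assemblies` is hypothetical, and the sketch assumes each input file holds a single record; it is not the code used in this PR.

```python
from pathlib import Path


def concatenate_assemblies(samplenames, assembly_fastas, out_path):
    """Write one multi-FASTA whose headers are the sample names.

    Hypothetical helper. Assumes one record per input FASTA file.
    """
    if len(samplenames) != len(assembly_fastas):
        raise ValueError("samplenames and assembly_fastas must have the same length")
    with open(out_path, "w") as out:
        for name, fasta in zip(samplenames, assembly_fastas):
            lines = Path(fasta).read_text().splitlines()
            # Drop the original header line and join wrapped sequence lines.
            sequence = "".join(
                line.strip() for line in lines if not line.startswith(">")
            )
            out.write(">{}\n{}\n".format(name, sequence))
    return out_path
```

Renaming the headers means the `taxon` column in the Pangolin report lines up with the entries in `samplenames`, which is what the per-sample lookup by `assembly_name` in the reviewed code relies on.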
📋 Workflow/Task Steps
The TheiaCoV_FASTA_Batch workflow runs the following steps
Inputs
Mandatory
Array[String] samplenames
Array[File] assembly_fastas
String seq_method (to mimic TheiaCoV_FASTA behavior)
String input_assembly_method (to mimic TheiaCoV_FASTA behavior)
String table_name (to enable upload of individual results)
String workspace_name (to enable upload of individual results)
String project_name (to enable upload of individual results)
String bucket_name (to enable upload of individual files so they can be accessed via a link in the Terra table)
Optional
String organism = "sars-cov-2" (future organism options planned)
String nextclade_dataset_reference = "MN908947"
String nextclade_dataset_tag = "2023-09-21T12:00:00Z"
String? nextclade_dataset_name
Outputs
These outputs are written to the set level table:
String theiacov_fasta_batch_version
String theiacov_fasta_batch_analysis_date
File? pango_lineage_report
File? nextclade_json
File? nextclade_tsv
File datatable
The following outputs are written to the sample level table via upload through the import_large_tsv.py script:
String seq_platform
String assembly_method
String theiacov_fasta_analysis_date
String theiacov_fasta_version
String nextclade_version
String nextclade_docker
String nextclade_ds_tag
String nextclade_clade
String nextclade_aa_subs
String nextclade_aa_dels
String nextclade_lineage
File nextclade_json
String pangolin_docker
String pangolin_version
String pango_lineage
String pango_lineage_expanded
String pangolin_conflicts
String pangolin_notes
File pangolin_lineage_report
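Because Pangolin runs once on the concatenated file, the combined report has to be split back into per-sample values before upload. A minimal sketch of that split using only the standard library is shown here; the helper name `rows_by_taxon` is hypothetical, the report is trimmed to a few columns, and the values are toy data rather than real results.

```python
import csv
import io


def rows_by_taxon(report_text):
    """Index a combined Pangolin CSV report by its `taxon` column.

    Hypothetical helper; returns {taxon: {column: value}}.
    """
    reader = csv.DictReader(io.StringIO(report_text))
    return {row["taxon"]: row for row in reader}


# Toy report text, for illustration only.
report = (
    "taxon,lineage,pangolin_version,version\n"
    "sample_a,BA.5.2,4.3.1,PUSHER-v1.23\n"
    "sample_b,XBB.1.5,4.3.1,PUSHER-v1.23\n"
)
per_sample = rows_by_taxon(report)
pangolin_version = "pangolin {}; {}".format(
    per_sample["sample_a"]["pangolin_version"],
    per_sample["sample_a"]["version"],
)
```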
Impacted Outputs
Although none of the original tasks have been adjusted, it is essential to check that the results of TheiaCoV_FASTA match the results of TheiaCoV_FASTA_Batch for the values written to the sample level table, to ensure proper parsing and upload.
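One way a reviewer might approach this check is to export both sample level tables and compare the shared columns sample by sample. The sketch below assumes both tables have already been loaded as dictionaries keyed by sample name; `compare_sample_rows` is a hypothetical helper and the values are toy data.

```python
def compare_sample_rows(single_rows, batch_rows, columns):
    """List every (sample, column, single_value, batch_value) that differs.

    Both inputs map sample name -> {column: value}. A sample or column that
    is missing on one side is reported with None for that side.
    """
    mismatches = []
    for sample in sorted(set(single_rows) | set(batch_rows)):
        single = single_rows.get(sample, {})
        batch = batch_rows.get(sample, {})
        for column in columns:
            if single.get(column) != batch.get(column):
                mismatches.append(
                    (sample, column, single.get(column), batch.get(column))
                )
    return mismatches


# Toy values, for illustration only.
fasta_results = {"sample_a": {"pango_lineage": "BA.5.2", "nextclade_clade": "22B"}}
batch_results = {"sample_a": {"pango_lineage": "BA.5.2", "nextclade_clade": "22A"}}
diffs = compare_sample_rows(
    fasta_results, batch_results, ["pango_lineage", "nextclade_clade"]
)
```

An empty list means the two workflows agree on every compared column. Columns that are expected to differ, such as analysis dates and workflow versions, would need to be left out of `columns`.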
🧪 Testing
Locally
Local testing was not possible due to Terra permissions.
Terra
Please find a successful test here: https://app.terra.bio/#workspaces/cdc-terra-resources/Theiagen_Wright_SC2_Sandbox/job_history/82aa49d2-0dc7-46ec-86cb-8ec5278abaf8
Scenarios for Reviewer to Test
🔬 Quality checks
Pull Request (PR) checklist: