Skip to content

Commit

Permalink
[TheiaCoV] Split database from Kraken2_TheiaCoV task (#670)
Browse files Browse the repository at this point in the history
* kraken2_theiacov: Split db from container, update both; minor improvements on krakren2 command call; improvememnts on target organism parse to report proportion of sc2 only if target org is sc2

* set tartget org as sc2, can be user modified

* Floar sc2 percent -> string sc2 percent to allow empty values

* passing on target org to output percent of sc2

* add kraken target org to sc2, update target org of rsv-a and rsv-b

* percent sc2 from float to string to allow empty values

* clarify input name

* add test DB for theiacov kraken2; update CI inputs for TheiaCoV clearlabs

* ops, must be File not String

* expose kraken2_theiacov database and runtime parameters

* expose runtime parameters and db for theiacov's kraken2

* update inputs

* update CI for theiacov_clearlabs

* update md5sums

* do the same for theiacov_ont

* fix bug

* update CI

* update CI

* update docs - freyja

* update docs - theiacov

* make target orgnamism modifiable

* update CI

* ops npw with docs update
  • Loading branch information
cimendes authored Dec 5, 2024
1 parent fd1be2e commit be1247b
Show file tree
Hide file tree
Showing 25 changed files with 173 additions and 104 deletions.
5 changes: 3 additions & 2 deletions docs/workflows/genomic_characterization/freyja.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,7 @@ This workflow runs on the sample level.
| freyja | **number_bootstraps** | Int | The number of bootstraps to perform (only used if bootstrap = true) | 100 | Optional |
| freyja | **update_db** | Boolean | Updates the Freyja reference files (the usher barcodes and lineage metadata files) but will not save them as output (use Freyja_Update for that purpose). If set to true, the `freyja_lineage_metadata` and `freyja_usher_barcodes` files are not required. | FALSE | Optional |
| freyja_fastq | **depth_cutoff** | Int | The minimum coverage depth with which to exclude sites below this value and group identical barcodes | 10 | Optional |
| freyja_fastq | **kraken2_target_organism** | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | "Severe acute respiratory syndrome coronavirus 2" | Optional |
| freyja_fastq | **ont** | Boolean | Indicates if the input data is derived from an ONT instrument. | FALSE | Optional |
| freyja_fastq | **read2** | File | The raw reverse-facing FASTQ file (Illumina only) | | Optional |
| freyja_fastq | **trimmomatic_minlen** | Int | The minimum length cut-off when performing read cleaning | 25 | Optional |
Expand Down Expand Up @@ -371,8 +372,8 @@ The main output file used in subsequent Freyja workflows is found under the `fre
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | ONT, PE, SE |
| kraken_report | File | Full Kraken report | ONT, PE, SE |
| kraken_report_dehosted | File | Full Kraken report after host removal | ONT, PE, SE |
| kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software | ONT, PE, SE |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | ONT, PE, SE |
| kraken_sc2 | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software | ONT, PE, SE |
| kraken_sc2_dehosted | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | ONT, PE, SE |
| kraken_version | String | Version of Kraken software used | ONT, PE, SE |
| minimap2_docker | String | Docker image used to run minimap2 | ONT |
| minimap2_version | String | Version of minimap2 used | ONT |
Expand Down
27 changes: 14 additions & 13 deletions docs/workflows/genomic_characterization/theiacov.md
Original file line number Diff line number Diff line change
Expand Up @@ -221,14 +221,14 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| ivar_consensus | **stats_n_coverage_primtrim_memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | SE,PE | HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2 |
| kraken2_dehosted | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **kraken2_db** | String | The database used to run Kraken2 | /kraken2-db | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **kraken2_db** | File | The database used to run Kraken2. Must contain viral and human sequences. | "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz" | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **read2** | File | Internal component, do not modify | | Do not modify, Optional | CL | sars-cov-2 |
| kraken2_raw | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | CL | sars-cov-2 |
| kraken2_raw | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | CL | sars-cov-2 |
| kraken2_raw | **docker_image** | Int | Docker container used in this task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_raw | **kraken2_db** | String | The database used to run Kraken2 | /kraken2-db | Optional | CL | sars-cov-2 |
| kraken2_raw | **docker_image** | Int | Docker container used in this task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional | CL | sars-cov-2 |
| kraken2_raw | **kraken2_db** | File | The database used to run Kraken2. Must contain viral and human sequences. | "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz" | Optional | CL | sars-cov-2 |
| kraken2_raw | **memory** | String | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | CL | sars-cov-2 |
| kraken2_raw | **read_processing** | String | The tool used for trimming of primers from reads. Options are trimmomatic and fastp | trimmomatic | Optional | | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| kraken2_raw | **read2** | File | Internal component, do not modify | | Do not modify, Optional | CL | sars-cov-2 |
Expand Down Expand Up @@ -300,8 +300,8 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| qc_check_task | **gambit_predicted_taxon** | String | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_human** | String | Internal component, do not modify | | Do not modify, Optional | FASTA, ONT, SE | |
| qc_check_task | **kraken_human_dehosted** | String | Internal component, do not modify | | Do not modify, Optional | FASTA, ONT, SE | |
| qc_check_task | **kraken_sc2** | Float | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_sc2_dehosted** | Float | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_sc2** | String | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_sc2_dehosted** | String | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_target_organism** | Float | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **kraken_target_organism_dehosted** | Float | Internal component, do not modify | | Do not modify, Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| qc_check_task | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | CL, FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
Expand Down Expand Up @@ -341,7 +341,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| read_QC_trim | **call_midas** | Boolean | True/False variable that determines if the MIDAS task should be called. | TRUE | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **downsampling_coverage** | Float | The desired coverage to sub-sample the reads to with RASUSA | 150 | Optional | ONT | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **fastp_args** | String | Additional fastp task arguments | --detect_adapter_for_pe -g -5 20 -3 20 | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **kraken_db** | File | The database used to run Kraken2 | /kraken2-db | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **kraken_db** | File | The database used to run Kraken2. Must contain viral and human sequences. | "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz" | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **kraken_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **kraken_memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| read_QC_trim | **midas_db** | File | The database used by the MIDAS task | gs://theiagen-public-files-rp/terra/theiaprok-files/midas/midas_db_v1.2.tar.gz | Optional | PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
Expand Down Expand Up @@ -487,6 +487,7 @@ The `organism_parameters` sub-workflow is the first step in all TheiaCoV workflo
|---|---|---|
| gene_locations_bed_file | sars-cov-2 | `"gs://theiagen-public-files-rp/terra/sars-cov-2-files/sc2_gene_locations.bed"` |
| genome_length_input | sars-cov-2 | `29903` |
| kraken_target_organism_input | sars-cov-2 | `"Severe acute respiratory syndrome coronavirus 2"` |
| nextclade_dataset_name_input | sars-cov-2 | `"nextstrain/sars-cov-2/wuhan-hu-1/orfs"` |
| nextclade_dataset_tag_input | sars-cov-2 | `"2024-11-19--14-18-53Z"` |
| pangolin_docker_image | sars-cov-2 | `"us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.31 "`|
Expand Down Expand Up @@ -580,7 +581,7 @@ The `organism_parameters` sub-workflow is the first step in all TheiaCoV workflo
| **Overwrite Variable Name** | **Organism** | **Default Value** |
|---|---|---|
| genome_length_input | rsv_a | 16000 |
| kraken_target_organism | rsv_a | Respiratory syncytial virus |
| kraken_target_organism | rsv_a | "Human respiratory syncytial virus A" |
| nextclade_dataset_name_input | rsv_a | nextstrain/rsv/a/EPI_ISL_412866 |
| nextclade_dataset_tag_input | rsv_a | "2024-11-27--02-51-00Z" |
| reference_genome | rsv_a | gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_a.fasta |
Expand All @@ -596,7 +597,7 @@ The `organism_parameters` sub-workflow is the first step in all TheiaCoV workflo
| **Overwrite Variable Name** | **Organism** | **Default Value** |
|---|---|---|
| genome_length_input | rsv_b | 16000 |
| kraken_target_organism | rsv_b | "Human orthopneumovirus" |
| kraken_target_organism | rsv_b | "human respiratory syncytial virus" |
| nextclade_dataset_name_input | rsv_b | nextstrain/rsv/b/EPI_ISL_1653999 |
| nextclade_dataset_tag_input | rsv_b | "2024-11-27--02-51-00Z" |
| reference_genome | rsv_b | gs://theiagen-public-files-rp/terra/rsv_references/reference_rsv_b.fasta |
Expand Down Expand Up @@ -726,7 +727,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that are resulted from the `read_QC_trim` workflow

!!! info "Database-dependent"
TheiaCoV automatically uses a viral-specific Kraken2 database.
TheiaCoV automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at `gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz`

!!! techdetails "Kraken2 Technical Details"
Expand Down Expand Up @@ -776,7 +777,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
Kraken2 is run on the set of raw reads, provided as input, as well as the set of clean reads that are resulted from the `read_QC_trim` workflow

!!! info "Database-dependent"
TheiaCoV automatically uses a viral-specific Kraken2 database.
TheiaCoV automatically uses a viral-specific Kraken2 database. This database was generated in-house from RefSeq's viral sequence collection and human genome GRCh38. It's available at `gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz`

!!! techdetails "Kraken2 Technical Details"
Expand Down Expand Up @@ -1122,8 +1123,8 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_report | File | Full Kraken report | CL, ONT, PE, SE |
| kraken_report_dehosted | File | Full Kraken report after host removal | CL, ONT, PE |
| kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_sc2 | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_sc2_dehosted | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_target_organism | String | Percent of target organism read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_target_organism_dehosted | String | Percent of target organism read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_target_organism_name | String | The name of the target organism; e.g., "Monkeypox" or "Human immunodeficiency virus" | CL, ONT, PE, SE |
Expand Down
1 change: 1 addition & 0 deletions docs/workflows/standalone/ncbi_scrub.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ There are three Kraken2 workflows:
| dehost_pe or dehost_se | **read1** | File | | | Required | PE, SE |
| dehost_pe or dehost_se | **read2** | File | | | Required | PE |
| dehost_pe or dehost_se | **samplename** | String | | | Required | PE, SE |
| dehost_pe or dehost_se | **target_organism** | String | Target organism for Kraken2 reporting | "Severe acute respiratory syndrome coronavirus 2" | Optional | PE, SE |
| kraken2 | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
| kraken2 | **disk_size** | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional | PE, SE |
| kraken2 | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | PE, SE |
Expand Down
Loading

0 comments on commit be1247b

Please sign in to comment.