diff --git a/docs/assets/figures/TheiaProk.png b/docs/assets/figures/TheiaProk.png index 693f902c3..ae6f559b4 100644 Binary files a/docs/assets/figures/TheiaProk.png and b/docs/assets/figures/TheiaProk.png differ diff --git a/docs/workflows/genomic_characterization/theiaprok.md b/docs/workflows/genomic_characterization/theiaprok.md index 188a8a7c5..35422e658 100644 --- a/docs/workflows/genomic_characterization/theiaprok.md +++ b/docs/workflows/genomic_characterization/theiaprok.md @@ -299,7 +299,8 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al | merlin_magic | **agrvate_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/agrvate:1.0.2--hdfd78af_0 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **assembly_only** | Boolean | Internal component, do not modify | | Do not modify, Optional | ONT, PE, SE | | merlin_magic | **call_poppunk** | Boolean | If "true", runs PopPUNK for GPSC cluster designation for S. pneumoniae | TRUE | Optional | FASTA, ONT, PE, SE | -| merlin_magic | **call_shigeifinder_reads_input** | Boolean | If set to "true", the ShigEiFinder task will run again but using read files as input instead of the assembly file. Input is shown but not used for TheiaProk_FASTA. | FALSE | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **call_shigeifinder_reads_input** | Boolean | If set to "true", the ShigEiFinder task will run again but using read files as input instead of the assembly file. Input is shown but not used for TheiaProk_FASTA. | FALSE | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **call_stxtyper** | Boolean | If set to "true", the StxTyper task will run on all samples regardless of the `gambit_predicted_taxon` output. Useful if you suspect a non-*E. coli* or non-*Shigella* sample contains *stx* genes. 
| FALSE | Optional | FASTA, ONT, PE, SE | | merlin_magic | **cauris_cladetyper_docker_image** | String | Internal component, do not modify | | Do not modify, Optional | FASTA, ONT, PE, SE | | merlin_magic | **cladetyper_kmer_size** | Int | Internal component, do not modify | | Do not modify, Optional | FASTA, ONT, PE, SE | | merlin_magic | **cladetyper_ref_clade1** | File | *Provide an empty file if running TheiaProk on the command-line | | Do not modify, Optional | FASTA, ONT, PE, SE | @@ -373,8 +374,8 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al | merlin_magic | **serotypefinder_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/serotypefinder:2.0.1 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **shigatyper_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/shigatyper:2.0.5 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **shigeifinder_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/shigeifinder:1.3.5 | Optional | FASTA, ONT, PE, SE | -| merlin_magic | **sistr_cpu** | Int | The number of CPU cores to allocate for the task. | 8 | Optional | FASTA, ONT, PE, SE | -| merlin_magic | **sistr_disk_size** | Int | The disk size (in GB) to allocate for the task. 
| 100 | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **sistr_cpu** | Int | The number of CPU cores to allocate for the task | 8 | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **sistr_disk_size** | Int | The disk size (in GB) to allocate for the task | 100 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **sistr_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/sistr_cmd:1.1.1--pyh864c0ab_2 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **sistr_memory** | Int | The amount of memory (in GB) to allocate for the task. | 32 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **sistr_use_full_cgmlst_db** | Boolean | Set to true to use the full set of cgMLST alleles which can include highly similar alleles. By default the smaller "centroid" alleles or representative alleles are used for each marker | False | Optional | FASTA, ONT, PE, SE | @@ -400,6 +401,11 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al | merlin_magic | **srst2_min_cov** | Int | Minimum breadth of coverage for SRST2 to call a gene as present | 80 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **srst2_min_depth** | Int | Minimum depth of coverage for SRST2 to call a gene as present | 5 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **srst2_min_edge_depth** | Int | Minimum edge depth for SRST2 to call a gene as present | 2 | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **stxtyper_cpu** | Int | The number of CPU cores to allocate for the task. 
| 1 | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **stxtyper_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **stxtyper_docker_image** | String | The Docker container to use for the task | `us-docker.pkg.dev/general-theiagen/staphb/stxtyper:1.0.24` | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **stxtyper_enable_debug** | Boolean | When enabled, additional messages are printed and files in `$TMPDIR` are not removed after running | FALSE | Optional | FASTA, ONT, PE, SE | +| merlin_magic | **stxtyper_memory** | Int | Amount of memory (in GB) to allocate to the task | 4 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **staphopia_sccmec_docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/staphopia-sccmec:1.0.0--hdfd78af_0 | Optional | FASTA, ONT, PE, SE | | merlin_magic | **tbp_parser_coverage_regions_bed** | File | A bed file that lists the regions to be considered for QC | | Optional | FASTA, ONT, PE, SE | | merlin_magic | **tbp_parser_coverage_threshold** | Int | The minimum coverage for a region to pass QC in tbp_parser | 100 | Optional | FASTA, ONT, PE, SE | @@ -1131,8 +1137,8 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after NCBI's AMRFinderPlus, which is implemented as a core module in TheiaProk, detects the *bla*OXA-51-like genes. This may be used to confirm the species, in addition to the GAMBIT taxon identification. The *bla*OXA-51-like genes act as carbapenemases when an IS*Aba1* is found 7 bp upstream of the gene. Detection of this IS is not currently undertaken in TheiaProk. -??? toggle "_Escherichia_ or _Shigella_ spp" - ##### _Escherichia_ or _Shigella_ spp {#escherichia-or-shigella} +??? toggle "_Escherichia_ or _Shigella_ spp." + ##### _Escherichia_ or _Shigella_ spp. 
{#escherichia-or-shigella} The *Escherichia* and *Shigella* genera are [difficult to differentiate as they do not comply with genomic definitions of genera and species](https://www.sciencedirect.com/science/article/abs/pii/S1286457902016374). Consequently, when either _Escherichia_ or _Shigella_ are identified by GAMBIT, all tools intended for these taxa are used. @@ -1183,7 +1189,7 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after ??? task "`ShigaTyper`: *Shigella*/EIEC differentiation and serotyping ==_for Illumina and ONT only_==" - ShigaTyper predicts *Shigella* spp serotypes from Illumina or ONT read data. If the genome is not *Shigella* or EIEC, the results from this tool will state this. In the notes it provides, it also reports on the presence of *ipaB* which is suggestive of the presence of the "virulent invasion plasmid". + ShigaTyper predicts *Shigella* spp. serotypes from Illumina or ONT read data. If the genome is not *Shigella* or EIEC, the results from this tool will state this. In the notes it provides, it also reports on the presence of *ipaB* which is suggestive of the presence of the "virulent invasion plasmid". !!! techdetails "ShigaTyper Technical Details" @@ -1238,6 +1244,30 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after **Shigella XDR prediction.** Please see the documentation section above for ResFinder for details regarding this taxa-specific analysis. + ??? task "`StxTyper`: Identification and typing of Shiga toxin (Stx) genes ==_using the assembly file as input_==" + + StxTyper screens bacterial genome assemblies for shiga toxin genes and subtypes them into known subtypes and also looks for novel subtypes in cases where the detected sequences diverge from the reference sequences. + + Shiga toxin is the main virulence factor of Shiga-toxin-producing E. 
coli (STEC), though these genes are also found in *Shigella* species and, more rarely, in other genera such as *Klebsiella*. [This review paper describes Shiga toxins in detail.](https://doi.org/10.3390/microorganisms12040687) + + !!! tip "Running StxTyper via the TheiaProk workflows" + The TheiaProk workflow will automatically run `stxtyper` on all *E. coli* and *Shigella* spp. samples, but ==*the user can opt in to running the tool on any sample by setting the optional input variable `call_stxtyper` to `true` when configuring the workflow.*== + + Generally, `stxtyper` looks for _stxA_ and _stxB_ subunits that compose a complete operon. The A subunit is longer (in amino acid length) than the B subunit. StxTyper attempts to detect these, compare them to a database of known sequences, and type them based on amino acid composition. The typing algorithm and the rules defining how to type these genes and operons will be described more completely in a future publication. + + The `stxtyper_report` output TSV is provided in [this output format.](https://github.com/ncbi/stxtyper/tree/v1.0.24?tab=readme-ov-file#output) + + Eventually this tool will be incorporated into AMRFinderPlus and will run behind the scenes when the user (or in this case, the TheiaProk workflow) provides the `amrfinder --organism Escherichia` option. + + !!! techdetails "StxTyper Technical Details" + + | | Links | + | --- | --- | + | Task | [task_stxtyper.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/escherichia_shigella/task_stxtyper.wdl) | + | Software Source Code | [ncbi/stxtyper GitHub repository](https://github.com/ncbi/stxtyper) | + | Software Documentation | [ncbi/stxtyper GitHub repository](https://github.com/ncbi/stxtyper) | + | Original Publication(s) | No publication currently available, as this is a new tool. One will be available in the future. | + ??? 
toggle "_Haemophilus influenzae_" ##### _Haemophilus influenzae_ {#haemophilus-influenzae} ??? task "`hicap`: Sequence typing" @@ -1256,8 +1286,8 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | Software Documentation | [hicap on GitHub](https://github.com/scwatts/hicap) | | Original Publication(s) | [hicap: In Silico Serotyping of the Haemophilus influenzae Capsule Locus](https://doi.org/10.7717/peerj.5261) | -??? toggle "_Klebsiella_ spp" - ##### _Klebsiella_ spp {#klebsiella} +??? toggle "_Klebsiella_ spp." + ##### _Klebsiella_ spp. {#klebsiella} ??? task "`Kleborate`: Species identification, MLST, serotyping, AMR and virulence characterization" [Kleborate](https://github.com/katholt/Kleborate) is a tool to identify the *Klebsiella* species, MLST sequence type, serotype, virulence factors (ICE*Kp* and plasmid associated), and AMR genes and mutations. Serotyping is based on the capsular (K antigen) and lipopolysaccharide (LPS) (O antigen) genes. The resistance genes identified by Kleborate are described [here](https://github.com/katholt/Kleborate/wiki/Antimicrobial-resistance). @@ -1338,8 +1368,8 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | Software Source Code | [clockwork](https://github.com/iqbal-lab-org/clockwork) | | Software Documentation | | -??? toggle "_Neisseria_ spp" - ##### _Neisseria_ spp {#neisseria} +??? toggle "_Neisseria_ spp." + ##### _Neisseria_ spp. {#neisseria} ??? task "`ngmaster`: _Neisseria gonorrhoeae_ sequence typing" NG-MAST is currently the most widely used method for epidemiological surveillance of *Neisseria gonorrhoea.* This tool is targeted at clinical and research microbiology laboratories that have performed WGS of *N. gonorrhoeae* isolates and wish to understand the molecular context of their data in comparison to previously published epidemiological studies. 
As WGS becomes more routinely performed, *NGMASTER* @@ -1386,14 +1416,14 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | Software Documentation | [pasty](https://github.com/rpetit3/pasty) | | Original Publication(s) | [Application of Whole-Genome Sequencing Data for O-Specific Antigen Analysis and In Silico Serotyping of Pseudomonas aeruginosa Isolates.](https://journals.asm.org/doi/10.1128/JCM.00349-16) | -??? toggle "_Salmonella_ spp" - ##### _Salmonella_ spp {#salmonella} +??? toggle "_Salmonella_ spp." + ##### _Salmonella_ spp. {#salmonella} Both SISTR and SeqSero2 are used for serotyping all *Salmonella* spp. Occasionally, the predicted serotypes may differ between SISTR and SeqSero2. When this occurs, differences are typically small and analogous, and are likely as a result of differing source databases. More information about Salmonella serovar nomenclature can be found [here](https://www.happykhan.com/posts/binfie-guide-serovar/). For *Salmonella* Typhi, genotyphi is additionally run for further typing. ??? task "`SISTR`: Salmonella serovar prediction" - [SISTR](https://github.com/phac-nml/sistr_cmd) performs *Salmonella spp* serotype prediction using antigen gene and cgMLST gene alleles. In TheiaProk. SISTR is run on genome assemblies, and uses the default database setting (smaller "centroid" alleles or representative alleles instead of the full set of cgMLST alleles). It also runs a QC mode to determine the level of confidence in the serovar prediction (see [here](https://github.com/phac-nml/sistr_cmd#qc-by-sistr_cmd---qc)). + [SISTR](https://github.com/phac-nml/sistr_cmd) performs *Salmonella* spp. serotype prediction using antigen gene and cgMLST gene alleles. In TheiaProk, SISTR is run on genome assemblies and uses the default database setting (smaller "centroid" alleles or representative alleles instead of the full set of cgMLST alleles). 
It also runs a QC mode to determine the level of confidence in the serovar prediction (see [here](https://github.com/phac-nml/sistr_cmd#qc-by-sistr_cmd---qc)). !!! techdetails "SISTR Technical Details" @@ -1534,11 +1564,11 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | Software Source Code | [emm-typing-tool](https://github.com/ukhsa-collaboration/emm-typing-tool) | | Software Documentation | [emm-typing-tool](https://github.com/ukhsa-collaboration/emm-typing-tool) | -??? toggle "_Vibrio_ spp" - ##### _Vibrio_ spp {#vibrio} +??? toggle "_Vibrio_ spp." + ##### _Vibrio_ spp. {#vibrio} ??? task "`SRST2`: Vibrio characterization ==_for Illumina only_==" - The `SRST2 Vibrio characterization` task detects sequences for *Vibrio* spp characterization using Illumina sequence reads and a database of target sequence that are traditionally used in PCR methods. The sequences included in the database are as follows: + The `SRST2 Vibrio characterization` task detects sequences for *Vibrio* spp. characterization using Illumina sequence reads and a database of target sequence that are traditionally used in PCR methods. The sequences included in the database are as follows: | Sequence name | Sequence role | Purpose in database | | --- | --- | --- | @@ -1561,7 +1591,7 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after ??? task "`Abricate`: Vibrio characterization" - The `Abricate` Vibrio characterization task detects sequences for *Vibrio* spp characterization using genome assemblies and the abricate "vibrio" database. The sequences included in the database are as follows: + The `Abricate` Vibrio characterization task detects sequences for *Vibrio* spp. characterization using genome assemblies and the abricate "vibrio" database. 
The sequences included in the database are as follows: | Sequence name | Sequence role | Purpose in database | | --- | --- | --- | @@ -1865,7 +1895,7 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | resfinder_results | File | Predicted resistance genes grouped by antibiotic class | FASTA, ONT, PE, SE | | resfinder_seqs | File | FASTA of resistance gene sequences from user’s input sequence | FASTA, ONT, PE, SE | | seq_platform | String | Sequencing platform input by the user | FASTA, ONT, PE, SE | -| seqsero2_predicted_antigenic_profile | String | Antigenic profile predicted for Salmonella spp by SeqSero2 | ONT, PE, SE | +| seqsero2_predicted_antigenic_profile | String | Antigenic profile predicted for Salmonella spp. by SeqSero2 | ONT, PE, SE | | seqsero2_predicted_contamination | String | Indicates whether contamination between Salmonella with different serotypes was predicted by SeqSero2 | ONT, PE, SE | | seqsero2_predicted_serotype | String | Serotype predicted by SeqSero2 | ONT, PE, SE | | seqsero2_report | File | TSV report produced by SeqSero2 | ONT, PE, SE | @@ -1940,6 +1970,15 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after | staphopiasccmec_results_tsv | File | sccmec types and mecA presence | FASTA, ONT, PE, SE | | staphopiasccmec_types_and_mecA_presence | String | staphopia-sccmec Hamming distance file | FASTA, ONT, PE, SE | | staphopiasccmec_version | String | staphopia-sccmec presence and absence TSV file | FASTA, ONT, PE, SE | +| stxtyper_all_hits | String | Comma-separated list of matches of all types. Includes complete, partial, frameshift, internal stop, and novel hits. List is de-duplicated so multiple identical hits are only listed once. For example if 5 partial stx2 hits are detected in the genome, only 1 "stx2" will be listed in this field. To view the potential subtype for each partial hit, the user will need to view the stxtyper_report TSV file. 
| FASTA, ONT, PE, SE | +| stxtyper_complete_operons | String | Comma-separated list of all COMPLETE operons detected by StxTyper. Shows multiple hits if present in results. | FASTA, ONT, PE, SE | +| stxtyper_docker | String | Name of docker image used by the stxtyper task. | FASTA, ONT, PE, SE | +| stxtyper_novel_hits | String | Comma-separated list of matches that have the OPERON output of "COMPLETE_NOVEL". Possible outputs: "stx1", "stx2", or "stx1,stx2" | FASTA, ONT, PE, SE | +| stxtyper_num_hits | Int | Number of "hits" or rows present in the `stxtyper_report` TSV file | FASTA, ONT, PE, SE | +| stxtyper_partial_hits | String | Possible outputs: "stx1", "stx2", or "stx1,stx2". Indicates a partial hit to either the A or B subunit; reports only the possible types from the PARTIAL matches, not which subunit was hit. | FASTA, ONT, PE, SE | +| stxtyper_report | File | Raw results TSV file produced by StxTyper | FASTA, ONT, PE, SE | +| stxtyper_stx_frameshifts_or_internal_stop_hits | String | Comma-separated list of matches that have the OPERON output of "FRAMESHIFT" or "INTERNAL_STOP". Possible outputs: "stx1", "stx2", or "stx1,stx2" | FASTA, ONT, PE, SE | +| stxtyper_version | String | Version of StxTyper used | FASTA, ONT, PE, SE | | taxon_table_status | String | Status of the taxon table upload | FASTA, ONT, PE, SE | | tbp_parser_average_genome_depth | Float | Optional output. Average genome depth across the reference genome | ONT, PE, SE | | tbp_parser_coverage_report | File | Optional output. TSV file with breadth of coverage of each gene associated with antimicrobial resistance in mycobacterium tuberculosis. 
| ONT, PE, SE | diff --git a/tasks/species_typing/escherichia_shigella/task_stxtyper.wdl b/tasks/species_typing/escherichia_shigella/task_stxtyper.wdl new file mode 100644 index 000000000..f066ad109 --- /dev/null +++ b/tasks/species_typing/escherichia_shigella/task_stxtyper.wdl @@ -0,0 +1,122 @@ +version 1.0 + +task stxtyper { + input { + File assembly + String samplename + Boolean enable_debugging = false # Additional messages are printed and files in $TMPDIR are not removed after running + String docker = "us-docker.pkg.dev/general-theiagen/staphb/stxtyper:1.0.24" + Int disk_size = 50 + Int cpu = 1 + Int memory = 4 + } + command <<< + # fail task if any commands below fail, since there are many bash conditionals below + set -eo pipefail + + # capture version info + stxtyper --version | tee VERSION.txt + + # NOTE: by default stxtyper uses $TMPDIR or /tmp, so if we run into issues we may need to adjust in the future. Could potentially use PWD as the TMPDIR. + echo "DEBUG: TMPDIR is set to: $TMPDIR" + + echo "DEBUG: running StxTyper now..." + # run StxTyper on assembly; may need to add/remove options in the future if they change + # NOTE: stxtyper can accept gzipped assemblies, so no need to unzip + stxtyper \ + --nucleotide ~{assembly} \ + --name ~{samplename} \ + --output ~{samplename}_stxtyper.tsv \ + ~{true='--debug' false='' enable_debugging} \ + --log ~{samplename}_stxtyper.log + + # parse output TSV + echo "DEBUG: Parsing StxTyper output TSV..." + + # check for output file with only 1 line (meaning no hits found); exit cleanly if so + if [ "$(wc -l < ~{samplename}_stxtyper.tsv)" -eq 1 ]; then + echo "No hits found by StxTyper" > stxtyper_hits.txt + echo "0" > stxtyper_num_hits.txt + echo "DEBUG: No hits found in StxTyper output TSV. Exiting task with exit code 0 now." 
+ + # create empty output files + touch stxtyper_all_hits.txt stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt + # put "None" into all of them so task does not fail + echo "None" | tee stxtyper_all_hits.txt stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt + exit 0 + fi + + # check for output file with more than 1 line (meaning hits found); count lines & parse output TSV if so + if [ "$(wc -l < ~{samplename}_stxtyper.tsv)" -gt 1 ]; then + echo "Hits found by StxTyper. Counting lines & parsing output TSV now..." + # count number of lines in output TSV (excluding header) + wc -l < ~{samplename}_stxtyper.tsv | awk '{print $1-1}' > stxtyper_num_hits.txt + # remove header line + sed '1d' ~{samplename}_stxtyper.tsv > ~{samplename}_stxtyper_noheader.tsv + + ##### parse output TSV ##### + ### complete operons + echo "DEBUG: Parsing complete operons..." + awk -F'\t' -v OFS=, '$4 == "COMPLETE" {print $3}' ~{samplename}_stxtyper.tsv | paste -sd, - | tee stxtyper_complete_operons.txt + # a grep for 'COMPLETE' on the TSV would also match COMPLETE_NOVEL rows, so check the parsed list instead; if it contains no stx types, write "None" to file for output string + if [[ "$(grep --silent 'stx' stxtyper_complete_operons.txt; echo $?)" -gt 0 ]]; then + echo "None" > stxtyper_complete_operons.txt + fi + + ### complete_novel operons + echo "DEBUG: Parsing complete novel hits..." 
+ awk -F'\t' -v OFS=, '$4 == "COMPLETE_NOVEL" {print $3}' ~{samplename}_stxtyper.tsv | paste -sd, - | tee stx_novel_hits.txt + # if grep for COMPLETE_NOVEL fails, write "None" to file for output string + if [ "$(grep --silent 'COMPLETE_NOVEL' ~{samplename}_stxtyper.tsv; echo $?)" -gt 0 ]; then + echo "None" > stx_novel_hits.txt + fi + + ### partial hits (to any gene in stx operon) + echo "DEBUG: Parsing stxtyper partial hits..." 
+ # explanation: if "operon" column contains "PARTIAL" (either PARTIAL or PARTIAL_CONTIG_END possible); print either "stx1" or "stx2" or "stx1,stx2" + awk -F'\t' -v OFS=, '$4 ~ "PARTIAL.*" {print $3}' ~{samplename}_stxtyper.tsv | sort | uniq | paste -sd, - | tee stxtyper_partial_hits.txt + # if no stx partial hits found, write "None" to file for output string + if [ "$(grep --silent 'stx' stxtyper_partial_hits.txt; echo $?)" -gt 0 ]; then + echo "None" > stxtyper_partial_hits.txt + fi + + ### frameshifts or internal stop codons in stx genes + echo "DEBUG: Parsing stx frameshifts or internal stop codons..." + # explanation: if operon column is "FRAMESHIFT" or "INTERNAL_STOP", print the stx type in a sorted/unique list + awk -F'\t' -v OFS=, '$4 == "FRAMESHIFT" || $4 == "INTERNAL_STOP" {print $3}' ~{samplename}_stxtyper.tsv | sort | uniq | paste -sd, - | tee stxtyper_stx_frameshifts_or_internal_stop_hits.txt + # if no frameshifts or internal stop codons found, write "None" to file for output string + if [ "$(grep --silent -E 'FRAMESHIFT|INTERNAL_STOP' ~{samplename}_stxtyper.tsv; echo $?)" -gt 0 ]; then + echo "None" > stxtyper_stx_frameshifts_or_internal_stop_hits.txt + fi + + echo "DEBUG: generating stx_type_all string output now..." + # split each comma-separated list into one item per line so duplicates across and within files collapse; then sort, uniq, and paste into a single comma-separated line + # sed is to remove any instances of "None" from the output + cat stxtyper_complete_operons.txt stxtyper_partial_hits.txt stxtyper_stx_frameshifts_or_internal_stop_hits.txt stx_novel_hits.txt | tr ',' '\n' | sed '/None/d' | sort | uniq | paste -sd, - > stxtyper_all_hits.txt + + fi + echo "DEBUG: Finished parsing StxTyper output TSV." 
+ >>> + output { + File stxtyper_report = "~{samplename}_stxtyper.tsv" + File stxtyper_log = "~{samplename}_stxtyper.log" + String stxtyper_docker = docker + String stxtyper_version = read_string("VERSION.txt") + # outputs parsed from stxtyper output TSV + Int stxtyper_num_hits = read_int("stxtyper_num_hits.txt") + String stxtyper_all_hits = read_string("stxtyper_all_hits.txt") + String stxtyper_complete_operon_hits = read_string("stxtyper_complete_operons.txt") + String stxtyper_partial_hits = read_string("stxtyper_partial_hits.txt") + String stxtyper_frameshifts_or_internal_stop_hits = read_string("stxtyper_stx_frameshifts_or_internal_stop_hits.txt") + String stxtyper_novel_hits = read_string("stx_novel_hits.txt") + } + runtime { + docker: "~{docker}" + memory: "~{memory} GB" + cpu: cpu + disks: "local-disk " + disk_size + " SSD" + disk: disk_size + " GB" + preemptible: 1 # does not take long (usually <3 min) to run stxtyper on 1 genome, preemptible is fine + maxRetries: 3 + } +} diff --git a/tests/workflows/theiaprok/test_wf_theiaprok_illumina_pe.yml b/tests/workflows/theiaprok/test_wf_theiaprok_illumina_pe.yml index 91ae801b7..71f5bd4a2 100644 --- a/tests/workflows/theiaprok/test_wf_theiaprok_illumina_pe.yml +++ b/tests/workflows/theiaprok/test_wf_theiaprok_illumina_pe.yml @@ -631,9 +631,9 @@ - path: miniwdl_run/wdl/tasks/utilities/data_export/task_broad_terra_tools.wdl md5sum: 4d69a6539b68503af9f3f1c2787ff920 - path: miniwdl_run/wdl/workflows/theiaprok/wf_theiaprok_illumina_pe.wdl - md5sum: 6d9dd969e2144ca23f2a0e101e6b6966 + md5sum: 3cb5c86b15e931b0c0b98ed784386438 - path: miniwdl_run/wdl/workflows/utilities/wf_merlin_magic.wdl - md5sum: 670f990128063eb3c7b3fa49302f08b7 + md5sum: ea5cff6eff8c2c42046cf2eae6f16b6f - path: miniwdl_run/wdl/workflows/utilities/wf_read_QC_trim_pe.wdl contains: ["version", "QC", "output"] - path: miniwdl_run/workflow.log diff --git a/tests/workflows/theiaprok/test_wf_theiaprok_illumina_se.yml 
b/tests/workflows/theiaprok/test_wf_theiaprok_illumina_se.yml index 82f9a9a74..88584182b 100644 --- a/tests/workflows/theiaprok/test_wf_theiaprok_illumina_se.yml +++ b/tests/workflows/theiaprok/test_wf_theiaprok_illumina_se.yml @@ -594,9 +594,9 @@ - path: miniwdl_run/wdl/tasks/utilities/data_export/task_broad_terra_tools.wdl md5sum: 4d69a6539b68503af9f3f1c2787ff920 - path: miniwdl_run/wdl/workflows/theiaprok/wf_theiaprok_illumina_se.wdl - md5sum: 5aa25e4fad466f92c96a7c138aca0d20 + md5sum: fdb66b59ac886501a4ae90a25cefd633 - path: miniwdl_run/wdl/workflows/utilities/wf_merlin_magic.wdl - md5sum: 670f990128063eb3c7b3fa49302f08b7 + md5sum: ea5cff6eff8c2c42046cf2eae6f16b6f - path: miniwdl_run/wdl/workflows/utilities/wf_read_QC_trim_se.wdl md5sum: d11bfe33fdd96eab28892be5a01c1c7d - path: miniwdl_run/workflow.log diff --git a/workflows/theiaprok/wf_theiaprok_fasta.wdl b/workflows/theiaprok/wf_theiaprok_fasta.wdl index 3735bda10..4ce8e5cb2 100644 --- a/workflows/theiaprok/wf_theiaprok_fasta.wdl +++ b/workflows/theiaprok/wf_theiaprok_fasta.wdl @@ -576,6 +576,16 @@ workflow theiaprok_fasta { File? virulencefinder_report_tsv = merlin_magic.virulencefinder_report_tsv String? virulencefinder_docker = merlin_magic.virulencefinder_docker String? virulencefinder_hits = merlin_magic.virulencefinder_hits + # stxtyper + File? stxtyper_report = merlin_magic.stxtyper_report + String? stxtyper_docker = merlin_magic.stxtyper_docker + String? stxtyper_version = merlin_magic.stxtyper_version + Int? stxtyper_num_hits = merlin_magic.stxtyper_num_hits + String? stxtyper_all_hits = merlin_magic.stxtyper_all_hits + String? stxtyper_complete_operons = merlin_magic.stxtyper_complete_operon_hits + String? stxtyper_partial_hits = merlin_magic.stxtyper_partial_hits + String? stxtyper_stx_frameshifts_or_internal_stop_hits = merlin_magic.stxtyper_stx_frameshifts_or_internal_stop_hits + String? stxtyper_novel_hits = merlin_magic.stxtyper_novel_hits # Listeria Typing File? 
lissero_results = merlin_magic.lissero_results
     String? lissero_version = merlin_magic.lissero_version
diff --git a/workflows/theiaprok/wf_theiaprok_illumina_pe.wdl b/workflows/theiaprok/wf_theiaprok_illumina_pe.wdl
index fd2e7dea1..d71c5e324 100644
--- a/workflows/theiaprok/wf_theiaprok_illumina_pe.wdl
+++ b/workflows/theiaprok/wf_theiaprok_illumina_pe.wdl
@@ -819,6 +819,16 @@ workflow theiaprok_illumina_pe {
     File? virulencefinder_report_tsv = merlin_magic.virulencefinder_report_tsv
     String? virulencefinder_docker = merlin_magic.virulencefinder_docker
     String? virulencefinder_hits = merlin_magic.virulencefinder_hits
+    # stxtyper
+    File? stxtyper_report = merlin_magic.stxtyper_report
+    String? stxtyper_docker = merlin_magic.stxtyper_docker
+    String? stxtyper_version = merlin_magic.stxtyper_version
+    Int? stxtyper_num_hits = merlin_magic.stxtyper_num_hits
+    String? stxtyper_all_hits = merlin_magic.stxtyper_all_hits
+    String? stxtyper_complete_operons = merlin_magic.stxtyper_complete_operon_hits
+    String? stxtyper_partial_hits = merlin_magic.stxtyper_partial_hits
+    String? stxtyper_stx_frameshifts_or_internal_stop_hits = merlin_magic.stxtyper_stx_frameshifts_or_internal_stop_hits
+    String? stxtyper_novel_hits = merlin_magic.stxtyper_novel_hits
     # Shigella sonnei Typing
     File? sonneityping_mykrobe_report_csv = merlin_magic.sonneityping_mykrobe_report_csv
     File? sonneityping_mykrobe_report_json = merlin_magic.sonneityping_mykrobe_report_json
diff --git a/workflows/theiaprok/wf_theiaprok_illumina_se.wdl b/workflows/theiaprok/wf_theiaprok_illumina_se.wdl
index 0e00e3ac8..1c3eee081 100644
--- a/workflows/theiaprok/wf_theiaprok_illumina_se.wdl
+++ b/workflows/theiaprok/wf_theiaprok_illumina_se.wdl
@@ -758,6 +758,16 @@ workflow theiaprok_illumina_se {
     File? virulencefinder_report_tsv = merlin_magic.virulencefinder_report_tsv
     String? virulencefinder_docker = merlin_magic.virulencefinder_docker
     String? virulencefinder_hits = merlin_magic.virulencefinder_hits
+    # stxtyper
+    File? stxtyper_report = merlin_magic.stxtyper_report
+    String? stxtyper_docker = merlin_magic.stxtyper_docker
+    String? stxtyper_version = merlin_magic.stxtyper_version
+    Int? stxtyper_num_hits = merlin_magic.stxtyper_num_hits
+    String? stxtyper_all_hits = merlin_magic.stxtyper_all_hits
+    String? stxtyper_complete_operons = merlin_magic.stxtyper_complete_operon_hits
+    String? stxtyper_partial_hits = merlin_magic.stxtyper_partial_hits
+    String? stxtyper_stx_frameshifts_or_internal_stop_hits = merlin_magic.stxtyper_stx_frameshifts_or_internal_stop_hits
+    String? stxtyper_novel_hits = merlin_magic.stxtyper_novel_hits
     # Shigella sonnei Typing
     File? sonneityping_mykrobe_report_csv = merlin_magic.sonneityping_mykrobe_report_csv
     File? sonneityping_mykrobe_report_json = merlin_magic.sonneityping_mykrobe_report_json
diff --git a/workflows/theiaprok/wf_theiaprok_ont.wdl b/workflows/theiaprok/wf_theiaprok_ont.wdl
index fecfb744e..a7eb9143e 100644
--- a/workflows/theiaprok/wf_theiaprok_ont.wdl
+++ b/workflows/theiaprok/wf_theiaprok_ont.wdl
@@ -730,6 +730,16 @@ workflow theiaprok_ont {
     File? virulencefinder_report_tsv = merlin_magic.virulencefinder_report_tsv
     String? virulencefinder_docker = merlin_magic.virulencefinder_docker
     String? virulencefinder_hits = merlin_magic.virulencefinder_hits
+    # stxtyper
+    File? stxtyper_report = merlin_magic.stxtyper_report
+    String? stxtyper_docker = merlin_magic.stxtyper_docker
+    String? stxtyper_version = merlin_magic.stxtyper_version
+    Int? stxtyper_num_hits = merlin_magic.stxtyper_num_hits
+    String? stxtyper_all_hits = merlin_magic.stxtyper_all_hits
+    String? stxtyper_complete_operons = merlin_magic.stxtyper_complete_operon_hits
+    String? stxtyper_partial_hits = merlin_magic.stxtyper_partial_hits
+    String? stxtyper_stx_frameshifts_or_internal_stop_hits = merlin_magic.stxtyper_stx_frameshifts_or_internal_stop_hits
+    String? stxtyper_novel_hits = merlin_magic.stxtyper_novel_hits
     # Shigella sonnei Typing
     File? sonneityping_mykrobe_report_csv = merlin_magic.sonneityping_mykrobe_report_csv
     File? sonneityping_mykrobe_report_json = merlin_magic.sonneityping_mykrobe_report_json
diff --git a/workflows/utilities/wf_merlin_magic.wdl b/workflows/utilities/wf_merlin_magic.wdl
index 1d6914e79..f10060851 100644
--- a/workflows/utilities/wf_merlin_magic.wdl
+++ b/workflows/utilities/wf_merlin_magic.wdl
@@ -8,6 +8,7 @@ import "../../tasks/species_typing/escherichia_shigella/task_serotypefinder.wdl"
 import "../../tasks/species_typing/escherichia_shigella/task_shigatyper.wdl" as shigatyper_task
 import "../../tasks/species_typing/escherichia_shigella/task_shigeifinder.wdl" as shigeifinder_task
 import "../../tasks/species_typing/escherichia_shigella/task_sonneityping.wdl" as sonneityping_task
+import "../../tasks/species_typing/escherichia_shigella/task_stxtyper.wdl" as stxtyper_task
 import "../../tasks/species_typing/escherichia_shigella/task_virulencefinder.wdl" as virulencefinder_task
 import "../../tasks/species_typing/haemophilus/task_hicap.wdl" as hicap_task
 import "../../tasks/species_typing/klebsiella/task_kleborate.wdl" as kleborate_task
@@ -218,6 +219,13 @@ workflow merlin_magic {
     Float? virulencefinder_coverage_threshold
     Float? virulencefinder_identity_threshold
     String? virulencefinder_database
+    # stxtyper options
+    Boolean call_stxtyper = false # set to true to run stxtyper on any bacterial sample
+    Boolean? stxtyper_enable_debug
+    String? stxtyper_docker_image
+    Int? stxtyper_disk_size
+    Int? stxtyper_cpu
+    Int? stxtyper_memory
   }
   # theiaprok
   if (merlin_tag == "Acinetobacter baumannii") {
@@ -241,6 +249,19 @@ workflow merlin_magic {
         docker = abricate_abaum_docker_image
     }
   }
+  # stxtyper is special & in its own conditional block because it should automatically be run on Escherichia and Shigella species; but optionally run on ANY bacterial sample if the user wants to screen for Shiga toxin genes
+  if (merlin_tag == "Escherichia" || merlin_tag == "Shigella sonnei" || call_stxtyper == true ) {
+    call stxtyper_task.stxtyper {
+      input:
+        assembly = assembly,
+        samplename = samplename,
+        docker = stxtyper_docker_image,
+        disk_size = stxtyper_disk_size,
+        cpu = stxtyper_cpu,
+        memory = stxtyper_memory,
+        enable_debugging = stxtyper_enable_debug
+    }
+  }
   if (merlin_tag == "Escherichia" || merlin_tag == "Shigella sonnei" ) {
     # tools specific to ALL Escherichia and Shigella species
     #
@@ -755,6 +776,16 @@ workflow merlin_magic {
     File? virulencefinder_report_tsv = virulencefinder.virulencefinder_report_tsv
     String? virulencefinder_docker = virulencefinder.virulencefinder_docker
     String? virulencefinder_hits = virulencefinder.virulencefinder_hits
+    # stxtyper
+    File? stxtyper_report = stxtyper.stxtyper_report
+    String? stxtyper_docker = stxtyper.stxtyper_docker
+    String? stxtyper_version = stxtyper.stxtyper_version
+    Int? stxtyper_num_hits = stxtyper.stxtyper_num_hits
+    String? stxtyper_all_hits = stxtyper.stxtyper_all_hits
+    String? stxtyper_complete_operon_hits = stxtyper.stxtyper_complete_operon_hits
+    String? stxtyper_partial_hits = stxtyper.stxtyper_partial_hits
+    String? stxtyper_stx_frameshifts_or_internal_stop_hits = stxtyper.stxtyper_frameshifts_or_internal_stop_hits
+    String? stxtyper_novel_hits = stxtyper.stxtyper_novel_hits
     # Shigella sonnei Typing
     File? sonneityping_mykrobe_report_csv = sonneityping.sonneityping_mykrobe_report_csv
     File? sonneityping_mykrobe_report_json = sonneityping.sonneityping_mykrobe_report_json