Releases: theiagen/public_health_bioinformatics

v2.3.0

19 Dec 20:43
f81fdb1

Public Health Bioinformatics v2.3.0 Minor Release

This minor release adds two new workflows, Fetch_SRR_Accession_PHB and Concatenate_Illumina_Lanes_PHB, and makes significant improvements to the TheiaCoV, TheiaEuk, TheiaProk, and TheiaMeta workflow series. Documentation updates and various bug fixes have also been implemented.

Full release notes can be found here!

Find our documentation here!

🆕 New workflows

  • Concatenate_Illumina_Lanes_PHB

    • Some Illumina sequencing platforms produce FASTQ files split across multiple lanes for a single sample. This workflow combines the multi-lane FASTQ files from an Illumina sequencing run into a single read1 file and a single read2 file per sample, so that downstream workflows such as assembly or variant calling receive consolidated FASTQ files.
    • This workflow is designed to run automatically at the start of the TheiaProk workflow if multi-lane FASTQ files are provided (e.g., read1_lane2.fastq.gz and read2_lane2.fastq.gz)
    • Import this workflow from Dockstore
  • Fetch_SRR_Accession_PHB

    • This workflow will retrieve any Sequence Read Archive (SRA) accessions (SRR) associated with a given sample accession, such as a BioSample ID (e.g., "SAMN00000000") or SRA Experiment ID (e.g., "SRX000000").
      • This process utilizes the fastq-dl tool to fetch metadata from SRA and outputs the corresponding SRR accession(s).
      • If multiple SRR accessions are linked to a single sample, the workflow will output them as a comma-separated list.
    • This workflow is particularly useful for retrieving SRR accessions a few days after running Terra_2_NCBI workflows.
    • Import this workflow from Dockstore
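
As a rough illustration of the lane-merging idea behind Concatenate_Illumina_Lanes_PHB, the sketch below groups FASTQ files by sample and read direction and appends them in lane order. It is not the workflow's actual implementation: the bcl2fastq-style file-name pattern and the function names group_lanes and concatenate are assumptions made for this example.

```python
import gzip
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed bcl2fastq-style name: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def group_lanes(paths):
    """Group lane files by (sample, read), ordered by lane number."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        match = LANE_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        key = (match["sample"], f"read{match['read']}")
        groups[key].append((int(match["lane"]), path))
    return {key: [p for _, p in sorted(items)] for key, items in groups.items()}


def concatenate(parts, destination):
    """Append gzip files byte for byte into one output file."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

Appending the compressed files directly works because a gzip stream may consist of several concatenated members; Python's gzip module reads such a file as one continuous text.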

🚀 Changes to existing workflows

  • All Genomic Characterization Workflows

    • The read screen is now compatible with Dorado-produced FASTQ files
  • All Illumina Workflows

    • fastq_scan has been updated to the latest version
  • All TheiaCoV Workflows

    • The percentage of mapped reads is now output in all TheiaCoV workflows (except TheiaCoV_FASTA)
    • The default Nextclade dataset tags have been updated for SC2, mpox, flu, RSV-A, and RSV-B
    • The default Pangolin docker is now us-docker.pkg.dev/general-theiagen/staphb/pangolin:4.3.1-pdata-1.31
    • Kraken2 standalone is now used and databases must be provided.
  • TheiaCoV_Illumina_PE and TheiaCoV_ONT

    • Default parameters have been set for H5N1 flu
    • IRMA-assembled flu segments are now output in sorted order
  • All TheiaEuk Workflows

    • Additional genes for Candida auris are now examined by default in the Snippy_Gene_Query task
    • Bug fix to the snippy_variants_num_variants output column for Cryptococcus neoformans
  • TheiaMeta_Illumina_PE

    • MIDAS is now an optional task in TheiaMeta.
  • All TheiaProk Workflows

    • stxtyper was added to all TheiaProk workflows
  • TheiaProk_Illumina_PE and TheiaProk_Illumina_SE

    • Multi-lane Illumina data can now be used as input natively.
  • TheiaProk_Illumina_PE and TheiaProk_ONT

    • TBProfiler has been updated to v6.4.1
    • tbp-parser has been updated to v2.2.2
  • Augur_PHB

    • Versioning information for the tree-building tools is now available
  • All Freyja Workflows

    • Freyja now supports non-SARS-CoV-2 organisms natively.
  • Mercury_Prep_N_Batch

    • Errors no longer occur when data has been previously transferred
    • The correct information is now being provided for GISAID’s covv_coverage column for ClearLabs data
    • Failures now fail the task
  • Snippy Workflows

    • A new file with QC metrics has been created
    • Additional QC metrics are now output
  • Terra_2_NCBI_PHB

    • Collection dates will no longer have decimals

📚 Documentation Updates

  • Search tables better with table-specific search bars
  • Dead links removed
  • Generally improved documentation

What's Changed


v2.2.1

17 Sep 15:12
9a10de7

Public Health Bioinformatics v2.2.1 Patch Release Notes

🩹 This patch release fixes the output names for the NCBI-Scrub standalone workflows.

Our documentation has also been migrated to GitHub for easier maintenance.

Full release notes can be found here!
Find our documentation here!

What's Changed

  • [Documentation] Transfer all PHB documentation to GitHub by @sage-wright in #605
  • [NCBI Scrub Standalone Workflows] Correct output declarations for the number of spots removed by @cimendes in #610
  • [v2.2.1] update version tag by @sage-wright in #622

Full Changelog: v2.2.0...v2.2.1

v2.2.0

03 Sep 13:22
5be3433

Public Health Bioinformatics v2.2.0 Minor Release Notes

This minor release adds two new workflows, Create_Terra_Table_PHB and Snippy_Streamline_FASTA_PHB, and makes significant improvements to the TheiaProk, TheiaCoV, TheiaMeta, and Freyja workflow series. Additionally, several bug fixes have been made.

Full release notes can be found here!

Find our documentation here!

🆕 New workflows:

  • Create_Terra_Table_PHB

    • The manual creation of Terra tables can be tedious and error-prone. This workflow will automatically create your Terra data table when provided with the location of the files. It can import assembly, paired-end (Illumina) and single-end (Illumina and Oxford Nanopore) data.
    • Import the workflow from Dockstore.
  • Snippy_Streamline_FASTA_PHB

    • Since Snippy_Variants_PHB is now compatible with assembled sequences as input in FASTA format, we have developed Snippy_Streamline_FASTA, an all-in-one approach to generating a reference-based phylogeny using the Snippy tools, mirroring the Snippy_Streamline_PHB workflow. By default, it runs Snippy_Variants and Snippy_Tree, but will optionally run Assembly_Fetch if a reference genome is not provided.
    • Import the workflow from Dockstore.

🚀 Changes to existing workflows:

  • All TheiaProk Workflows

    • Genomic characterization with emmtyper is now enabled for Streptococcus pyogenes. (Thanks, @sam-baird!)
    • When call_ani is true, failures will no longer occur if multiple hits have the same score.
    • Support for Vibrio parahaemolyticus, Vibrio vulnificus and Enterobacter asburiae was added to the AMRFinderPlus task
    • VirulenceFinder now runs on Shigella sonnei samples.
    • The Docker containers for AMRFinderPlus, tbp-parser and mlst have been updated:
      • AMRFinderPlus: 3.12.8-2024-07-22.1
      • tbp-parser: tbp-parser:1.6.0
      • mlst: 2.23.0-2024-08-01
    • Genomic characterization can now be skipped by setting the new optional input perform_characterization to false.
    • The GAMBIT prokaryotic database has been updated to v2.0.0-20240628.
    • Optional inputs are now available for all tasks within the merlin_magic subworkflow.
  • All TheiaCoV Workflows

    • GenoFLU has been added for H5N1 influenza typing.
    • Additional VADR output files have been exposed:
      • File? vadr_feature_tbl_pass
      • File? vadr_feature_tbl_fail
      • File? vadr_classification_summary_file
      • File? vadr_all_outputs_tar_gz
    • Aligned FASTQs no longer contain supplemental/secondary alignments.
  • TheiaCoV_Illumina_PE_PHB and TheiaCoV_ONT_PHB

    • Workflow will no longer fail if an assembly cannot be produced. The assembly_fasta column will say "Assembly could not be generated".
  • TheiaEuk_Illumina_PE_PHB

    • TheiaEuk no longer abruptly fails if an organism outside of the expected list of taxa is detected by GAMBIT.
    • All optional inputs and docker containers for taxa-specific sub-modules have been exposed.
  • All ONT workflows (TheiaProk and TheiaCoV)

    • KMC is no longer used for genome-size prediction. Instead, for TheiaProk, the expected genome length is now set to 5 Mb, which is around 0.7 Mb larger than the average bacterial genome length. For TheiaCoV, species have default genome lengths associated with their organism tag.
  • TheiaCoV and TheiaMeta workflows

    • The human read removal tool (HRRT) has been updated to v2.2.1. For paired-end data, reads are first interleaved to guarantee that no mates are orphaned by this tool.
  • All Freyja Workflows

    • Freyja has been updated for all workflows to version 1.5.1.
    • SARS-CoV-2 UShER barcodes file is now a .feather file.
    • Freyja_FASTQ_PHB is now compatible with Illumina paired-end, Illumina single-end and Oxford Nanopore data. A new input ont has been added to control workflow behavior.
    • The UShER barcodes and lineage files used are now exposed as outputs in Freyja_FASTQ_PHB
  • Snippy_Variants_PHB

    • In addition to paired-end and single-end reads, assemblies are now accepted as input. To use Illumina sequencing data, pass the forward reads through the read1 optional input and, optionally, the reverse reads through read2. To use assembled genomes, provide the assembly_fasta input and omit read1 and read2.
  • SRA_Fetch_PHB

    • SRA-Lite files are now detected when the downloaded file is of low quality.
  • Augur_PHB

    • mpox mutation context has been added to the auspice_input_json output which displays the fraction of G->A or C->T.
  • GAMBIT_Query_PHB

    • The GAMBIT prokaryotic database has been updated to v2.0.0-20240628.
  • Mercury_Prep_N_Batch_PHB
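
The HRRT note above says that paired-end reads are interleaved first so that no mates are orphaned. A minimal sketch of that interleaving idea follows. It is an illustration only, not the task's actual code; the function names read_fastq and interleave are hypothetical, and the input is assumed to be well-formed four-line FASTQ records with mates in the same order in both files.

```python
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples from an iterable of lines."""
    iterator = iter(lines)
    while True:
        record = tuple(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(read1_lines, read2_lines):
    """Alternate read1 and read2 records so each pair stays adjacent."""
    pairs = zip_longest(read_fastq(read1_lines), read_fastq(read2_lines))
    for r1, r2 in pairs:
        if r1 is None or r2 is None:
            raise ValueError("read1 and read2 contain different numbers of records")
        yield from r1
        yield from r2
```

Because each pair travels through the filter as one adjacent unit, a tool that drops a read can drop its mate at the same time, which is the property the release note describes.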

What's Changed

  • [TheiaProk] Add emmtyper task for Streptococcus pyogenes by @sam-baird in #524
  • [SRA-Fetch] Detect SRA-Lite when it's low quality file by @cimendes in #512
  • Adding the Create_Terra_Table_PHB workflow by @sage-wright in #533
  • [Create_Terra_Table] recognize fastq files that end in .fq by @sage-wright in #535
  • [TheiaProk - ANI] prevent failures when multiple top hits have the same score by @sage-wright in #532
  • [TheiaCoV] Flu: Prevent workflow failures when assembly cannot be produced; generate NanoPlot outputs regardless of assembly success by @sage-wright in #530
  • [theiaprok] amrfinderplus: add support for Vibrio parahaemolyticus, Vibrio vulnificus, Enterobacter asburiae. Fix C diff bug by @kapsakcj in #542
  • [TheiaCoV] Add GenoFLU for flu whole-genome genotyping by @sage-wright in #540
  • [TheiaProk] Merlin_magic subwf bugfix: run virulencefinder on Shigella sonnei by @kapsakcj in #543
  • [TheiaCoV and TheiaMeta] Update hrrt (ncbi-scrub) to version 2.2.1 and optimise task by @cimendes in #527
  • [TheiaCoV and TheiaMeta - HRRT] Patch bug by removing unneeded awk verification by @cimendes in #550
  • Create CODEOWNERS by @AndrewLangvt in #554
  • [TheiaProk] Add additional input enabling characterization by @sage-wright in #547
  • Updating templates & broken links in the readme by @sage-wright in #555
  • [TheiaEuk] Fix bug where String outputs were being passed as File for Snippy_variants by @cimendes in #574
  • [TheiaProk] update tbp-parser to latest version by @sage-wright in #576
  • [Create_Terra_Table] fix bug, and enable ability for users to provide their own file ending suffixes by @sage-wright in #575
  • [theiacov] Add additional vadr output files & tarball; upgrade VADR docker by @kapsakcj in #556
  • [ONT] Remove KMC by @sage-wright in #578
  • [Create_Terra_Table] fix sample name i...

v2.1.0

26 Jun 14:14
d0377e1

Public Health Bioinformatics v2.1.0 Minor Release Notes

This minor release improves the utility and usability of several Oxford Nanopore Technologies’ dedicated workflows for viral and bacterial genomic characterization (TheiaCoV and TheiaProk). Additionally, support for new organisms has been added to several workflows.

Full release notes can be found here!

Find our documentation here!

🚀 Changes to existing workflows:

  • All TheiaProk Workflows

    • General Abricate is now available through the call_abricate and abricate_db optional inputs.
    • Abricate specifically for Vibrio cholerae is now available. It launches automatically if the gambit_predicted_taxon or expected_taxon is Vibrio cholerae.
    • A new optional parameter separate_betalactam_genes is now available that splits AMRFinderPlus beta-lactam hits into new columns.
    • The call_midas optional input is now set to false by default.
  • TheiaProk_Illumina_PE

    • New read quality-control outputs have been added: r1_mean_q_clean, r2_mean_q_clean, r1_mean_readlength_clean and r2_mean_readlength_clean.
  • TheiaProk_ONT

    • New read quality-control outputs have been added: nanoplot_r1_median_readlength_raw, nanoplot_r1_stdev_readlength_raw, nanoplot_r1_n50_raw, nanoplot_r1_median_q_raw, nanoplot_r1_est_coverage_raw, nanoplot_r1_median_readlength_clean, nanoplot_r1_stdev_readlength_clean, nanoplot_r1_n50_clean, nanoplot_r1_median_q_clean and nanoplot_r1_est_coverage_clean.
    • Kraken2 is now available through the call_kraken and kraken_db optional inputs.
    • A maximum genome size of 10 Mbp is set to prevent excessive runtimes.
  • All TheiaCoV Workflows

    • RSV-A and RSV-B are now able to be analyzed with the TheiaCoV workflows. Nextclade characterization and Kraken taxonomic analysis will now be run on RSV samples.
    • The following default organisms now have the following Nextclade dataset tags:
      Organism | New default Nextclade dataset tag
      SARS-CoV-2 | "2024-06-13--23-42-47Z"
      mpox | "2024-04-19--07-50-39Z"
      Flu H1N1 HA | "2024-04-19--07-50-39Z"
      Flu H1N1 NA | "2024-04-19--07-50-39Z"
      Flu H3N2 HA | "2024-04-19--07-50-39Z"
      Flu H3N2 NA | "2024-04-19--07-50-39Z"
      Flu Victoria HA | "2024-04-19--07-50-39Z"
      Flu Victoria NA | "2024-04-19--07-50-39Z"
  • TheiaCoV Flu Track

    • All of the flu-specific tasks now live in their own sub-workflow, flu_track. This has no effect on the end-user.
    • In TheiaCoV_ONT, flu samples will now have both the HA and NA segment’s assembly mean coverage appear in the assembly_mean_coverage output variable. This reflects the behaviour already present on TheiaCoV_Illumina_PE.
    • The all-segments FASTA header lines now include samplename.
    • The new output irma_subtype_notes now indicates if IRMA was able to determine the flu subtype
    • All workflows now use abricate_flu_subtype (instead of irma_subtype) to select the appropriate nextclade_dataset_tag.
    • Nextclade output columns for flu now explicitly state either HA or NA.
    • Padded assemblies, in which any - characters in the final assembly file are removed and any . characters are replaced by N, are now provided to MAFFT and VADR to prevent task failures.
  • Terra_2_NCBI

    • Skipping BioSample submission via the skip_biosample optional input now also removes the requirement to have BioSample metadata in your data table.
  • Augur_Prep_PHB and Augur_PHB

    • RSV-A and RSV-B can now be analyzed with the Augur workflows.
    • Metadata is no longer required to run Augur. Only a distance tree will be created if metadata is not provided.
  • kSNP3 and other phylogenetic inference workflows

    • Outputs from phylogenetic workflows (SNP matrices) and the summarize_data task will now have a properly toggleable Phandango coloring suffix.
    • The phandango_coloring optional input is now off by default.
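
The padded-assembly cleanup mentioned in the flu track notes above (removing - and replacing . with N) can be sketched in a few lines. This is an assumption-laden illustration rather than the workflow's own code: the function names are hypothetical, and the sketch assumes the cleanup applies only to sequence lines, never to FASTA headers.

```python
def clean_padded_sequence(sequence):
    """Drop '-' characters and replace '.' with 'N' in one sequence line."""
    return sequence.replace("-", "").replace(".", "N")


def clean_fasta(lines):
    """Clean sequence lines only; header lines (starting with '>') pass through."""
    for line in lines:
        yield line if line.startswith(">") else clean_padded_sequence(line)
```

For example, clean_padded_sequence("AC-GT..A") yields "ACGTNNA", while a header such as ">seg-1.HA" is left unchanged by clean_fasta.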

Docker container updates:

  • IRMA has been updated to version v1.1.5
  • AMRFinderPlus has been updated to version v3.12.8-2024-05-02.2
  • ts_mlst database has been updated as of 2024-06-01
  • Pangolin database has been updated to pdata v1.27

🐛 Bug fixes and small improvements:

  • TheiaProk_ONT and TheiaProk_FASTA: Hicap was being run in TheiaProk_ONT but the outputs were never appearing in the data table! This has been fixed.
  • All TheiaCoV workflows: Unsupported organisms will no longer cause workflow failures.
  • Terra_2_NCBI: Fixed a typo when using the Wastewater Biosample package that was causing an error.
  • Freyja_Dashboard: The freyja_dasbhoard output variable now correctly says freyja_dashboard.
  • Workflows that accept String inputs that are used to name things: Several input variables such as cluster_name now accept Strings with whitespace.
  • All workflows: Runtime parameters have been adjusted for several tasks.
  • TheiaCoV Flu Track: A bug has been fixed for IRMA running out of disk space. Additionally, another bug affecting Flu B samples was fixed related to empty HA segment FASTA files.
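
The whitespace handling for name-like String inputs (listed in the changelog for this release as converting spaces to dashes) could look roughly like the sketch below. The function name sanitize_name is hypothetical, and trimming the ends and collapsing runs of whitespace into a single dash are assumptions of this example, not documented behavior.

```python
import re


def sanitize_name(name):
    """Replace each run of whitespace with a single dash (assumed behavior)."""
    return re.sub(r"\s+", "-", name.strip())
```

With this sketch, a cluster_name of "cluster A 2024" becomes "cluster-A-2024".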

What's Changed

  • TheiaCoV wf support for RSV - run nextclade by default and small optimizations (kraken_target_organism, genome_length) by @kapsakcj in #436
  • [New workflow - internal] Gambitcore for assembly quality assessment with GAMBIT by @cimendes in #466
  • [TheiaProk_ONT and TheiaCoV_ONT] Expose additional QC metrics from nanoplot for both raw and clean reads by @cimendes in #452
  • Exposing r1 and r2 mean_q_clean and mean_readlength_clean by @jrotieno in #455
  • [TheiaProk_ONT] add patch fix to kmc estimated genome size to not go over 10Mbp by @cimendes in #459
  • Add abricate as optional module by @jrotieno in #431
  • [TheiaProk_ONT] Add Kraken2 as part of read_qc by @cimendes in #438
  • [Flu] Assembly mean coverage & read screen clean-up by @sage-wright in #469
  • [Freyja_Dashboard] fix typo in freyja_dashboard output File variable name by @AndrewLangvt in #482
  • [Terra_2_NCBI] remove metadata requirements with skip_biosample == true by @sage-wright in #475
  • Augur Updates for RSV-A and RSV-B by @jrotieno in #478
  • [kSNP3] fix behaviour when phandango colouring is set to false by @cimendes in #496
  • [Internal] Updating runtime parameters by @sage-wright in #494
  • Automatically convert spaces to dashes in workflows that accept strings by @AndrewLangvt in #498
  • [TheiaCoV] Enable user to run TheiaCoV with an unsupported organism by @sage-wright in #501
  • [AMRFinderPlus] parse BETA-LACTAM genes and subclasses into individual output columns by @sage-wright in #505
  • IRMA bug fixes & improvements; theiacov_illumina_pe wf updates for Flu by @kapsakcj in #468
  • Augur_PHB: Set sample_metadata_tsvs input to optional by @jrotieno in #503
  • [Internal - Gambitcore] Downgrade database to stable 1.3.0 version by @cimendes in #473
  • [TheiaCoV_Illumina_PE & _ONT] Create sub-workflow for flu-specific modules by @sage-wright in #502
  • [TheiaProk] Add abricate module for vibrio characterization by @cimendes in #429
  • [TheiaProk] expose hicap outputs in theiaprok_fasta and theiaprok_ont by @cimendes in #508
  • Fix typo in Terra_2_NCBI Wastewater metadata by @michellescribner in #519
  • [TheiaProk] Update amrfinderplus to v3.12.8; DB: v2024-05-02.2; reduce compute resources by @kapsakcj in #514
  • [TheiaProk] upgrade mlst docker image to 2024-06-01 staphb build; reduced runtime parameters; enable preemptible by @kapsakcj in #516
  • update default...

v2.0.1

01 May 21:54
e6c97dc

Public Health Bioinformatics v2.0.1 Patch Release Notes

🩹 This patch release updates the default midas_db location

Full release notes can be found here!
Find our documentation here!

What's Changed

Full Changelog: v2.0.0...v2.0.1

v2.0.0

22 Apr 18:15
880a66c

Public Health Bioinformatics v2.0.0 Release Notes

This major release simplifies the usage of the TheiaCoV workflows and does major restructuring on all inputs and outputs on several workflows, including TheiaCoV, TheiaProk, TheiaEuk, and TheiaMeta. Additionally, it introduces three new workflows, improves on several workflows, and resolves various bugs.

Full release notes can be found here.

All inputs and outputs have been standardized across all of PHB. More information can be found here.

Find our documentation here!

🆕 New workflows:

  • Kraken2_ONT_PHB

  • TBProfiler_tNGS_PHB

    • This workflow is still in a beta state; development is currently ongoing.
    • It is used to process targeted next-generation sequencing (tNGS) Mycobacterium tuberculosis data for antimicrobial resistance (AMR) characterization with TBProfiler and tbp-parser. It includes quality assessment and control with Trimmomatic.
    • Import the workflow from Dockstore
  • Find_Shared_Variants_PHB

    • Find_Shared_Variants_PHB is a workflow for concatenating the variant results produced by the Snippy_Variants_PHB workflow across multiple samples and reshaping the data to illustrate variants that are shared among multiple samples.
    • Import this workflow from Dockstore

🚀 Changes to existing workflows:

  • TheiaCoV, TheiaProk, TheiaEuk and TheiaMeta workflows

    • All inputs and outputs have been standardized across all workflow series
  • TheiaCoV Workflow Series

    • The workflow_parameters sub-workflow now controls all taxa-specific optional inputs in TheiaCoV. The default value for the organism input is still set to "sars-cov-2".

    • VADR is now enabled for flu, rsv-a and rsv-b.

    • Nextclade has been updated to v3. Older dataset tags than the ones provided by default are not compatible with the current version. See below for the list of updated nextclade_dataset_tags.

    • Nextclade dataset names & their default values in TheiaCoV workflows have also changed. For example "sars-cov-2" is now called "nextstrain/sars-cov-2/wuhan-hu-1/orfs". The name "sars-cov-2" still works as an alias, but we recommend using the full name because it is more descriptive and clearer, and will be supported by Nextclade for the foreseeable future.

      Organism | Old Dataset Name | New Dataset Name | New Dataset Tag
      SARS-CoV-2 | "sars-cov-2" | "nextstrain/sars-cov-2/wuhan-hu-1/orfs" | 2024-04-15--15-08-22Z
      Mpox (specifically, Mpox lineage B.1 dataset) | "hMPXV_B1" | "nextstrain/mpox/lineage-b.1" | 2024-01-16--20-31-02Z
      Influenza A H1N1 HA | "flu_h1n1pdm_ha" | "nextstrain/flu/h1n1pdm/ha/MW626062" | 2024-01-16--20-31-02Z
      Influenza A H3N2 HA | "flu_h3n2_ha" | "nextstrain/flu/h3n2/ha/EPI1857216" | 2024-02-22--16-12-03Z
      Influenza B Victoria HA | "flu_vic_ha" | "nextstrain/flu/vic/ha/KX058884" | 2024-01-16--20-31-02Z
      Influenza B Yamagata HA | "flu_yam_ha" | "nextstrain/flu/yam/ha/JN993010" | 2024-01-30--16-34-55Z
      Influenza A H1N1 NA | "flu_h1n1pdm_na" | "nextstrain/flu/h1n1pdm/na/MW626056" | 2024-01-16--20-31-02Z
      Influenza A H3N2 NA | "flu_h3n2_na" | "nextstrain/flu/h3n2/na/EPI1857215" | 2024-01-16--20-31-02Z
      Influenza B Victoria NA | "flu_vic_na" | "nextstrain/flu/vic/na/CY073894" | 2024-01-16--20-31-02Z
      RSV-A | "rsv_a" | "nextstrain/rsv/a/EPI_ISL_412866" | 2024-01-29--10-29-43Z
      RSV-B | "rsv_b" | "nextstrain/rsv/b/EPI_ISL_1653999" | 2024-01-29--10-29-43Z
  • TheiaCoV Flu Track

    • For the flu track:
      • Tamiflu-resistance determination has been removed in favor of the oseltamivir nomenclature. Additionally, amantadine and rimantadine were added.
        • We now check for antiviral resistance mutations against the following 13 antiviral drugs: A_315675, amantadine, compound_367, favipiravir, fludase, L_742_001, laninamivir, peramivir, pimodivir, rimantadine, oseltamivir, xofluza, zanamivir.
      • For TheiaCoV_Illumina_PE, assembly coverage is now computed for both HA and NA segments
      • Nextclade outputs are now computed for the NA fragment as well as HA
  • TheiaProk Workflow Series

    • Plasmidfinder can now be toggled off through the call_plasmidfinder optional input
    • Trimmomatic encoding is now set to 33 by default to avoid failures when processing SRA-Lite formatted FASTQ files
  • TheiaMeta

    • Automated binning has been integrated into TheiaMeta when a reference file is not provided. Binning is performed with SemiBin2
    • The assembly module optional inputs have been exposed, allowing the user to control the behavior of metaSPAdes and Pilon
  • SRA_Fetch

    • A new warning column has now been implemented indicating if the downloaded file is suspected to be in SRA-Lite format
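
For anyone updating scripts or Terra inputs after the Nextclade v3 change, the dataset renames in the table above can be captured as a simple lookup. The sketch below copies the names and tags from that table; the dictionary name NEXTCLADE_DATASETS and the helper resolve_dataset are hypothetical conveniences for this example and are not part of PHB or Nextclade.

```python
# Old alias -> (full dataset name, default tag), as listed in the v2.0.0 notes.
NEXTCLADE_DATASETS = {
    "sars-cov-2": ("nextstrain/sars-cov-2/wuhan-hu-1/orfs", "2024-04-15--15-08-22Z"),
    "hMPXV_B1": ("nextstrain/mpox/lineage-b.1", "2024-01-16--20-31-02Z"),
    "flu_h1n1pdm_ha": ("nextstrain/flu/h1n1pdm/ha/MW626062", "2024-01-16--20-31-02Z"),
    "flu_h3n2_ha": ("nextstrain/flu/h3n2/ha/EPI1857216", "2024-02-22--16-12-03Z"),
    "flu_vic_ha": ("nextstrain/flu/vic/ha/KX058884", "2024-01-16--20-31-02Z"),
    "flu_yam_ha": ("nextstrain/flu/yam/ha/JN993010", "2024-01-30--16-34-55Z"),
    "flu_h1n1pdm_na": ("nextstrain/flu/h1n1pdm/na/MW626056", "2024-01-16--20-31-02Z"),
    "flu_h3n2_na": ("nextstrain/flu/h3n2/na/EPI1857215", "2024-01-16--20-31-02Z"),
    "flu_vic_na": ("nextstrain/flu/vic/na/CY073894", "2024-01-16--20-31-02Z"),
    "rsv_a": ("nextstrain/rsv/a/EPI_ISL_412866", "2024-01-29--10-29-43Z"),
    "rsv_b": ("nextstrain/rsv/b/EPI_ISL_1653999", "2024-01-29--10-29-43Z"),
}


def resolve_dataset(name):
    """Return (full dataset name, default tag) for an old alias or a full name."""
    if name in NEXTCLADE_DATASETS:
        return NEXTCLADE_DATASETS[name]
    for full_name, tag in NEXTCLADE_DATASETS.values():
        if name == full_name:
            return full_name, tag
    raise KeyError(f"unknown Nextclade dataset: {name}")
```

Resolving to the full name in one place follows the recommendation above to prefer full dataset names over the shorter aliases.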

Docker container updates:

  • Augur has been updated to commit hash cec4fa0ecd8612e4363d40662060a5a9c712d67e, from 2024-02-01
  • BUSCO has been updated to version v5.7.1. Due to memory issues when running eukaryotic assemblies, TheiaEuk was excluded from this update and still runs on version v5.3.2
  • pasty has been updated to version v1.3.0
  • tbp-parser has been updated to version v1.4.2
  • theiavalidate has been updated to version v0.1.0
  • ts_mlst database has been updated as of April 2024
  • VADR has been updated to version v1.6.3

🐛 Bug fixes and small improvements:

  • All workflows: Fastq_Scan outputs have been renamed (now prefixed with fastq_scan_*) to differentiate them from fastQC. Several outputs for FastP and fastQC are now exposed such as the respective report HTMLs.
  • TheiaCoV (all workflows): Edge-case bugs in QC_check and Pangolin have been resolved. The percent gene coverage task has been modularized.
  • TheiaCoV Illumina PE: read1_aligned, read1_unaligned, read2_aligned, read2_unaligned, sorted_bam_aligned, sorted_bam_aligned_bai, sorted_bam_unaligned, and sorted_bam_unaligned_bai are now output by the workflow.
  • TheiaProk (all workflows): midas_secondary_genus_coverage (the secondary genus absolute coverage) is now output.
  • TheiaEuk: Several outputs from the snippy_variants task have been exposed: snippy_variants_num_reads_aligned, snippy_variants_num_variants, snippy_variants_coverage_tsv, and snippy_variants_percent_ref_coverage.
  • BaseSpace_Fetch: A fix has been implemented that greatly speeds up the download of data from BaseSpace when using Basespace "Projects" to organize sequencing runs.
  • Snippy_Streamline: snippy_concatenated_variants and snippy_shared_variants are now exposed as Snippy_Streamline outputs. The snippy_snp_matrix output has been deprecated in favor of snippy_wg_snp_matrix and snippy_cg_snp_matrix.
  • kSNP3: ksnp3_number_snps, ksnp3_number_core_snps and ksnp3_core_snp_table have been added to the collection of outputs.
  • Kraken2 Standalone (all workflows): Uncompressed read files can now be processed by all Kraken2 workflows
  • Freyja_FASTQ: A new optional input depth_cutoff has been added, giving the user the option to exclude sites with coverage depth below the provided value (by default no cutoff is performed). New outputs added: freyja_coverage and freyja_barcode_file

What's Changed


v1.3.0

17 Jan 18:03
c3f3b70

Public Health Bioinformatics v1.3.0 Release Notes

This minor release introduces two new workflows, improves on several workflows, and resolves various bugs.

Full release notes can be found here.

🆕 New workflows:

🚀 Changes to existing workflows:

  • TheiaCoV_ONT_PHB

    • Influenza is now supported. Use "flu" for the organism optional input String parameter.
      • "sars-cov-2" and "HIV" tracks are unchanged.
  • TheiaProk Workflow Series

    • If user-input (expected_taxon) or predicted taxon by Gambit belongs to the Shigella genus, the Extensively Drug-Resistant phenotype is predicted using the new resfinder pointfinder database.
    • If user-input (expected_taxon) or predicted taxon by Gambit is the Mycobacterium tuberculosis species, bcftools indexes and merges all potential VCF files created by TbProfiler (both .bcf and .gz files).
    • Kraken2 has been added as an optional module (except for TheiaProk_ONT_PHB). If call_kraken is true, a database must be provided through kraken_db.
    • Two new optional inputs were added to control ANIm behaviour: ani_threshold (default 85.00) and percent_bases_aligned_threshold (default 70.00).
  • TheiaCoV_FASTA_PHB

    • The list of allowed input organism now includes "sars-cov-2" (default), "rsv_a", "rsv_b", "WNV", "MPXV" and "flu".
  • TheiaCoV_Illumina_PE_PHB

    • If organism is set to "flu", the workflow searches for antiviral mutations in the HA, NA, PA, PB1 and PB2 assembly segments, targeting the following 10 antivirals: A_315675, compound_367, Favipiravir, Fludase, L_742_001, Laninamivir, Peramivir, Pimodivir, Xofluza and Zanamivir.
  • All Illumina SE and PE Workflows

    • A new optional input, read_qc, allows the user to choose between fastq_scan and fastqc for evaluating read quality. The affected workflows are: TheiaCoV_Illumina_PE_PHB, TheiaCoV_Illumina_SE_PHB, TheiaProk_Illumina_SE_PHB, TheiaProk_Illumina_PE_PHB, TheiaMeta_Illumina_PE_PHB and Freyja_FASTQ_PHB.
  • CZGenEpi_Prep_PHB

    • Instead of extracting the sample_is_private_column_name and the gisaid_id_column_name columns, these columns are now generated by the program using already-provided inputs and by the new is_private Boolean variable which is used to set the value for all samples in the set. The field "GISAID ID (Public ID) - Optional" will now reflect the GISAID syntax for Virus Name.
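
The two ANIm thresholds introduced for TheiaProk above, ani_threshold (default 85.00) and percent_bases_aligned_threshold (default 70.00), act as filters on candidate hits. The sketch below shows one way such filtering could work. It is not the task's actual logic: the AniHit record, the best_hit function, the inclusive comparison, and the tie-breaking order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AniHit:
    reference: str
    ani: float            # average nucleotide identity, in percent
    bases_aligned: float  # share of bases aligned, in percent


def best_hit(hits, ani_threshold=85.0, percent_bases_aligned_threshold=70.0):
    """Return the highest-ANI hit that passes both thresholds, or None."""
    passing = [
        hit
        for hit in hits
        if hit.ani >= ani_threshold
        and hit.bases_aligned >= percent_bases_aligned_threshold
    ]
    if not passing:
        return None
    # Assumed tie-break: prefer the hit with more bases aligned.
    return max(passing, key=lambda hit: (hit.ani, hit.bases_aligned))
```

Requiring both conditions discards hits with high identity over only a small aligned fraction, which matches the stated aim of reducing false positives.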

Docker container updates:

  • AMRFinderPlus has been updated to version v3.11.20 and database 2023-09-26.1
  • tbp-parser has been updated to version 1.2.0
  • Freyja has been updated to version 1.4.8
  • ts_mlst database has been updated as of January 2024
  • Gambit has been updated to version 1.3.0, including its database files
  • Pangolin4 has been updated to version 4.3.1-pdata-1.23.1
  • IRMA has been updated to version 1.1.3

Tag updates:

  • SARS-CoV-2 Nextclade Dataset Tag has been updated to 2023-12-03T12:00:00Z

🐛 Bug fixes and small improvements:

  • kSNP3_PHB: The ksnp3_core_vcf output has been renamed to ksnp3_vcf_ref_genome for readability. Additionally, two new outputs are provided: ksnp3_vcf_snps_not_in_ref and ksnp3_vcf_ref_samplename.
  • TheiaProk Workflow Series: The MIDAS task was adjusted to reduce logging, and therefore the size of the log file, aiding debugging & reducing storage costs.
  • TheiaMeta_Illumina_PE_PHB: A new task Krona was added for the visualization of the Kraken2 reports.
  • Mercury_Prep_N_Batch: The excluded_samples.tsv is now printed to the execution log file, aiding debugging.
  • TheiaCoV Workflow Series: The nextclade_lineage output now populates correctly for SARS-CoV-2. Additionally, the nextclade_qc field is now exposed as an output.
  • Augur_PHB: The AUGUR refine input clock_filter_iqd has been reverted to the previous default value of 4.
  • Kraken Standalone Workflows: A new task Krona was added for the visualization of the Kraken2 reports.
  • TheiaValidate_PHB: TheiaValidate now outputs a table with validation-criteria failures only. Additionally, a new input was added that can translate different column names between tables to enable comparison.
  • TheiaCoV_ONT_PHB: If a sample fails quality check with read screening, this will no longer cause the workflow to fail. Instead, it will finish with an appropriate message.
  • Samples_To_Ref_Tree_PHB: The organism input has been renamed to nextclade_dataset_name for better clarity.
  • Various workflows: Call caching was disabled in the following workflows: BaseSpace_Fetch_PHB, Transfer_Column_Content_PHB, Assembly_Fetch_PHB, Snippy_Streamline_PHB and TheiaValidate_PHB.

What's Changed

  • updated VCF output file renaming in kSNP3 task by @kapsakcj in #207
  • reduce unnecessary logging in MIDAS task by @kapsakcj in #210
  • update default amrfinderplus docker image to v3.11.20 and db 2023-09-26.1 by @kapsakcj in #229
  • TheiaCoV_ONT_PHB Influenza Track by @jrotieno in #233
  • TheiaCoV_FASTA_Batch: TheiaCoV_FASTA, for many samples at once by @sage-wright in #238
  • Add krona task to TheiaMeta_Illumina_PE by @cimendes in #213
  • added 2 QC thresholds to ANI task to reduce false positives by @kapsakcj in #168
  • Resfinder improvements, added support for Shigella spp., added XDR Shigella prediction by @kapsakcj in #159
  • disable call caching for various workflows by @kapsakcj in #251
  • Mercury_Prep_N_Batch: print the excluded_samples.tsv and update Docker to avoid Google SDK warning by @sage-wright in #220
  • Nextclade Output Added by @DOH-HNH0303 in #239
  • TheiaCoV_FASTA: Adding five new organisms by @jrotieno in #194
  • Update task_augur_refine iqd back to 4 by @jrotieno in #268
  • TheiaCoV Illumina PE: Identify Influenza Antiviral Resistance Mutations in Assemblies by @jrotieno in #252
  • [New Utility] Workflow to rename FASTQ files (non-destructive) by @cimendes in #267
  • [TheiaCoV_Fasta_Batch] Substitute FASTA concatenating task to ensure proper sample_id propagation by @cimendes in #274
  • Kraken2 Standalone: add krona visualisation by @cimendes in #225
  • TheiaValidate_PHB: new features and new Docker image from TheiaValidate repository by @sage-wright in #255
  • TheiaProk TB: new VCF output and modification to the coverage report by @sage-wright in #245
  • TheiaCoV_ONT: prevent failure by coercing files into strings by @sage-wright in #288
  • update default freyja docker image to 1.4.8 for multiple tasks by @kapsakcj in #289
  • FastQC added as an optional module in all Illumina_PE and Illumina_SE workflows by @sage-wright in #260
  • update docker to version tag 2.23.0-2024-01 by @cimendes in #293
  • [TheiaProk Workflows] Add Kraken2 as optional module by @cimendes in #286
  • CZG...

v1.2.1

23 Oct 20:19
ab54419

Public Health Bioinformatics v1.2.1 Release Notes

This patch release resolves various bugs and updates workflow defaults.

🐛 Bug Fixes

🦑 Kraken2_PE

  • A bug was fixed in the Kraken2_PE_PHB standalone workflow, which expected required outputs from the Kraken2_standalone task that are now optional. This resolves the failure encountered when importing the workflow.

Impacted Workflows/Tasks:

  • Kraken2_PE_PHB

The following workflows use the Kraken2_standalone task but have not been affected, as they do not require the affected outputs:

  • TheiaMeta_Illumina_PE_PHB
  • Kraken2_SE_PHB

The following workflows use a different Kraken2 task and have not been affected:

  • TheiaCoV_Illumina_PE_PHB
  • TheiaCoV_Illumina_SE_PHB

🌲 Augur

  • The requirement to provide genes and colors input files was causing run failures for non-MPXV tree builds. These files are no longer required.

Users reported issues with optional Augur_PHB inputs, specifically colors_tsv, with the following error messages:

  • Error_1: "Failed to evaluate 'colors_tsv' (reason 1 of 1): Evaluating select_first([colors, mpxv_defaults.colors]) failed: select_first was called with 2 empty values. We needed at least one to be filled."
  • Error_2: "Failed to evaluate 'genes' (reason 1 of 1): Evaluating select_first([genes, mpxv_defaults.genes]) failed: select_first was called with 2 empty values. We needed at least one to be filled."
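
Both errors come from WDL's select_first, which returns the first defined value in an array and fails when every element is undefined. The following is a minimal Python sketch of that behavior, not the Augur_PHB source: the Python function is a stand-in for the WDL built-in, and the variable names and values are hypothetical.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_first(values: Sequence[Optional[T]]) -> T:
    """Return the first value that is not None, mimicking WDL's select_first."""
    for value in values:
        if value is not None:
            return value
    raise ValueError(
        f"select_first was called with {len(values)} empty values. "
        "We needed at least one to be filled."
    )


# Hypothetical inputs: the user supplied no colors file, and the
# organism-specific defaults are assumed to be empty for non-MPXV builds.
user_colors = None
mpxv_default_colors = None

try:
    select_first([user_colors, mpxv_default_colors])
except ValueError as err:
    print(err)
```

Making the inputs truly optional means the workflow never has to resolve a value when neither the user nor the defaults provide one.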

📚 Read Screen

  • The read screen task is designed to assess the quantity and quality of reads used as the input to the workflow, and halt the workflow if it is determined that the reads are insufficient. One of the qualities of the reads that is checked is the proportion of reads found in the R1 and R2 files.
    • The former implementation did not calculate the proportion of reads correctly, and the reported error message did not reflect the defined parameter correctly.
    • The math has been updated so that the ratio cannot be unbalanced beyond a 60/40 split.
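
The 60/40 rule can be sketched as a check on each file's share of the total read count. This is an illustration only and not the read screen task's code: the function name, the max_share parameter, and the decision to accept a split of exactly 60/40 are all assumptions.

```python
def reads_are_balanced(r1_reads: int, r2_reads: int, max_share: float = 0.60) -> bool:
    """Return True when neither file holds more than max_share of all reads.

    max_share=0.60 encodes the 60/40 split; accepting exactly 60/40 is an
    assumption of this sketch.
    """
    total = r1_reads + r2_reads
    if total == 0:
        return False  # a sample with no reads cannot pass a read screen
    larger_share = max(r1_reads, r2_reads) / total
    return larger_share <= max_share


print(reads_are_balanced(500_000, 500_000))  # evenly split pair
print(reads_are_balanced(700_000, 300_000))  # 70/30 split
```

Comparing the larger share against a single threshold treats R1-heavy and R2-heavy samples symmetrically.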

🔧 Workflow Updates

Workflows

🔬 TheiaCoV Workflows

  • The default nextclade_dataset_tag for SARS-CoV-2 was updated to "2023-09-21T12:00:00Z" (as of 2023-10-10) across all 5 TheiaCoV workflows:
    • TheiaCoV_Illumina_PE_PHB, TheiaCoV_Illumina_SE_PHB, TheiaCoV_ClearLabs_PHB, TheiaCoV_ONT_PHB, TheiaCoV_FASTA_PHB

🦠 TheiaProk Workflows

  • KmerFinder was added to the TheiaProk suite of workflows to find the best match (species identification) of a fasta file in a (kmer) database (downloaded on 2023-09-11).
New Outputs
  • kmerfinder_docker
  • kmerfinder_results_tsv
  • kmerfinder_top_hit
  • kmerfinder_query_coverage
  • kmerfinder_template_coverage
  • kmerfinder_database

Task Files

🎙️ UShER

  • The runtime environment for the UShER task has been allocated additional compute resources to allow for larger input sets.
  • The following defaults for the UShER task were changed:
    • CPU 4 -> 8
    • Memory 8 -> 32
  • Impacted Workflows/Tasks
    • UShER_PHB is the only affected workflow.

🔎 Pilon

  • The runtime environment for the Pilon task has been allocated additional compute resources to allow for larger input sets.
  • The following defaults for the Pilon task were changed:
    • CPU 4 -> 8
    • Memory 8 -> 32
  • Impacted Workflows/Tasks
    • TheiaMeta_Illumina_PE_PHB is the only affected workflow.

🏭 What's Changed

Full Changelog: v1.2.0...v1.2.1

Please see the full documentation for the PHB repository v1.2.1 release.

v1.2.0

03 Oct 14:02
801baa2

Public Health Bioinformatics v1.2.0 Release Notes

This minor release introduces three new workflows and resolves various bugs.

New workflows:

  • TheiaMeta_Illumina_PE_PHB
    This workflow offers a versatile approach to de novo metagenomic assembly, providing the option to use either reference-based or reference-independent metagenomic assembly. Taxonomic characterization is also performed with Kraken2.

  • CZGenEpi_Prep_PHB
    The CZGenEpi_Prep workflow formats metadata and assembly files for seamless integration with the Chan Zuckerberg GEN EPI platform.

  • Samples_to_Ref_Tree_PHB
    In this workflow, Nextclade is used to rapidly place new samples onto an existing reference phylogenetic tree. Phylogenetic placement is done by comparing the mutations of the query sequence (relative to the reference) with the mutations of every node and tip in the reference tree, and finding the node which has the most similar set of mutations. This operation is repeated for each query sequence, until all of them are placed onto the tree.
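
The placement idea described for Samples_to_Ref_Tree_PHB can be sketched in a few lines. This is a toy illustration and not Nextclade's implementation: the distance metric (size of the symmetric difference between mutation sets), the function name, and the tree and sample data are all assumptions.

```python
from typing import Dict, Set


def place_on_tree(query: Set[str], tree_nodes: Dict[str, Set[str]]) -> str:
    """Return the name of the node whose mutation set is closest to the query.

    Distance is the number of mutations present in one set but not the
    other; this metric is an assumption of the sketch.
    """
    return min(tree_nodes, key=lambda name: len(query ^ tree_nodes[name]))


# Hypothetical reference tree: node name -> mutations relative to the reference.
tree = {
    "root": set(),
    "node_A": {"C241T", "A23403G"},
    "tip_A1": {"C241T", "A23403G", "G28881A"},
    "tip_B1": {"G11083T"},
}

# Hypothetical query samples with their mutations relative to the reference.
queries = {
    "sample_1": {"C241T", "A23403G", "G28881A", "T100C"},
    "sample_2": {"G11083T"},
}

# Placement is repeated independently for each query sequence.
placements = {name: place_on_tree(muts, tree) for name, muts in queries.items()}
print(placements)
```

Because each query is compared only against the fixed reference tree, queries can be placed in any order without affecting one another.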

Changes in existing workflows

  • Kraken2_SE_PHB
    Kraken2 output files were not being correctly identified by the single-end standalone workflow, causing it to fail unexpectedly. Output files should now populate on the Terra datatable correctly.

  • KMC
    The output type of est_genome_size is now an int so data can be sorted numerically in a Terra datatable when running TheiaProk_ONT. Additionally, this task no longer runs unnecessarily for the TheiaCoV_ONT workflow.

  • TS_MLST
    The database has been updated as of August 2023.

    New outputs:

    • ts_mlst_docker
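
The KMC change above matters because text and integer values sort differently. The snippet below shows the general behavior in Python with hypothetical est_genome_size values; it makes no claim about how Terra implements sorting.

```python
# Hypothetical est_genome_size values as they might appear in a table column.
as_strings = ["4800000", "12000000", "950000"]
as_ints = [int(value) for value in as_strings]

# Strings are compared character by character, so "12000000" sorts first.
sorted_strings = sorted(as_strings)
# Integers are compared by magnitude.
sorted_ints = sorted(as_ints)

print(sorted_strings)
print(sorted_ints)
```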

Mycobacterium tuberculosis changes

  • TBProfiler
    The default variant caller has been adjusted to FreeBayes to accurately identify resistance-conferring deletions and multi-nucleotide variants (MNVs).

  • tbp-parser
    A TBProfiler parsing module has been added to apply variant interpretation logic based on recommendations by the WHO, CDC and CDPH to produce antitubercular drug resistance calls. Additionally, a set of machine and human-interpretable files are produced to facilitate data sharing and interpretation. Find the source code here.

    New inputs:

    • tbprofiler_output_seq_method_type (default="WGS")
    • tbprofiler_operator (default="")
    • tbp_parser_min_depth (default=10)
    • tbp_parser_coverage_threshold (default=100)
    • tbp_parser_debug (default=false)
    • tbp_parser_docker_image (default="us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.0.1")

    New outputs:

    • tbprofiler_lims_report_csv
    • tbprofiler_looker_csv
    • tbprofiler_laboratorian_report_csv
    • tbprofiler_resistance_genes_percent_coverage
    • tbp_parser_genome_percent_coverage
    • tbp_parser_version
    • tbp_parser_docker
  • Clockwork
    The clockwork module has been added to decontaminate read files of sequencing data that may come from a nontuberculous mycobacteria (NTM) or human genome.

    New outputs:

    • clockwork_decontaminated_read1
    • clockwork_decontaminated_read2
  • TBDB
    The TBProfiler module uses a database called TBDB. We have modified the code to allow for custom databases to be used in place of the default TBDB. Additionally, we have created a custom database including mutations from TBDB, the WHO catalog, and a list of mutations included in the CDC's MTB pipeline Varpipe.

    By default, TBProfiler runs with the default database. If the Boolean input tbprofiler_run_custom_db is set to true and no database is provided by the user, TBProfiler uses a database containing both TBProfiler's TBDB and CDC Varpipe's collection of resistance-conferring mutations. In this database, duplicate entries have been manually curated by removing the TBDB entry in favor of Varpipe's mutation annotation.

    New inputs:

    • tbprofiler_run_custom_db (default=false)
    • tbprofiler_custom_db (default="gs://theiagen-public-files/terra/theiaprok-files/tbdb_varpipe_combined.tar.gz")
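
The database selection described for TBDB can be summarized as a small decision function. This is a sketch of the documented behavior and not the workflow's WDL: the function name and the use of None to mean "TBProfiler's built-in TBDB" are assumptions, as is ignoring a supplied database when tbprofiler_run_custom_db is false.

```python
from typing import Optional

# Default value of tbprofiler_custom_db as listed in these release notes.
COMBINED_DB = "gs://theiagen-public-files/terra/theiaprok-files/tbdb_varpipe_combined.tar.gz"


def choose_database(run_custom_db: bool = False, custom_db: Optional[str] = None) -> Optional[str]:
    """Return the database archive to use, or None for TBProfiler's built-in TBDB."""
    if not run_custom_db:
        # Assumption: a supplied custom_db is ignored unless the Boolean is true.
        return None
    # Custom mode: prefer the user's archive, else the combined TBDB + Varpipe one.
    return custom_db if custom_db is not None else COMBINED_DB


print(choose_database())                                    # built-in TBDB
print(choose_database(run_custom_db=True))                  # combined default
print(choose_database(True, "gs://my-bucket/my_db.tar.gz"))  # hypothetical user archive
```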

Bug Fixes

  • In the KMC task, the -n flag has been added to the echo command to avoid trailing newline characters.
  • An optional snippy_core_bed file input has been added to the Snippy_Tree workflow to enable site masking. This optional input is also exposed in the Snippy_Streamline workflow.
  • The memory input for quast has been adjusted to match the style guide in the TheiaEuk_Illumina_PE_PHB workflow.
  • The version_capture task now uses a Docker image hosted on Theiagen's Google Artifact Registry (GAR) instead of DockerHub; we also exposed docker as an optional input for this task.
  • The plasmidfinder output parsing was overambitious when removing duplicates and removed every instance of a duplicate, instead of just one. This has been resolved.
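
The plasmidfinder fix is the difference between discarding every value that appears more than once and keeping one copy of each. The sketch below contrasts the two behaviors; it is not the task's parsing code, and the function names and replicon values are hypothetical.

```python
from collections import Counter
from typing import List


def drop_all_duplicated(hits: List[str]) -> List[str]:
    """Behavior as described for the bug: duplicated values vanish entirely."""
    counts = Counter(hits)
    return [hit for hit in hits if counts[hit] == 1]


def keep_one_of_each(hits: List[str]) -> List[str]:
    """Intended behavior: keep the first occurrence of each value, in order."""
    return list(dict.fromkeys(hits))


hits = ["IncFIB", "IncFII", "IncFIB", "Col156"]  # hypothetical plasmid replicons
print(drop_all_duplicated(hits))
print(keep_one_of_each(hits))
```

With the first function a replicon reported twice disappears from the results altogether, which is the kind of data loss the fix addresses.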

What's Changed

Full Changelog: v1.1.0...v1.2.0

View our documentation here!

v1.1.0

30 Aug 20:17
87f1695

Public Health Bioinformatics v1.1.0 Release Notes

This minor release introduces two new workflows, changes the outputs for the ONT workflows, and resolves various bugs.

New workflows:

  • Terra_2_GISAID
    This workflow will submit concatenated metadata and assembly files to GISAID directly from Terra. The user must obtain a GISAID client-id before they can use this workflow.

  • Usher_PHB
    This workflow will place your samples onto the most up-to-date versions of the UCSC's UShER phylogenetic trees and return subtree(s) that the user can visualize.

Major output changes in TheiaCoV_ONT and TheiaProk_ONT workflows

We identified an issue when using cg_pipeline in our ONT workflows that led to inaccurate QC metrics. We have corrected this issue by deprecating the use of cg_pipeline in all ONT workflows. QC metrics are now calculated using nanoplot, which is a tool geared specifically for ONT data. In addition, since fastq-scan is now redundant in these workflows, it has been removed.

Also, the maximum read length in TheiaProk_ONT was previously set to 10,000 base pairs. We have increased this to 100,000 base pairs by default.

  • TheiaProk_ONT New Outputs
    The following columns are new.

    • nanoplot_num_reads_clean1
    • nanoplot_num_reads_raw1
    • nanoplot_r1_mean_q_clean
    • nanoplot_r1_mean_q_raw
    • nanoplot_r1_mean_readlength_clean
    • nanoplot_r1_mean_readlength_raw
    • nanoplot_tsv_clean
    • nanoplot_tsv_raw
    • nanoplot_version
    • nanoplot_docker
    • nanoplot_html_clean
    • nanoplot_html_raw

    The following variables are now generated using nanoplot:

    • est_coverage_raw
    • est_coverage_clean

    The following variables have been removed:

    • num_reads_clean1
    • num_reads_raw1
    • r1_mean_q_raw
    • r1_mean_readlength_raw
    • fastq_scan_version
  • TheiaCoV_ONT New Outputs
    The following columns are new.

    • nanoplot_tsv_clean
    • nanoplot_tsv_raw
    • nanoplot_version
    • nanoplot_docker
    • nanoplot_html_clean
    • nanoplot_html_raw
    • est_coverage_raw
    • est_coverage_clean
    • r1_mean_readlength_clean
    • r1_mean_readlength_raw
    • r1_mean_q_clean
    • r1_mean_q_raw

    The following variables are now generated using nanoplot:

    • num_reads_clean1
    • num_reads_raw1

    The following variables have been removed:

    • fastq_scan_version

Bug Fixes

  • Corrected an inaccurate file extension in the augur workflow.
  • Adjusted several files to meet the style guide.
  • Adjusted the default value for the core_genome input in Snippy_Tree to be true.
  • Fixed a bug in the summarize_data task.
  • Fixed a bug and added new outputs in the SRA_Fetch workflow.
  • Enabled the skipping of extra header columns in the Concatenate_Column_Content workflow.
  • Added the .gfa file from Dragonflye as output.
  • Updated default docker images and dataset tags for the Pangolin and Nextclade tasks.
  • Updated the GAMBIT database to v1.1.0.
  • The GAMBIT docker image has been updated to use the latest GAMBIT version.
  • Fixed a bug in file name parsing in the Lyve_Set_PHB workflow.
  • Skipped the genome size estimation in the read_screen task for all ONT workflows.

What's Changed

Full Changelog: v1.0.1...v1.1.0