TheiaMeta: Viral Metagenomics workflow #64

Merged (87 commits), Sep 20, 2023

@cimendes (Member) commented May 22, 2023

Closes #110

Setting as draft as development is still underway.

🛠️ Changes Being Made

This PR features a new workflow, TheiaMeta_Illumina_PE, for the assembly of viral metagenomic data.
The diagram of the workflow is available below:

[Figure: Viral Metagenomic workflow - TheiaMeta_Illumina_PE]

🧠 Context and Rationale

📋 Workflow/Task Steps

Please see the diagram above.

This workflow takes Illumina paired-end (PE) data and performs:

  • taxonomic assignment of the raw read data (kraken2)
  • host removal (human reads, NCBI human scrubber)
  • quality trimming (trimmomatic)
  • metagenomic genome assembly (metaspades + pilon)
  • if a reference file is provided: contig capture to the reference (minimap2)
  • retrieval of assembled and unassembled reads (samtools)
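The conditional step in the list above can be sketched in a few lines. This is only an illustration of the control flow as described here, not the workflow's WDL: the function name `planned_steps` is hypothetical, the order simply follows the list as written, and whether the samtools step also depends on the reference is not stated, so it is treated as unconditional.

```python
from typing import List, Optional


def planned_steps(reference: Optional[str] = None) -> List[str]:
    """Return the tool labels from the step list, in the order listed.

    Hypothetical helper for illustration only; the real execution order
    inside the WDL workflow may differ.
    """
    steps = [
        "kraken2",              # taxonomic assignment of the raw reads
        "ncbi human scrubber",  # host (human) read removal
        "trimmomatic",          # quality trimming
        "metaspades + pilon",   # metagenomic assembly and polishing
    ]
    if reference is not None:
        # Contig capture only happens when a reference file is supplied.
        steps.append("minimap2")
    # Assumed unconditional: the description does not say otherwise.
    steps.append("samtools")
    return steps


if __name__ == "__main__":
    print(planned_steps())             # no reference: minimap2 is skipped
    print(planned_steps("HAV.fasta"))  # with reference: minimap2 is included
```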

The following quality metrics are computed:

  • assembly_length
  • assembly_mean_coverage
  • kraken2_percent_human_raw and kraken2_percent_human_clean
  • percent_coverage
  • raw and clean read counts (number of sequences)
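As a rough idea of where a human-percentage metric can come from, the sketch below reads the percentage column for Homo sapiens (NCBI taxid 9606) out of a standard six-column, tab-separated Kraken2 report. How the workflow's own task extracts this value is not shown in this PR description, so the function `percent_human` and the toy report are assumptions for illustration only.

```python
def percent_human(report_text: str, human_taxid: str = "9606") -> float:
    """Return the clade percentage reported for the human taxid.

    Assumes the standard Kraken2 report layout: percentage, clade reads,
    directly assigned reads, rank code, NCBI taxid, scientific name.
    Returns 0.0 when the taxid does not appear in the report.
    """
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 6 and fields[4].strip() == human_taxid:
            return float(fields[0])
    return 0.0


# Toy report with made-up numbers, not taken from the test run in this PR.
toy_report = "\n".join(
    [
        " 80.00\t800\t800\tU\t0\tunclassified",
        " 20.00\t200\t0\tR\t1\troot",
        " 12.50\t125\t125\tS\t9606\t          Homo sapiens",
    ]
)

if __name__ == "__main__":
    print(percent_human(toy_report))
```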

Inputs

Mandatory inputs:

  • read1
  • read2
  • samplename

Optional inputs:

  • reference
  • trimming/qc thresholds
  • memory, docker, CPU and disk size for all tasks
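For runs driven by an inputs file rather than command-line arguments, the mandatory and optional inputs above could be assembled as follows. This is a sketch: the helper `build_inputs` is hypothetical, the workflow-name prefix `theiameta_illumina_pe` is taken from the output keys in this PR, and the file paths in the example are placeholders.

```python
import json
from typing import Dict, Optional


def build_inputs(
    read1: str,
    read2: str,
    samplename: str,
    reference: Optional[str] = None,
) -> Dict[str, str]:
    """Build an inputs mapping keyed as <workflow name>.<input name>."""
    prefix = "theiameta_illumina_pe"
    inputs = {
        f"{prefix}.read1": read1,
        f"{prefix}.read2": read2,
        f"{prefix}.samplename": samplename,
    }
    if reference is not None:
        # Optional input: omit the key entirely when no reference is given.
        inputs[f"{prefix}.reference"] = reference
    return inputs


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    example = build_inputs(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample", "ref.fasta"
    )
    print(json.dumps(example, indent=2))
```

Leaving the `reference` key out, rather than setting it to an empty string, keeps the optional input unset.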

Outputs

{
    "theiameta_illumina_pe.assembly_fasta": "${this.assembly_fasta}",
    "theiameta_illumina_pe.assembly_length": "${this.assembly_length}",
    "theiameta_illumina_pe.assembly_mean_coverage": "${this.assembly_mean_coverage}",
    "theiameta_illumina_pe.average_read_length": "${this.average_read_length}",
    "theiameta_illumina_pe.bbduk_docker": "${this.bbduk_docker}",
    "theiameta_illumina_pe.bedtools_docker": "${this.bedtools_docker}",
    "theiameta_illumina_pe.bedtools_version": "${this.bedtools_version}",
    "theiameta_illumina_pe.contig_number": "${this.contig_number}",
    "theiameta_illumina_pe.fastq_scan_docker": "${this.fastq_scan_docker}",
    "theiameta_illumina_pe.fastq_scan_version": "${this.fastq_scan_version}",
    "theiameta_illumina_pe.kraken2_docker": "${this.kraken2_docker}",
    "theiameta_illumina_pe.kraken2_percent_human_raw": "${this.kraken2_percent_human_raw}",
    "theiameta_illumina_pe.kraken2_percent_human_clean": "${this.kraken2_percent_human_clean}",
    "theiameta_illumina_pe.kraken2_report_raw": "${this.kraken2_report_raw}",
    "theiameta_illumina_pe.kraken2_report_clean": "${this.kraken2_report_clean}",
    "theiameta_illumina_pe.kraken2_version": "${this.kraken2_version}",
    "theiameta_illumina_pe.largest_contig": "${this.largest_contig}",
    "theiameta_illumina_pe.metaspades_docker": "${this.metaspades_docker}",
    "theiameta_illumina_pe.metaspades_version": "${this.metaspades_version}",
    "theiameta_illumina_pe.minimap2_docker": "${this.minimap2_docker}",
    "theiameta_illumina_pe.minimap2_version": "${this.minimap2_version}",
    "theiameta_illumina_pe.ncbi_scrub_docker": "${this.ncbi_scrub_docker}",
    "theiameta_illumina_pe.num_reads_clean1": "${this.num_reads_clean1}",
    "theiameta_illumina_pe.num_reads_clean2": "${this.num_reads_clean2}",
    "theiameta_illumina_pe.num_reads_clean_pairs": "${this.num_reads_clean_pairs}",
    "theiameta_illumina_pe.num_reads_raw1": "${this.num_reads_raw1}",
    "theiameta_illumina_pe.num_reads_raw2": "${this.num_reads_raw2}",
    "theiameta_illumina_pe.num_reads_raw_pairs": "${this.num_reads_raw_pairs}",
    "theiameta_illumina_pe.percent_coverage": "${this.percent_coverage}",
    "theiameta_illumina_pe.pilon_docker": "${this.pilon_docker}",
    "theiameta_illumina_pe.pilon_version": "${this.pilon_version}",
    "theiameta_illumina_pe.quast_docker": "${this.quast_docker}",
    "theiameta_illumina_pe.quast_version": "${this.quast_version}",
    "theiameta_illumina_pe.read1_clean": "${this.read1_clean}",
    "theiameta_illumina_pe.read1_dehosted": "${this.read1_dehosted}",
    "theiameta_illumina_pe.read1_mapped": "${this.read1_mapped}",
    "theiameta_illumina_pe.read1_unmapped": "${this.read1_unmapped}",
    "theiameta_illumina_pe.read2_clean": "${this.read2_clean}",
    "theiameta_illumina_pe.read2_dehosted": "${this.read2_dehosted}",
    "theiameta_illumina_pe.read2_mapped": "${this.read2_mapped}",
    "theiameta_illumina_pe.read2_unmapped": "${this.read2_unmapped}",
    "theiameta_illumina_pe.samtools_docker": "${this.samtools_docker}",
    "theiameta_illumina_pe.samtools_version": "${this.samtools_version}",
    "theiameta_illumina_pe.theiameta_illumina_pe_analysis_date": "${this.theiameta_illumina_pe_analysis_date}",
    "theiameta_illumina_pe.theiameta_illumina_pe_version": "${this.theiameta_illumina_pe_version}",
    "theiameta_illumina_pe.trimmomatic_docker": "${this.trimmomatic_docker}",
    "theiameta_illumina_pe.trimmomatic_version": "${this.trimmomatic_version}"
}

🧪 Testing

Locally

Tests passed locally on an HAV blood sample at commit id 2d1a69b:
miniwdl run -v /home/ines_mendes/Git/public_health_bioinformatics/workflows/metagenomics/wf_theiameta_illumina_pe.wdl read1= ~/Test/HAV_Metagenomics/HAV0024_S8_L001_R1_001.fastq.gz read2= ~/Test/HAV_Metagenomics/HAV0024_S8_L001_R2_001.fastq.gz samplename=HAV0024 reference= ~/Test/HAV_Metagenomics/HAV.fasta

2023-07-14 08:45:50.664 wdl.w:theiameta_illumina_pe done
{
  "dir": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe",
  "outputs": {
    "theiameta_illumina_pe.assembly_fasta": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/assembly_fasta/HAV0024.fasta",
    "theiameta_illumina_pe.assembly_length": 7499,
    "theiameta_illumina_pe.assembly_mean_coverage": 2436.59,
    "theiameta_illumina_pe.average_read_length": 156.4,
    "theiameta_illumina_pe.bbduk_docker": "quay.io/staphb/bbtools:38.76",
    "theiameta_illumina_pe.bedtools_docker": "quay.io/staphb/bedtools:2.31.0",
    "theiameta_illumina_pe.bedtools_version": "v2.31.0",
    "theiameta_illumina_pe.contig_number": 1,
    "theiameta_illumina_pe.fastq_scan_docker": "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1",
    "theiameta_illumina_pe.fastq_scan_version": "fastq-scan 0.4.4",
    "theiameta_illumina_pe.kraken2_docker": "quay.io/staphb/kraken2:2.1.2-no-db",
    "theiameta_illumina_pe.kraken2_percent_human_clean": 6.13,
    "theiameta_illumina_pe.kraken2_percent_human_raw": 69.25,
    "theiameta_illumina_pe.kraken2_report_clean": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/kraken2_report_clean/HAV0024.report.txt",
    "theiameta_illumina_pe.kraken2_report_raw": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/kraken2_report_raw/HAV0024.report.txt",
    "theiameta_illumina_pe.kraken2_version": "2.1.2",
    "theiameta_illumina_pe.largest_contig": 7499,
    "theiameta_illumina_pe.metaspades_docker": "quay.io/biocontainers/spades:3.12.0--h9ee0642_3",
    "theiameta_illumina_pe.metaspades_version": "v3.12.0",
    "theiameta_illumina_pe.minimap2_docker": "staphb/minimap2:2.22",
    "theiameta_illumina_pe.minimap2_version": "2.22-r1101",
    "theiameta_illumina_pe.ncbi_scrub_docker": "us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.1.0",
    "theiameta_illumina_pe.num_reads_clean1": 194976,
    "theiameta_illumina_pe.num_reads_clean2": 194976,
    "theiameta_illumina_pe.num_reads_clean_pairs": "194976",
    "theiameta_illumina_pe.num_reads_raw1": 1607300,
    "theiameta_illumina_pe.num_reads_raw2": 1607300,
    "theiameta_illumina_pe.num_reads_raw_pairs": "1607300",
    "theiameta_illumina_pe.percent_coverage": 99.08,
    "theiameta_illumina_pe.pilon_docker": "quay.io/biocontainers/pilon:1.24--hdfd78af_0",
    "theiameta_illumina_pe.pilon_version": "1.24",
    "theiameta_illumina_pe.quast_docker": "quay.io/staphb/quast:5.0.2",
    "theiameta_illumina_pe.quast_version": "QUAST v5.0.2",
    "theiameta_illumina_pe.read1_clean": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read1_clean/HAV0024_1.clean.fastq.gz",
    "theiameta_illumina_pe.read1_dehosted": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read1_dehosted/HAV0024_R1_dehosted.fastq.gz",
    "theiameta_illumina_pe.read1_mapped": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read1_mapped/assembled_HAV0024_1.fq.gz",
    "theiameta_illumina_pe.read1_unmapped": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read1_unmapped/unassembled_HAV0024_1.fq.gz",
    "theiameta_illumina_pe.read2_clean": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read2_clean/HAV0024_2.clean.fastq.gz",
    "theiameta_illumina_pe.read2_dehosted": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read2_dehosted/HAV0024_R2_dehosted.fastq.gz",
    "theiameta_illumina_pe.read2_mapped": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read2_mapped/assembled_HAV0024_2.fq.gz",
    "theiameta_illumina_pe.read2_unmapped": "/home/ines_mendes/WDL/20230714_082240_theiameta_illumina_pe/out/read2_unmapped/unassembled_HAV0024_2.fq.gz",
    "theiameta_illumina_pe.samtools_docker": "quay.io/staphb/samtools:1.17",
    "theiameta_illumina_pe.samtools_version": "1.17",
    "theiameta_illumina_pe.theiameta_illumina_pe_analysis_date": "2023-07-14",
    "theiameta_illumina_pe.theiameta_illumina_pe_version": "PHB v1.0.0",
    "theiameta_illumina_pe.trimmomatic_docker": "quay.io/staphb/trimmomatic:0.39",
    "theiameta_illumina_pe.trimmomatic_version": "Trimmomatic 0.39"
  }
}
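The values reported above lend themselves to a few quick consistency checks. The numbers in the sketch below are copied from that output; the checks themselves (matching read counts, a drop in human percentage, a single contig equal to the assembly length) are assumptions about what a sensible run looks like, not tests that ship with this PR. Note that the `*_pairs` values are reported as strings, so they are converted before comparison.

```python
# Values copied from the local miniwdl run reported above (key prefix dropped).
outputs = {
    "num_reads_raw1": 1607300,
    "num_reads_raw2": 1607300,
    "num_reads_raw_pairs": "1607300",
    "num_reads_clean1": 194976,
    "num_reads_clean2": 194976,
    "num_reads_clean_pairs": "194976",
    "kraken2_percent_human_raw": 69.25,
    "kraken2_percent_human_clean": 6.13,
    "assembly_length": 7499,
    "largest_contig": 7499,
    "contig_number": 1,
}


def summarize(out: dict) -> dict:
    """Derive a few summary figures and run basic consistency checks."""
    # Forward, reverse, and pair counts should agree for paired-end data.
    for stage in ("raw", "clean"):
        r1 = out[f"num_reads_{stage}1"]
        r2 = out[f"num_reads_{stage}2"]
        pairs = int(out[f"num_reads_{stage}_pairs"])  # reported as a string
        if not (r1 == r2 == pairs):
            raise ValueError(f"{stage} read counts disagree")
    # With a single contig, the largest contig is the whole assembly.
    if out["contig_number"] == 1 and out["largest_contig"] != out["assembly_length"]:
        raise ValueError("single contig should span the assembly length")
    return {
        "percent_reads_retained": round(
            100 * out["num_reads_clean1"] / out["num_reads_raw1"], 2
        ),
        "human_percent_drop": round(
            out["kraken2_percent_human_raw"] - out["kraken2_percent_human_clean"], 2
        ),
    }


if __name__ == "__main__":
    print(summarize(outputs))
```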

Terra

commit id 2d1a69b

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

cimendes and others added 30 commits April 21, 2023 13:43
…form assembly with ivar, otherwise use shovil with megahit assembler
…d return the one with the highest base count
…and consensus assembly are generated concurrently, and the final contig is selected based on final assembly length (consensus len or total basepairs in aligned de novo contigs)
…apping. allow for the output of sam file format instead of the default PAF file.
…hese reads are now available under `read1_unmapped` and `read2_unmapped`
* added helpful comments and changed read concatenation block to use samplenames that have hyphens instead of underscores. ran successfully with miniwdl, terra testing next

* add frame work  for GHA, working theiaprok_illumina_pe workflow

* add placeholder filters, update filter check

* avoid processing empty list

* debug filtering

* debug filtering

* debug filtering

* debug filtering

* going  back to original filter

* add filter in single file

* working prototype of updated qc_check

* add tests for theiaprok_illumina_se, fix missing slash in shovill se task

* add all theiaprok inputs to qc check task

* remove ani species match from inputs, not used

* add qc check the theiaprok_se

* change workflows to use qc_check_phb task

* fix task name

* parse busco results

* add qc_check_phb to theiaeuk

* put busco results variable in quotes error

* add qc check to theiaprok fasta and ont

* update md5sums

* add qc check to theiacov wfs

* fix more md5sums

* fix output file name

* PHBG v1.3.0 changes - vibrio subworkflow

* update description

* expose min freq input for consensus and variant tasks

* fix variable names

* fix variable types

* fix kraken empty string error

* wdl doesn't have else

* avoid empty string outputs

* variable read as empty if equal to zero, enclose in quotes

* update workflows to remove optional kraken outputs

* update nor reproducible md5s

* fix ci and local not matching

* add quast to theiaprok fasta

* add gc percent to theiaeuk outputs

* add min_freq inputs to theiacov_illumina_se

* add gc percent to theiaprok and theiaeuk qc check

* add num reads to qc check theiaprok theiaeuk

* recursion for assembly length check creates bug so removed

* fix typo in pytest_filter.yml

* add qc_check_phb task check to gha

* update gha md5sums and qc check checks

* typo corrected and fixed spacing on optional input

* updated sra_fetch workflow to use fastq-dl v2.0.1. also exposed optional inputs for docker, disk_size, memory, cpus. tested fine with miniwdl

* fix error on theiaprok

* update checksums

---------

Co-authored-by: kapsakcj <[email protected]>
Co-authored-by: Robert A. Petit III <[email protected]>
Co-authored-by: Sage Wright <[email protected]>
Co-authored-by: Michelle Scribner <[email protected]>
Co-authored-by: kevinlibuit <[email protected]>
Co-authored-by: kevinlibuit <[email protected]>
@cimendes (Member, Author) commented Aug 7, 2023

TODO:

  • Add assembled_reads_percentage to the list of statistics
  • Gate extra output files and metrics, such as assembled_reads and unassembled_reads, behind an optional argument

…ed and unmapped read files, as well as some assembly statistics regarding those files
@andrewjpage andrewjpage force-pushed the im-metagenomics-workflow branch from a831131 to 99e9496 Compare August 17, 2023 10:20
@jrotieno (Contributor) left a comment

Runs great, well done @cimendes

@jrotieno jrotieno merged commit d9b4b6e into main Sep 20, 2023
30 checks passed
@cimendes cimendes deleted the im-metagenomics-workflow branch September 20, 2023 15:57
Successfully merging this pull request may close these issues.

Create Viral Metagenomics workflow
5 participants