TheiaMeta: Viral Metagenomics workflow #64

Merged
merged 87 commits into from
Sep 20, 2023
87 commits
99c95fa
first step of theiameta workflow - QC of raw reads with read scrubber…
cimendes Apr 21, 2023
508c7d9
add tasks to retrieve reads that map to a reference genome
cimendes Apr 21, 2023
c7f0763
add retrieve mapped reads and shovil task
cimendes Apr 21, 2023
866adc8
save unmapped reads
cimendes Apr 24, 2023
6d79da4
alter approach for consensus assembly: if a reference is provided per…
cimendes Apr 24, 2023
1e19c5b
fix bug: make reference input optional
cimendes Apr 24, 2023
b2159ea
update description
cimendes Apr 24, 2023
f7807f6
add theiameta to dockstore
cimendes Apr 26, 2023
604827f
add assembly_len metric
cimendes Apr 26, 2023
2ce9ec2
add largest contig to quast metric output
cimendes Apr 26, 2023
edff82c
fix typo
cimendes May 8, 2023
807f17c
add minimpa2 task
cimendes May 9, 2023
7d7a7e5
update minimap2 description
cimendes May 9, 2023
5ca7e0c
adjust presets
cimendes May 9, 2023
88778f8
update comments
cimendes May 9, 2023
2a7d961
add task to parse paf files and return mappend contigs in a FASTA file
cimendes May 9, 2023
d9cd18e
fix typo
cimendes May 9, 2023
fdea0cf
add task to compare two assemblies, one consensus and one de novo, an…
cimendes May 9, 2023
dee93e2
Add two pronged approach to reference based assembly branch: De novo …
cimendes May 9, 2023
a6c911a
fix miniwdl error with optional input
cimendes May 9, 2023
430484e
add option to have a second query file to allow for paired-end read m…
cimendes May 16, 2023
ceb7f8d
add new tasks for the retrieval of unmapped reads from an assembly. T…
cimendes May 16, 2023
0791cec
simplify retrieve_unaligned_pe_reads_sam to return the fastq.gz files…
cimendes May 17, 2023
796c378
Update metagenomics branch with main (#65)
cimendes May 22, 2023
b927fe3
add -k option to ivar_consensus
cimendes May 23, 2023
7bc5bcc
add -k option to ivar_consensus (#69)
cimendes May 23, 2023
c6b0ca0
undo -k param
cimendes May 23, 2023
36e221c
Update with main (#71)
cimendes May 23, 2023
08d613a
make ivar functional after merge update
cimendes May 23, 2023
c926bb0
Merge branch 'im-metagenomics-dev' into im-metagenomics-workflow
cimendes May 23, 2023
fd0c2b9
TheiaMeta: Simplify metagenomics workflow (#75)
cimendes Jun 28, 2023
a41f0fb
Merge branch 'main' into im-metagenomics-workflow
cimendes Jun 29, 2023
f1c7c19
change retrieve_unaligned_pe_reads_sam to retrieve_pe_reads_sam to a…
cimendes Jul 4, 2023
45f58ea
calculate percent coverage when performing reference-based assembly
cimendes Jul 5, 2023
fe31c4b
change docker containers to quay.io; remove bc dependency from calcul…
cimendes Jul 5, 2023
4904b3a
update CI to create extra storage space
kapsakcj Jul 5, 2023
fcb8b1e
update CI theiaprok_illumina_se
kapsakcj Jul 5, 2023
d62841c
update CI md5sum
cimendes Jul 5, 2023
32cb63b
remove unused task
cimendes Jul 5, 2023
4062a63
update CI for theiacov_ont miniwdl test
kapsakcj Jul 5, 2023
a49f683
New metadata output on summarize_data task for phylogenetic workflows…
cimendes Jul 6, 2023
571618e
Add default value for kraken_human, add average_read_length metric to…
cimendes Jul 6, 2023
50b311b
remove unused inputs
cimendes Jul 6, 2023
90f6055
change type to String
cimendes Jul 6, 2023
d3dfe8d
remove kraken2_standalone from wf_read_QC_trim_pe subworkflow and add…
cimendes Jul 6, 2023
06a0639
update md5sum
cimendes Jul 6, 2023
35decea
Merge branch 'main' into im-metagenomics-workflow
cimendes Jul 6, 2023
1b06905
exclude wf_theiacov_illumina_pe and wf_theiacov_illumina_se miniwdl t…
cimendes Jul 7, 2023
dd05045
add optional arguments to be passed to megahit
cimendes Jul 7, 2023
e46a039
rename file
cimendes Jul 7, 2023
9b4963f
rename file
cimendes Jul 7, 2023
1d9b75f
change seqkit repository to dockerhub to avoid error on Terra.bio
cimendes Jul 7, 2023
d5c4f6e
update md5sum
cimendes Jul 7, 2023
28992e8
remove output
cimendes Jul 7, 2023
7d5f76e
change metgahit assembler to metaspades (more conservative, higher qu…
cimendes Jul 10, 2023
35a845a
expose metaspades_ops on metaspades process, not on main workflow opt…
cimendes Jul 11, 2023
826bc7c
output docker and version (when possible) of called tools; organize o…
cimendes Jul 11, 2023
14f58cd
expose optional inputs on task process, not on main workflow process
cimendes Jul 11, 2023
71a05c1
fix bug
cimendes Jul 11, 2023
26eba69
expose metaspades optional params
cimendes Jul 11, 2023
b2ef7c7
update ncbi_scrub task, Add kraken2 to pre and post processed reads
cimendes Jul 13, 2023
e092647
update tests
rpetit3 Jul 14, 2023
dccfbb7
fastq-scan use read_int
rpetit3 Jul 14, 2023
03b4572
try alternative read count capture
rpetit3 Jul 14, 2023
2d1a69b
use contains for fastq_Scan test
rpetit3 Jul 14, 2023
7c32661
close Retries not enabled for Pilon Task (TheiaMeta) #136
cimendes Aug 2, 2023
1c791f8
close #136
cimendes Aug 2, 2023
cc88b65
remove unused task
cimendes Aug 2, 2023
0506007
Merge branch 'main' into im-metagenomics-workflow
cimendes Aug 2, 2023
83cf809
add phred_offset option to metaspades (33 as default). Update CI
cimendes Aug 2, 2023
eaf726e
fix typo
cimendes Aug 2, 2023
362564d
update container to use google registry
cimendes Aug 3, 2023
4b89705
update metaspades docker
cimendes Aug 3, 2023
afb9362
add assembled_reads_percent task
cimendes Aug 9, 2023
99e9496
add output_additional_files conditional to control the output of mapp…
cimendes Aug 9, 2023
50043a8
Merge branch 'main' of github.com:theiagen/public_health_bioinformati…
andrewjpage Aug 17, 2023
e852484
Update kraken2_db
cimendes Aug 31, 2023
911b832
Merge branch 'main' into im-metagenomics-workflow
cimendes Sep 1, 2023
c6c2a15
trying to hide optional inputs in Terra
cimendes Sep 13, 2023
87f9afb
change argument from "memory" to "mem" for consistency
cimendes Sep 13, 2023
c204907
update memory to mem as per style guide
cimendes Sep 13, 2023
6322307
hide optional inputs in Terra
cimendes Sep 13, 2023
3844825
update CI
cimendes Sep 13, 2023
faf939c
Merge branch 'main' into im-metagenomics-workflow
cimendes Sep 20, 2023
e58242c
theiaprok - update CI md5sum
cimendes Sep 20, 2023
06aa0da
revert estimated genome size from int to string to make theiacov_ont …
cimendes Sep 20, 2023
eae3084
fix typo
cimendes Sep 20, 2023
5 changes: 5 additions & 0 deletions .dockstore.yml
@@ -205,6 +205,11 @@ workflows:
     primaryDescriptorPath: /workflows/phylogenetics/wf_lyveset.wdl
     testParameterFiles:
       - empty.json
+  - name: TheiaMeta_Illumina_PE_PHB
+    subclass: WDL
+    primaryDescriptorPath: /workflows/metagenomics/wf_theiameta_illumina_pe.wdl
+    testParameterFiles:
+      - empty.json
   - name: Snippy_Streamline_PHB
     subclass: WDL
     primaryDescriptorPath: /workflows/phylogenetics/wf_snippy_streamline.wdl
5 changes: 5 additions & 0 deletions .github/workflows/pytest-workflows.yml
@@ -38,6 +38,11 @@ jobs:
         # For every workflow, test it with MiniWDL and Cromwell
         tag: ["${{ fromJson(needs.changes.outputs.workflows) }}"]
         engine: ["miniwdl", "cromwell"]
+        exclude:
+          - tag: "wf_theiacov_illumina_pe"
+            engine: "miniwdl"
+          - tag: "wf_theiacov_illumina_se"
+            engine: "miniwdl"
     defaults:
       run:
         # Play nicely with miniconda
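The `exclude` entries in the CI matrix drop specific tag/engine pairs from the full cross product. A minimal Python sketch of that filtering, assuming the usual GitHub Actions rule that a job is removed when every key in an exclude entry matches it. The function `expand_matrix` and the tag list are hypothetical; in the real workflow the tags are computed from the changed files.

```python
from itertools import product


def expand_matrix(matrix, exclude):
    """Return the job combinations that remain after applying exclude rules."""
    keys = list(matrix)
    jobs = [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]
    # Drop a job when every key/value pair of some exclude entry matches it.
    return [
        job
        for job in jobs
        if not any(all(job.get(k) == v for k, v in rule.items()) for rule in exclude)
    ]


# Hypothetical tag list, for illustration only.
matrix = {
    "tag": ["wf_theiacov_illumina_pe", "wf_theiameta_illumina_pe"],
    "engine": ["miniwdl", "cromwell"],
}
exclude = [
    {"tag": "wf_theiacov_illumina_pe", "engine": "miniwdl"},
    {"tag": "wf_theiacov_illumina_se", "engine": "miniwdl"},
]
jobs = expand_matrix(matrix, exclude)
```

An exclude entry that matches no generated job (here the `wf_theiacov_illumina_se` rule) is simply a no-op.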
53 changes: 53 additions & 0 deletions tasks/alignment/task_minimap2.wdl
@@ -0,0 +1,53 @@
+version 1.0
+
+task minimap2 {
+  meta {
+    description: "Align a query genome to a reference with minimap2"
+  }
+  input {
+    File query1
+    File? query2
+    File reference
+    String samplename
+    String docker = "us-docker.pkg.dev/general-theiagen/staphb/minimap2:2.22" # newer versions seem to be bugged (infinite loop)
+    String mode = "asm20"
+    Boolean output_sam = false
+    Int disk_size = 100
+    Int cpu = 2
+    Int mem = 8
+  }
+  command <<<
+    # Preset options - https://lh3.github.io/minimap2/minimap2.html
+    # Version capture
+    minimap2 --version | tee VERSION
+
+    if [ -z "~{query2}" ] ; then
+      INPUT_QUERY="~{query1}"
+    else
+      INPUT_QUERY="~{query1} ~{query2}"
+    fi
+
+    # Run minimap2 - output can be sam or paf file depending on ~{output_sam}
+    minimap2 \
+      ~{true="-a" false="" output_sam} \
+      -x "~{mode}" \
+      -t "~{cpu}" \
+      "~{reference}" \
+      ${INPUT_QUERY} > "~{samplename}"_minimap2.out
+
+  >>>
+  output {
+    File minimap2_out = "~{samplename}_minimap2.out"
+    String minimap2_version = read_string("VERSION")
+    String minimap2_docker = "~{docker}"
+  }
+  runtime {
+    docker: "~{docker}"
+    memory: mem + " GB"
+    cpu: cpu
+    disks: "local-disk " + disk_size + " SSD"
+    disk: disk_size + " GB"
+    maxRetries: 3
+    preemptible: 0
+  }
+}
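The command block in `task_minimap2.wdl` switches between PAF and SAM output and between one and two query files. A minimal Python sketch of the same argument assembly may make the branching easier to follow. The function `build_minimap2_cmd` and the file names are hypothetical; only the `-a`, `-x` and `-t` flags and the `asm20` default come from the task above, and `sr` is minimap2's short-read preset.

```python
def build_minimap2_cmd(query1, reference, query2=None, mode="asm20", cpu=2, output_sam=False):
    """Assemble a minimap2 argument list mirroring the task's command block."""
    cmd = ["minimap2"]
    if output_sam:
        # -a switches the output from PAF (the default) to SAM.
        cmd.append("-a")
    cmd += ["-x", mode, "-t", str(cpu), reference, query1]
    if query2:
        # A second query file is only appended for paired-end reads.
        cmd.append(query2)
    return cmd


# Hypothetical inputs: assembly-to-reference (PAF) and paired-end reads (SAM).
paf_cmd = build_minimap2_cmd("contigs.fasta", "ref.fasta")
sam_cmd = build_minimap2_cmd(
    "R1.fastq.gz", "ref.fasta", query2="R2.fastq.gz", mode="sr", output_sam=True
)
```

Building a list rather than a single string sidesteps the word-splitting that the unquoted `${INPUT_QUERY}` relies on in the shell version.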
4 changes: 3 additions & 1 deletion tasks/assembly/task_ivar_consensus.wdl
@@ -13,6 +13,7 @@ task consensus {
     Float? consensus_min_freq
     Int? consensus_min_depth
     String char_unknown = "N"
+    Boolean skip_N = false
     Int disk_size = 100
   }
   command <<<
@@ -44,7 +45,8 @@ task consensus {
     -q ~{min_qual} \
     -t ~{consensus_min_freq} \
     -m ~{consensus_min_depth} \
-    -n ~{char_unknown}
+    -n ~{char_unknown} \
+    ~{true = "-k" false = "" skip_N}
 
     # clean up fasta header
     echo ">~{samplename}" > ~{samplename}.ivar.consensus.fasta
46 changes: 46 additions & 0 deletions tasks/assembly/task_metaspades.wdl
@@ -0,0 +1,46 @@
+version 1.0
+
+task metaspades_pe {
+  input {
+    File read1_cleaned
+    File read2_cleaned
+    String samplename
+    String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/spades:3.12.0--h9ee0642_3"
+    Int disk_size = 100
+    Int cpu = 4
+    Int memory = 16
+    String? kmers
+    String? metaspades_opts
+    Int phred_offset = 33
+  }
+  command <<<
+    metaspades.py --version | head -1 | cut -d ' ' -f 2 | tee VERSION
+    metaspades.py \
+      -1 ~{read1_cleaned} \
+      -2 ~{read2_cleaned} \
+      ~{'-k ' + kmers} \
+      -m ~{memory} \
+      -t ~{cpu} \
+      -o metaspades \
+      --phred-offset ~{phred_offset} \
+      ~{metaspades_opts}
+
+    mv metaspades/contigs.fasta ~{samplename}_contigs.fasta
+
+  >>>
+  output {
+    File assembly_fasta = "~{samplename}_contigs.fasta"
+    String metaspades_version = read_string("VERSION")
+    String metaspades_docker = '~{docker}'
+  }
+  runtime {
+    docker: "~{docker}"
+    memory: "~{memory} GB"
+    cpu: "~{cpu}"
+    disks: "local-disk " + disk_size + " SSD"
+    disk: disk_size + " GB"
+    maxRetries: 3
+    preemptible: 0
+  }
+}
+
21 changes: 14 additions & 7 deletions tasks/quality_control/task_fastq_scan.wdl
@@ -7,6 +7,7 @@ task fastq_scan_pe {
     String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
     String read2_name = basename(basename(basename(read2, ".gz"), ".fastq"), ".fq")
     Int disk_size = 100
+    String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
   }
   command <<<
     # capture date and version
@@ -21,9 +22,11 @@ task fastq_scan_pe {
     fi

     # capture forward read stats
-    eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json >(jq .qc_stats.read_total > READ1_SEQS)
+    eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
+    cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
     read1_seqs=$(cat READ1_SEQS)
-    eval "${cat_reads} ~{read2}" | fastq-scan | tee ~{read2_name}_fastq-scan.json >(jq .qc_stats.read_total > READ2_SEQS)
+    eval "${cat_reads} ~{read2}" | fastq-scan | tee ~{read2_name}_fastq-scan.json
+    cat ~{read2_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ2_SEQS
     read2_seqs=$(cat READ2_SEQS)

     # capture number of read pairs
@@ -38,11 +41,12 @@ task fastq_scan_pe {
   output {
     File read1_fastq_scan_report = "~{read1_name}_fastq-scan.json"
     File read2_fastq_scan_report = "~{read2_name}_fastq-scan.json"
-    Int read1_seq = read_string("READ1_SEQS")
-    Int read2_seq = read_string("READ2_SEQS")
+    Int read1_seq = read_int("READ1_SEQS")
+    Int read2_seq = read_int("READ2_SEQS")
     String read_pairs = read_string("READ_PAIRS")
     String version = read_string("VERSION")
     String pipeline_date = read_string("DATE")
+    String fastq_scan_docker = docker
   }
   runtime {
     docker: "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
@@ -60,6 +64,7 @@ task fastq_scan_se {
     File read1
     String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
     Int disk_size = 100
+    String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
   }
   command <<<
     # capture date and version
@@ -74,13 +79,15 @@ task fastq_scan_se {
     fi

     # capture forward read stats
-    eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json >(jq .qc_stats.read_total > READ1_SEQS)
+    eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
+    cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
   >>>
   output {
     File fastq_scan_report = "~{read1_name}_fastq-scan.json"
-    Int read1_seq = read_string("READ1_SEQS")
+    Int read1_seq = read_int("READ1_SEQS")
     String version = read_string("VERSION")
     String pipeline_date = read_string("DATE")
+    String fastq_scan_docker = docker
   }
   runtime {
     docker: "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
@@ -91,4 +98,4 @@ task fastq_scan_se {
     preemptible: 0
     maxRetries: 3
   }
-}
+}
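The revised command writes the fastq-scan report to disk first and then extracts `.qc_stats.read_total` with `jq`, so the count is no longer captured through a process substitution that may still be running when the file is read. The same extraction can be sketched in Python. The function `read_total` and the sample report are hypothetical; a real fastq-scan report contains many more fields than the single one shown here.

```python
import json


def read_total(report_text):
    """Return qc_stats.read_total from a fastq-scan JSON report as an int."""
    report = json.loads(report_text)
    return int(report["qc_stats"]["read_total"])


# Hypothetical, heavily trimmed report for illustration.
sample_report = '{"qc_stats": {"read_total": 1250}}'
read1_seqs = read_total(sample_report)
```

Converting to `int` at this point corresponds to the switch from `read_string` to `read_int` in the task outputs.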
9 changes: 5 additions & 4 deletions tasks/quality_control/task_ncbi_scrub.wdl
@@ -24,7 +24,7 @@ task ncbi_scrub_pe {
     fi

     # dehost reads
-    /opt/scrubber/scripts/scrub.sh -n ${read1_unzip} |& tail -n1 | awk -F" " '{print $1}' > FWD_SPOTS_REMOVED
+    /opt/scrubber/scripts/scrub.sh -i ${read1_unzip} |& tail -n1 | awk -F" " '{print $1}' > FWD_SPOTS_REMOVED

     # gzip dehosted reads
     gzip ${read1_unzip}.clean -c > ~{samplename}_R1_dehosted.fastq.gz
@@ -40,7 +40,7 @@ task ncbi_scrub_pe {
     fi

     # dehost reads
-    /opt/scrubber/scripts/scrub.sh -n ${read2_unzip} |& tail -n1 | awk -F" " '{print $1}' > REV_SPOTS_REMOVED
+    /opt/scrubber/scripts/scrub.sh -i ${read2_unzip} |& tail -n1 | awk -F" " '{print $1}' > REV_SPOTS_REMOVED

     # gzip dehosted reads
     gzip ${read2_unzip}.clean -c > ~{samplename}_R2_dehosted.fastq.gz
@@ -51,6 +51,7 @@ task ncbi_scrub_pe {
     Int read1_human_spots_removed = read_int("FWD_SPOTS_REMOVED")
     Int read2_human_spots_removed = read_int("REV_SPOTS_REMOVED")
     String ncbi_scrub_docker = docker
+
   }
   runtime {
     docker: "~{docker}"
@@ -67,7 +68,7 @@ task ncbi_scrub_se {
   input {
     File read1
     String samplename
-    String docker = "gcr.io/ncbi-sys-gcr-public-research/sra-human-scrubber@sha256:b7dba71079344daea4ea3363e1a67fa54edb7ec65459d039669c68a66d38b140"
+    String docker = "us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.1.0"
     Int disk_size = 100
   }
   String r1_filename = basename(read1)
@@ -85,7 +86,7 @@ task ncbi_scrub_se {
     fi

     # dehost reads
-    /opt/scrubber/scripts/scrub.sh -n ${read1_unzip} |& tail -n1 | awk -F" " '{print $1}' > FWD_SPOTS_REMOVED
+    /opt/scrubber/scripts/scrub.sh -i ${read1_unzip} |& tail -n1 | awk -F" " '{print $1}' > FWD_SPOTS_REMOVED

     # gzip dehosted reads
     gzip ${read1_unzip}.clean -c > ~{samplename}_R1_dehosted.fastq.gz
43 changes: 43 additions & 0 deletions tasks/quality_control/task_pilon.wdl
@@ -0,0 +1,43 @@
+version 1.0
+
+task pilon {
+  input {
+    File assembly
+    File bam
+    File bai
+    String samplename
+    String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/pilon:1.24--hdfd78af_0"
+    Int cpu = 4
+    Int memory = 8
+    Int disk_size = 100
+  }
+  command <<<
+    # version capture
+    pilon --version | cut -d' ' -f3 | tee VERSION
+
+    # run pilon
+    pilon \
+      --genome ~{assembly} \
+      --frags ~{bam} \
+      --output ~{samplename} \
+      --outdir pilon \
+      --changes --vcf
+
+  >>>
+  output {
+    File assembly_fasta = "pilon/~{samplename}.fasta"
+    File changes = "pilon/~{samplename}.changes"
+    File vcf = "pilon/~{samplename}.vcf"
+    String pilon_version = read_string("VERSION")
+    String pilon_docker = "~{docker}"
+  }
+  runtime {
+    docker: "~{docker}"
+    memory: "~{memory} GB"
+    cpu: cpu
+    disks: "local-disk " + disk_size + " SSD"
+    disk: disk_size + " GB" # TES
+    preemptible: 0
+    maxRetries: 3
+  }
+}
20 changes: 15 additions & 5 deletions tasks/quality_control/task_quast.wdl
@@ -4,17 +4,18 @@ task quast {
   input {
     File assembly
     String samplename
+    Int min_contig_len = 500
     String docker = "us-docker.pkg.dev/general-theiagen/staphb/quast:5.0.2"
     Int disk_size = 100
-    Int memory = 2 # added default value
+    Int mem = 2 # added default value
     Int cpu = 2 # added default value
   }
   command <<<
     # capture date and version
     date | tee DATE
     quast.py --version | grep QUAST | tee VERSION

-    quast.py ~{assembly} -o .
+    quast.py ~{assembly} -o . --min-contig ~{min_contig_len}
     mv report.tsv ~{samplename}_report.tsv

     python <<CODE
@@ -34,7 +35,13 @@ task quast {
             n50_value.write(line[1])
         if "GC" in line[0]:
           with open("GC_PERCENT", 'wt') as gc_percent:
-            gc_percent.write(line[1])
+            gc_percent.write(line[1])
+        if "Largest contig" in line[0]:
+          with open("LARGEST_CONTIG", 'wt') as largest_contig:
+            largest_contig.write(line[1])
+        if "# N's per 100 kbp" in line[0]:
+          with open("UNCALLED_BASES", "wt") as uncalled_bases:
+            uncalled_bases.write(line[1])

     CODE

@@ -46,11 +53,14 @@ task quast {
     Int genome_length = read_int("GENOME_LENGTH")
     Int number_contigs = read_int("NUMBER_CONTIGS")
     Int n50_value = read_int("N50_VALUE")
-    Float gc_percent = read_float("GC_PERCENT")
+    Float gc_percent = read_float("GC_PERCENT")
+    Int largest_contig = read_int("LARGEST_CONTIG")
+    Float uncalled_bases = read_float("UNCALLED_BASES")
+    String quast_docker = docker
   }
   runtime {
     docker: "~{docker}"
-    memory: "~{memory} GB"
+    memory: "~{mem} GB"
     cpu: "~{cpu}"
     disks: "local-disk " + disk_size + " SSD"
     disk: disk_size + " GB"
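The two metrics added to the QUAST task are taken from rows of `report.tsv` whose first column is `Largest contig` or `# N's per 100 kbp`. A minimal standalone sketch of that lookup, assuming a two-column, tab-separated report. The function `parse_quast_report` and the sample values are hypothetical, and unlike the task's substring test (`in line[0]`) this version matches row labels exactly.

```python
import csv
import io

# Row label -> type used by the corresponding WDL output (Int / Float).
WANTED = {"Largest contig": int, "# N's per 100 kbp": float}


def parse_quast_report(tsv_text):
    """Collect the selected metrics from a QUAST report.tsv given as text."""
    metrics = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2 and row[0] in WANTED:
            metrics[row[0]] = WANTED[row[0]](row[1])
    return metrics


# Hypothetical report excerpt for illustration.
sample_tsv = (
    "Assembly\tsample\n"
    "Largest contig\t29903\n"
    "GC (%)\t37.97\n"
    "N50\t29903\n"
    "# N's per 100 kbp\t0.00\n"
)
metrics = parse_quast_report(sample_tsv)
```

Exact matching avoids accidental hits on other rows that merely contain the label as a substring, at the cost of breaking if QUAST ever renames a row.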
36 changes: 36 additions & 0 deletions tasks/quality_control/task_readlength.wdl
@@ -0,0 +1,36 @@
+version 1.0
+
+task readlength {
+  input {
+    File read1
+    File read2
+    Int memory = 8
+    String docker = "us-docker.pkg.dev/general-theiagen/staphb/bbtools:38.76"
+    Int disk_size = 100
+  }
+  command <<<
+    # date and version control
+    date | tee DATE
+
+    readlength.sh in=~{read1} > STDOUT_FORWARD
+    readlength.sh in=~{read2} > STDOUT_REVERSE
+
+    avg_forward=$(cat STDOUT_FORWARD | grep "#Avg:" | cut -f 2)
+    avg_reverse=$(cat STDOUT_REVERSE | grep "#Avg:" | cut -f 2)
+
+    result=$(awk "BEGIN { printf \"%.2f\", ($avg_forward + $avg_reverse ) / 2 }")
+    echo $result | tee AVERAGE_READ_LENGTH
+  >>>
+  output {
+    Float average_read_length = read_string("AVERAGE_READ_LENGTH")
+  }
+  runtime {
+    docker: "~{docker}"
+    memory: "~{memory} GB"
+    cpu: 4
+    disks: "local-disk " + disk_size + " SSD"
+    disk: disk_size + " GB" # TES
+    preemptible: 0
+    maxRetries: 3
+  }
+}
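The `readlength` task pulls the `#Avg:` value from each `readlength.sh` report and averages the two, formatted to two decimals. A minimal Python sketch of the same arithmetic follows. The function names and the sample report text are hypothetical; only the `#Avg:` label, the tab-separated second field and the `%.2f` formatting come from the task above.

```python
def average_from_readlength(stdout_text):
    """Return the value on the '#Avg:' line (second tab-separated field)."""
    for line in stdout_text.splitlines():
        if line.startswith("#Avg:"):
            return float(line.split("\t")[1])
    raise ValueError("no #Avg: line found")


def mean_read_length(forward_stdout, reverse_stdout):
    """Average the forward and reverse means, formatted like printf '%.2f'."""
    mean = (average_from_readlength(forward_stdout) + average_from_readlength(reverse_stdout)) / 2
    return f"{mean:.2f}"


# Hypothetical report excerpts for illustration.
forward = "#Reads:\t1000\n#Avg:\t150.0\n"
reverse = "#Reads:\t1000\n#Avg:\t148.5\n"
result = mean_read_length(forward, reverse)
```

This is an unweighted mean of the two per-file averages, so it equals the overall mean read length only when both files hold the same number of reads, which is normally the case for properly paired data.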