TheiaCoV_FASTA: Adding five new organisms #194

Merged on Dec 13, 2023 (48 commits)
Commits
4aa6813
theiacov_fasta workflow updates to include more pathogens
jrotieno Sep 14, 2023
a5f6756
theiacov fasta set level implementation
jrotieno Sep 15, 2023
fc35ef1
theiacov fasta set level implementation
jrotieno Sep 15, 2023
193810a
theiacov fasta set level implementation
jrotieno Sep 15, 2023
dea1595
renamed wf theiacov_fasta_prep
jrotieno Sep 15, 2023
8057490
adding wf_theiacov_fasta_set to dockstore
jrotieno Sep 18, 2023
123a396
theiacov fasta set level implementation
jrotieno Sep 18, 2023
5083540
rename wf to caps
jrotieno Sep 18, 2023
26c76eb
wf renaming
jrotieno Sep 18, 2023
89851d2
WF RENAMING
jrotieno Sep 18, 2023
abfe672
theiacov_fasta set updates
jrotieno Sep 20, 2023
ba690fa
fixing error on empty optional values
jrotieno Sep 20, 2023
fc68b03
update to empty output
jrotieno Sep 20, 2023
b5da94b
adding abricate typing for influenza and using that as input for next…
jrotieno Sep 20, 2023
bff73df
fixing error on empty optional values
jrotieno Sep 20, 2023
3bbe2dd
old tags for abricate
jrotieno Sep 20, 2023
106f5a9
Fix for when only HA or NA is present in assembly
jrotieno Sep 20, 2023
76348db
conditional fix
jrotieno Sep 20, 2023
5cf53e6
fix for when flu is untyped and assembly QC cannot be done
jrotieno Sep 20, 2023
74f6c38
type fix
jrotieno Sep 20, 2023
214ba30
fixed empty file
jrotieno Sep 20, 2023
d2eee66
dummy file
jrotieno Sep 20, 2023
0b001cf
.
jrotieno Sep 20, 2023
5eef197
.
jrotieno Sep 20, 2023
03b6b88
fix for nextclade parser taking no inputs
jrotieno Sep 21, 2023
3d47daa
.
jrotieno Sep 21, 2023
b460287
fix CI error
jrotieno Sep 21, 2023
0b1fd31
wf name changes
jrotieno Sep 21, 2023
9e63f7d
removing unnecessary tasks and workflows
jrotieno Sep 28, 2023
daaaa8d
this may break everything
sage-wright Nov 28, 2023
c014509
i did break everything, here's to hopefully fixing it
sage-wright Nov 29, 2023
d9d75b9
prevent sc2 default overwriting other organisms
sage-wright Nov 29, 2023
9181dd6
only run nextclade if dataset tag found
sage-wright Nov 29, 2023
285de0c
provide default for output nextclade string for WNV failure prevention
sage-wright Nov 29, 2023
4e7b906
change to dataset tag instead of dataset name to accommodate for Yamagata
sage-wright Nov 29, 2023
3d44f7d
Merge branch 'main' into jro_theiacov_fasta_opt
sage-wright Nov 29, 2023
e8fd083
add quotes to prevent consensus qc error
sage-wright Nov 29, 2023
f433b02
update options so vadr runs correctly
sage-wright Nov 29, 2023
638d97f
update defaults
sage-wright Nov 29, 2023
ca03e33
update checksums
sage-wright Nov 29, 2023
cc9c371
provide fake reference for cromwell test
sage-wright Nov 29, 2023
b967881
last checksum update????
sage-wright Nov 29, 2023
26f7c5a
nested variable declarations
sage-wright Dec 6, 2023
277efa4
add default
sage-wright Dec 6, 2023
518fddb
better way of setting variables!!!
sage-wright Dec 7, 2023
86aec0b
bye bye!
sage-wright Dec 7, 2023
4e89707
defaults to prevent failures
sage-wright Dec 8, 2023
cca9459
update nextclade tags
sage-wright Dec 8, 2023
26 changes: 18 additions & 8 deletions tasks/gene_typing/task_abricate.wdl
@@ -60,13 +60,13 @@ task abricate_flu {
File assembly
String samplename
String database = "insaflu"
- String nextclade_flu_h1n1_ha_tag
- String nextclade_flu_h1n1_na_tag
- String nextclade_flu_h3n2_ha_tag
- String nextclade_flu_h3n2_na_tag
- String nextclade_flu_vic_ha_tag
- String nextclade_flu_vic_na_tag
- String nextclade_flu_yam_tag
+ String? nextclade_flu_h1n1_ha_tag
+ String? nextclade_flu_h1n1_na_tag
+ String? nextclade_flu_h3n2_ha_tag
+ String? nextclade_flu_h3n2_na_tag
+ String? nextclade_flu_vic_ha_tag
+ String? nextclade_flu_vic_na_tag
+ String? nextclade_flu_yam_tag
kevinlibuit marked this conversation as resolved.
Int minid = 70
Int mincov =60
Int cpu = 2
@@ -91,7 +91,17 @@ task abricate_flu {
cat ~{samplename}_abricate_hits.tsv | awk -F '\t' '{if ($6=="M1") print $15}' > FLU_TYPE
HA_hit=$(cat ~{samplename}_abricate_hits.tsv | awk -F '\t' '{if ($6=="HA") print $15 }')
NA_hit=$(cat ~{samplename}_abricate_hits.tsv | awk -F '\t' '{if ($6=="NA") print $15 }')
- flu_subtype="${HA_hit}${NA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE
+ if [[ ! (-z "${HA_hit}") && ! (-z "${NA_hit}") ]]; then
+ flu_subtype="${HA_hit}${NA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE
+ fi
+ if [[ -z "${HA_hit}" ]]; then
+ flu_subtype="${NA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE
+ elif [[ -z "${NA_hit}" ]]; then
+ flu_subtype="${HA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE
+ else
+ flu_subtype="${HA_hit}${NA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE
+ fi
+ #flu_subtype="${HA_hit}${NA_hit}" && echo "$flu_subtype" > FLU_SUBTYPE

# set nextclade variables based on subptype
run_nextclade=true
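A note on the HA/NA branching in the hunk above, as a minimal sketch rather than the task's actual code. It assumes `HA_hit` and `NA_hit` are plain single-line strings. The helper names `subtype_branches` and `subtype_concat` are hypothetical and do not appear in the PR. The first helper mirrors the final if/elif/else chain (using POSIX `[ ]` instead of `[[ ]]`), the second mirrors the original one-line concatenation, so the HA-only and NA-only cases can be compared side by side.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not in task_abricate.wdl.

# Mirrors the if/elif/else chain added in the hunk.
subtype_branches() {
  ha_hit="$1"
  na_hit="$2"
  if [ -z "$ha_hit" ]; then
    flu_subtype="$na_hit"
  elif [ -z "$na_hit" ]; then
    flu_subtype="$ha_hit"
  else
    flu_subtype="${ha_hit}${na_hit}"
  fi
  echo "$flu_subtype"
}

# Mirrors the original one-line concatenation.
subtype_concat() {
  echo "${1}${2}"
}

subtype_branches "H1" "N1"   # both segments typed
subtype_branches "H3" ""     # only HA present
subtype_branches "" "N2"     # only NA present
```

Under these assumptions both helpers print the same string for every combination, since concatenating an empty string changes nothing in shell. The explicit branches mainly make the HA-only and NA-only cases visible to the reader.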
4 changes: 2 additions & 2 deletions tasks/quality_control/task_consensus_qc.wdl
@@ -8,7 +8,7 @@ task consensus_qc {
Int disk_size = 100
}
command <<<
- if [ ~{reference_genome} ] ; then
+ if [ -s "~{reference_genome}" ] ; then
Review comment (Contributor):
Oof, good add. Did we get issues with empty reference files?

Reply (Member):
Yup. In one of the iterations of this solution, I had an empty fasta as a placeholder, but that led to failures. This is also just more robust, so I kept it in.

GENOME_LEN=$(grep -v ">" ~{reference_genome} | tr --delete '\n' | wc -c)
elif [ ~{genome_length} ] ; then
GENOME_LEN=~{genome_length}
@@ -27,7 +27,7 @@ task consensus_qc {
num_ACTG=$( grep -v ">" ~{assembly_fasta} | grep -o -E "C|A|T|G" | wc -l )
if [ -z "$num_ACTG" ] ; then num_ACTG="0" ; fi
echo $num_ACTG | tee NUM_ACTG

# calculate percent coverage (Wu Han-1 genome length: 29903bp)
python3 -c "print ( round( ($num_ACTG / $GENOME_LEN ) * 100, 2 ) )" | tee PERCENT_REF_COVERAGE
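On the `-s` change and the empty-placeholder failure mentioned in the review thread, a small sketch may help. It assumes a zero-byte temp file stands in for the empty placeholder. The helper names `old_check`, `new_check`, and `genome_len` are hypothetical. It shows that the one-argument form of `[` only checks that the path string is non-empty, while `[ -s ... ]` also requires that the file exists and is larger than zero bytes.

```shell
#!/bin/sh
# Hypothetical demo; helper names and temp files are not from the PR.

empty_ref="$(mktemp)"                      # zero-byte stand-in for an empty reference
filled_ref="$(mktemp)"
printf '>ref\nACGT\n' > "$filled_ref"      # tiny FASTA with 4 bases

# Old form: true for any non-empty path string, even if the file is empty.
old_check() { if [ "$1" ]; then echo "reference"; else echo "fallback"; fi; }

# New form: true only if the file exists and is larger than zero bytes.
new_check() { if [ -s "$1" ]; then echo "reference"; else echo "fallback"; fi; }

# Same counting pipeline as the task, with portable `tr -d` instead of GNU `--delete`.
genome_len() { grep -v ">" "$1" | tr -d '\n' | wc -c | tr -d ' '; }

echo "empty file:  old=$(old_check "$empty_ref") new=$(new_check "$empty_ref") len=$(genome_len "$empty_ref")"
echo "filled file: old=$(old_check "$filled_ref") new=$(new_check "$filled_ref") len=$(genome_len "$filled_ref")"
```

With the old test, an empty placeholder would be taken as the reference and `GENOME_LEN` would come out as 0, which makes the later `$num_ACTG / $GENOME_LEN` division fail in Python. Whether that was the exact failure seen during development is an assumption.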

2 changes: 1 addition & 1 deletion tasks/quality_control/task_vadr.wdl
@@ -9,7 +9,7 @@ task vadr {
String vadr_opts = "--noseqnamemax --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --out_allfasta"
Int assembly_length_unambiguous
Int skip_length = 10000
- String docker = "us-docker.pkg.dev/general-theiagen/staphb/vadr:1.5"
+ String docker = "us-docker.pkg.dev/general-theiagen/staphb/vadr:1.5.1"
Int minlen = 50
Int maxlen = 30000
Int cpu = 2
145 changes: 73 additions & 72 deletions tasks/taxon_id/task_nextclade.wdl
@@ -54,6 +54,7 @@ task nextclade {
File auspice_json = "~{basename}.nextclade.auspice.json"
File nextclade_tsv = "~{basename}.nextclade.tsv"
String nextclade_docker = docker
+ String nextclade_dataset_tag = "~{dataset_tag}"
}
}

@@ -168,76 +169,76 @@ task nextclade_output_parser {
}

task nextclade_add_ref {
- meta {
- description: "Nextclade task to add samples to either a user specified or a nextclade reference tree."
- }
- input {
- File genome_fasta
- File? root_sequence
- File? reference_tree_json
- File? qc_config_json
- File? gene_annotations_gff
- File? pcr_primers_csv
- File? virus_properties
- String docker = "us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:2.14.0"
- String dataset_name
- String? dataset_reference
- String? dataset_tag
- Int disk_size = 50
- }
- String basename = basename(genome_fasta, ".fasta")
- command <<<
- NEXTCLADE_VERSION="$(nextclade --version)"
- echo $NEXTCLADE_VERSION > NEXTCLADE_VERSION

- nextclade dataset get \
- --name="~{dataset_name}" \
- ~{"--reference " + dataset_reference} \
- ~{"--tag " + dataset_tag} \
- -o nextclade_dataset_dir \
- --verbose

- # If no referece sequence is provided, use the reference tree from the dataset
- if [ -z "~{reference_tree_json}" ]; then
- echo "Default dataset reference tree JSON will be used"
- cp nextclade_dataset_dir/tree.json reference_tree.json
- else
- echo "User reference tree JSON will be used"
- cp ~{reference_tree_json} reference_tree.json
- fi

- tree_json="reference_tree.json"

- set -e
- nextclade run \
- --input-dataset=nextclade_dataset_dir/ \
- ~{"--input-root-seq " + root_sequence} \
- --input-tree ${tree_json} \
- ~{"--input-qc-config " + qc_config_json} \
- ~{"--input-gene-map " + gene_annotations_gff} \
- ~{"--input-pcr-primers " + pcr_primers_csv} \
- ~{"--input-virus-properties " + virus_properties} \
- --output-json "~{basename}".nextclade.json \
- --output-tsv "~{basename}".nextclade.tsv \
- --output-tree "~{basename}".nextclade.auspice.json \
- --output-all=. \
- "~{genome_fasta}"
- >>>
- runtime {
- docker: "~{docker}"
- memory: "8 GB"
- cpu: 2
- disks: "local-disk " + disk_size + " SSD"
- disk: disk_size + " GB" # TES
Review comment (Contributor): TES contributions will not be forgotten ✊
- dx_instance_type: "mem1_ssd1_v2_x2"
- maxRetries: 3
- }
- output {
- String nextclade_version = read_string("NEXTCLADE_VERSION")
- File nextclade_json = "~{basename}.nextclade.json"
- File auspice_json = "~{basename}.nextclade.auspice.json"
- File nextclade_tsv = "~{basename}.nextclade.tsv"
- String nextclade_docker = docker
- File netclade_ref_tree = "reference_tree.json"
- }
+ meta {
+ description: "Nextclade task to add samples to either a user specified or a nextclade reference tree."
+ }
+ input {
+ File genome_fasta
+ File? root_sequence
+ File? reference_tree_json
+ File? qc_config_json
+ File? gene_annotations_gff
+ File? pcr_primers_csv
+ File? virus_properties
+ String docker = "us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:2.14.0"
+ String dataset_name
+ String? dataset_reference
+ String? dataset_tag
kevinlibuit marked this conversation as resolved.
+ Int disk_size = 50
+ }
+ String basename = basename(genome_fasta, ".fasta")
+ command <<<
+ NEXTCLADE_VERSION="$(nextclade --version)"
+ echo $NEXTCLADE_VERSION > NEXTCLADE_VERSION

+ nextclade dataset get \
+ --name="~{dataset_name}" \
+ ~{"--reference " + dataset_reference} \
+ ~{"--tag " + dataset_tag} \
+ -o nextclade_dataset_dir \
+ --verbose

+ # If no referece sequence is provided, use the reference tree from the dataset
+ if [ -z "~{reference_tree_json}" ]; then
+ echo "Default dataset reference tree JSON will be used"
+ cp nextclade_dataset_dir/tree.json reference_tree.json
+ else
+ echo "User reference tree JSON will be used"
+ cp ~{reference_tree_json} reference_tree.json
+ fi

+ tree_json="reference_tree.json"

+ set -e
+ nextclade run \
+ --input-dataset=nextclade_dataset_dir/ \
+ ~{"--input-root-seq " + root_sequence} \
+ --input-tree ${tree_json} \
+ ~{"--input-qc-config " + qc_config_json} \
+ ~{"--input-gene-map " + gene_annotations_gff} \
+ ~{"--input-pcr-primers " + pcr_primers_csv} \
+ ~{"--input-virus-properties " + virus_properties} \
+ --output-json "~{basename}".nextclade.json \
+ --output-tsv "~{basename}".nextclade.tsv \
+ --output-tree "~{basename}".nextclade.auspice.json \
+ --output-all=. \
+ "~{genome_fasta}"
+ >>>
+ runtime {
+ docker: "~{docker}"
+ memory: "8 GB"
+ cpu: 2
+ disks: "local-disk " + disk_size + " SSD"
+ disk: disk_size + " GB"
+ dx_instance_type: "mem1_ssd1_v2_x2"
+ maxRetries: 3
+ }
+ output {
+ String nextclade_version = read_string("NEXTCLADE_VERSION")
+ File nextclade_json = "~{basename}.nextclade.json"
+ File auspice_json = "~{basename}.nextclade.auspice.json"
+ File nextclade_tsv = "~{basename}.nextclade.tsv"
+ String nextclade_docker = docker
+ File netclade_ref_tree = "reference_tree.json"
+ }
}
3 changes: 2 additions & 1 deletion tests/inputs/theiacov/wf_theiacov_fasta.json
@@ -2,5 +2,6 @@
"theiacov_fasta.samplename": "fasta",
"theiacov_fasta.assembly_fasta": "tests/data/theiacov/fasta/clearlabs.fasta.gz",
"theiacov_fasta.seq_method": "clearlabs",
- "theiacov_fasta.input_assembly_method": "clearlabs"
+ "theiacov_fasta.input_assembly_method": "clearlabs",
+ "theiacov_fasta.reference_genome": "tests/inputs/completely-empty-for-test.txt"
}
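The checksum updates in the test definitions that follow are a consequence of the command changes: the tests pin the md5 of each rendered `command` file, so any edit to a task's command block changes the digest. A minimal sketch of that effect, assuming GNU coreutils `md5sum` is available. The two strings are abbreviated stand-ins for the rendered command, not the real file contents, and `digest` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical stand-ins for the rendered command before and after the PR.
old_cmd='if [ ref.fasta ] ; then'
new_cmd='if [ -s "ref.fasta" ] ; then'

# Print only the hex digest of a string (md5sum prints "<digest>  -" for stdin).
digest() { printf '%s\n' "$1" | md5sum | cut -d ' ' -f 1; }

echo "old: $(digest "$old_cmd")"
echo "new: $(digest "$new_cmd")"
```

To refresh a pinned value after an intentional change, the digest of the actual file can be taken the same way, for example `md5sum miniwdl_run/call-consensus_qc/command`, run from the directory that holds the test run's output.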
2 changes: 1 addition & 1 deletion tests/workflows/theiacov/test_wf_theiacov_clearlabs.yml
@@ -88,7 +88,7 @@
  - path: miniwdl_run/call-consensus/work/primer-schemes/SARS-CoV-2/Vuser/SARS-CoV-2.scheme.bed
    md5sum: d5ad850f8c610dc45162957ab84530d6
  - path: miniwdl_run/call-consensus_qc/command
-   md5sum: 1736bbc2b16e75dbeb37076bacedc129
+   md5sum: 3ded305519281d6609fda355bf1c060b
  - path: miniwdl_run/call-consensus_qc/inputs.json
    contains: ["assembly_fasta", "medaka"]
  - path: miniwdl_run/call-consensus_qc/outputs.json
12 changes: 6 additions & 6 deletions tests/workflows/theiacov/test_wf_theiacov_fasta.yml
@@ -18,7 +18,7 @@
  - wf_theiacov_fasta_miniwdl
  files:
  - path: miniwdl_run/call-consensus_qc/command
-   md5sum: b89c8a9a0b9e27b26454ba7d668d68f4
+   md5sum: 57cce4e7c41e1ff0f9a9883605d84695
  - path: miniwdl_run/call-consensus_qc/inputs.json
  - path: miniwdl_run/call-consensus_qc/outputs.json
  - path: miniwdl_run/call-consensus_qc/stderr.txt
@@ -38,7 +38,7 @@
    md5sum: 6808ca805661622ad65ae014a4b2a094
  - path: miniwdl_run/call-consensus_qc/work/_miniwdl_inputs/0/clearlabs.fasta.gz
  - path: miniwdl_run/call-nextclade/command
-   md5sum: ed29cde6f430eff4c408d9ea214ebe85
+   md5sum: b5ecaad831316b3bd8f066f1e71cc0a5
  - path: miniwdl_run/call-nextclade/inputs.json
  - path: miniwdl_run/call-nextclade/outputs.json
  - path: miniwdl_run/call-nextclade/stderr.txt
@@ -69,11 +69,11 @@
  - path: miniwdl_run/call-nextclade/work/nextclade_dataset_dir/reference.fasta
    md5sum: c7ce05f28e4ec0322c96f24e064ef55c
  - path: miniwdl_run/call-nextclade/work/nextclade_dataset_dir/sequences.fasta
-   md5sum: ea475ab0a62a0a68fc3b1108fdff8a20
+   md5sum: bb6b4e9e91304a396724bcb6344b8a5d
  - path: miniwdl_run/call-nextclade/work/nextclade_dataset_dir/tag.json
-   md5sum: 6a17b1ee5449279af7bdd0922545d7b8
+   md5sum: 97e1309e683fbaaa839198d88cd4e2d9
  - path: miniwdl_run/call-nextclade/work/nextclade_dataset_dir/tree.json
-   md5sum: 13eb330629b6ef17a070fcb6283bea2f
+   md5sum: 6892e6019bf88ec571b4560d66d3acb0
  - path: miniwdl_run/call-nextclade/work/nextclade_dataset_dir/virus_properties.json
  - path: miniwdl_run/call-nextclade/work/nextclade_gene_E.translation.fasta
    md5sum: dc43b1e98245a25c142aec52b29a07df
@@ -149,7 +149,7 @@
    md5sum: f4ad614b7ad39f28a8145cec280a93c0
  - path: miniwdl_run/call-vadr/inputs.json
  - path: miniwdl_run/call-vadr/outputs.json
-   md5sum: f58a2654f9ba9d49617f643b59ae739f
+   md5sum: e35217438ca21b347ef68e157c480c2e
  - path: miniwdl_run/call-vadr/stderr.txt
  - path: miniwdl_run/call-vadr/stderr.txt.offset
  - path: miniwdl_run/call-vadr/stdout.txt
2 changes: 1 addition & 1 deletion tests/workflows/theiacov/test_wf_theiacov_ont.yml
@@ -94,7 +94,7 @@
  - path: miniwdl_run/call-consensus/work/primer-schemes/SARS-CoV-2/Vuser/SARS-CoV-2.scheme.bed
    md5sum: d5ad850f8c610dc45162957ab84530d6
  - path: miniwdl_run/call-consensus_qc/command
-   md5sum: 770764cd13027f258bf2a871c720c80d
+   md5sum: 2b043e77f5254e0a8002aa32693edeb8
  - path: miniwdl_run/call-consensus_qc/inputs.json
    contains: ["assembly_fasta", "medaka"]
  - path: miniwdl_run/call-consensus_qc/outputs.json
[file header not captured in source]
@@ -550,7 +550,7 @@
  - path: miniwdl_run/wdl/tasks/assembly/task_shovill.wdl
    md5sum: ca45f97152cb9536f2bb0603382021bd
  - path: miniwdl_run/wdl/tasks/gene_typing/task_abricate.wdl
-   md5sum: 8ea4befaa7a09b0def8d033cb9b806d1
+   md5sum: 49018b0dc2b173bc9e0f3893b8be8e7c
  - path: miniwdl_run/wdl/tasks/gene_typing/task_amrfinderplus.wdl
    md5sum: 249db321d15832002c4945394ae9af76
  - path: miniwdl_run/wdl/tasks/gene_typing/task_bakta.wdl
[file header not captured in source]
@@ -518,7 +518,7 @@
  - path: miniwdl_run/wdl/tasks/assembly/task_shovill.wdl
    md5sum: ca45f97152cb9536f2bb0603382021bd
  - path: miniwdl_run/wdl/tasks/gene_typing/task_abricate.wdl
-   md5sum: 8ea4befaa7a09b0def8d033cb9b806d1
+   md5sum: 49018b0dc2b173bc9e0f3893b8be8e7c
  - path: miniwdl_run/wdl/tasks/gene_typing/task_amrfinderplus.wdl
    md5sum: 249db321d15832002c4945394ae9af76
  - path: miniwdl_run/wdl/tasks/gene_typing/task_bakta.wdl