Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KmerFinder to TheiaProk #188

Merged
merged 27 commits into from
Oct 9, 2023
Merged

KmerFinder to TheiaProk #188

merged 27 commits into from
Oct 9, 2023

Conversation

cimendes
Copy link
Member

@cimendes cimendes commented Sep 14, 2023

Closes #187

🛠️ Changes Being Made

This PR adds the option to run Kmerfinder for bacterial genomes if the user requests it.

Bonus: sistr's sistr_allele_fasta no longer reads sister_allele_fasta (@emmadoughty)

🧠 Context and Rationale

  • The kmerfinder bacterial database was uploaded to RP Google bucket, available at gs://theiagen-public-files-rp/terra/theiaprok-files/kmerfinder_bacteria.tar.gz. It's about 30Gb in size.
  • KmerFinder_bacterial task was implemented
  • a Boolean was added in TheiaProk_illumina_PE, TheiaProk_illumina_SE, TheiaProk_Fasta and TheiaProk_ONT to optionally run Kmerfinder

📋 Workflow/Task Steps

  • The database is decompressed and passed to kmerfinder
  • The resulting txt file is parsed to extract the species ID of the top hit
  • the results file and top hit are outputted

Inputs

  • New input call_kmerfinder set to false by default

Outputs

output {
    String kmerfinder_docker = docker
    File? kmerfinder_results_tsv = "~{samplename}_kmerfinder.tsv"
    String kmerfinder_top_hit = read_string("TOP_HIT")
    String kmerfinder_query_coverage = read_string("QC_METRIC")
    String kmerfinder_template_coverage = read_string("TEMPLATE_COVERAGE")
    String kmerfinder_database = read_string("DATABASE")
  }

🧪 Testing

Locally

Tested task locally with miniwdl

Terra

Underway!!!

TheiaProk_Illumina_PE (n=55): https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/349c9d0f-b2b6-4e94-8156-7680a372a7b6

TheiaProk_Illumina_SE (n=55): https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/47a60d7f-e33b-4cda-97d3-2bb5fd22917b

TheiaProk_FASTA (n=55): https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/ef0e66e5-9d3f-4199-bde3-807b1db7e2f2

Table of differences between Gambit and KmerFinder:
amrfinderplus_testing_sample.zip

TheiaProk_ONT (n=44): https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/6f46335b-1d52-490b-b289-ee274a4fbc86

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

@michellescribner
Copy link
Contributor

@michellescribner
Copy link
Contributor

Thanks for all of the changes @cimendes! I just launched a test of TheiaProk_Illumina_PE_PHB https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/f644e3cd-a563-473f-b9d3-9e9f4c0fbc8b

@michellescribner
Copy link
Contributor

Bug: If kmerfinder does not make any prediction (the result file is empty), the task will populate the string output with the header line. See example sample in this run: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/f644e3cd-a563-473f-b9d3-9e9f4c0fbc8b.

If there is no output from Kmerfinder, the ideal outcome is for these strings to be populated with "No match detected" or similar.

@cimendes
Copy link
Member Author

cimendes commented Oct 2, 2023

Thanks @michellescribner for finding this nasty bug! I've added a patch that fixes this scenario and the headers no longer are populated in the datatable. Instead the "No hit detected in database" string appears!
image

@michellescribner
Copy link
Contributor

@michellescribner
Copy link
Contributor

Tested by @kelseykropp and found that 27/27 samples showed results that matched the kmerfinder online tool. https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Kropp_Sandbox/job_history/be7a6dca-b5b0-4376-92a9-a827bbbc1da8
One sample failed this analysis due to a workflow error at the ANI step. Seems to be an independent bug in the ANI module.

Copy link
Contributor

@michellescribner michellescribner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @cimendes!

@michellescribner michellescribner merged commit 7828f47 into main Oct 9, 2023
12 checks passed
@cimendes cimendes deleted the im-kmerfinder branch October 11, 2023 09:08
sage-wright pushed a commit that referenced this pull request Oct 12, 2023
…l TheiaCov wfs (#208)

* update default nextclade dataset tag for SC2 to "2023-09-21T12:00:00Z". applied on wfs theiacov clearlabs, fasta, ilmn pe, ilmn se, ont

* KmerFinder to TheiaProk (#188)

* skeleton on kmerfinder task

* remove kmerfinder_db_name

* still trying to get kmerfinder to run

* is this working now?!?

* first working version of kmerfinder task

* parse top hit

* make results output file optional

* final version of task (for now..)

* add kmerfinder_bacteria to theiaprok!

* fix typo

* pass empty file for tests - kmerfinder_db

* fix input json

* add kmerfinder query coverage to theiaprok SE, PE, FASTA and ONT

* remove undeeded skip on kmerfinder (set to false by default)

* fix ops

* update md5sum (part1)

* update md5sum (part2)

* fiz sister typo

* missed a file

* update CI

* add kmerfinder_template_coverage to theiaprok output

* expose database name on kmerfinder outputs

* update md5sum

* fix bug - headers being outputed to datatable

* update default nextclade dataset tag for SC2 to "2023-09-21T12:00:00Z". applied on wfs theiacov clearlabs, fasta, ilmn pe, ilmn se, ont

* update CI

---------

Co-authored-by: Inês Mendes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Kmerfinder in theiaprok as optional
2 participants