[Kraken2] Add module to recalculate abundances based on fragment length - Kraken2_ont wf and TheiaCoV_ONT wf #240

Merged
merged 18 commits into from
Feb 20, 2024

Conversation

@cimendes commented Nov 8, 2023

Closes #167

🛠️ Changes Being Made

This PR implements a new task kraken2_parse_classified that takes as input the classified reads file from Kraken2, alongside the Kraken2 report. This task computes the abundances based on fragment length for each taxon_id in the classified reads file, and parses the report to populate the taxon name.

The new output report has five tab-separated columns: Percent, Num_basepairs, Rank, Taxon_ID, Name.
As with the Kraken2 report, no header line is included.

98.38943139373141	13934600	U	0	unclassified
0.08279494729112387	11726	D	10239	  Viruses
0.08229363045182063	11655	K	2731360	      Heunggongvirae
0.04975040070043142	7046	C	2731619	          Caudoviricetes
0.33614353195365293	47607	S	2859072	                Vibrio phage BUCT194
0.009560323949529397	1354	S	2886042	                  Shigella virus Moo19
0.31753126169445095	44971	S	2163605	              Pseudomonas phage vB_PaeM_PA5oct
0.04687665487512974	6639	S	12336	              Clostridium phage c-st
0.014517005938133267	2056	S	2969598	              Yangshan Harbor Poseidoniales virus
0.025461246796161748	3606	S	2530023	              Pseudomonas phage Psa21
0.06448629145572525	9133	S1	1450749	                      IAS virus
0.03830484300309969	5425	G	2948731	                Gruunavirus
0.05357029379991103	7587	S1	1391428	                  Escherichia phage 4MG
0.0217966913088606	3087	S	2924887	                    Aeromonas phage ZPAH14
0.08151694239092828	11545	S	2126984	                    Tupanvirus deep ocean
0.01912770869961236	2709	S1	1269028	                    Acanthamoeba polyphaga moumouvirus
0.004716614769782598	668	S1	212035	                    Acanthamoeba polyphaga mimivirus
0.04971509669766358	7041	S	251749	                    Phaeocystis globosa virus
0.06076524956399557	8606	S	2023057	                  Orpheovirus IHUMI-LCC2
0.02257337936975294	3197	S1	256729	                      Lymphocystis disease virus - isolate China
0.024705741136930106	3499	S	336486	                    Turkeypox virus
0.00835292705486948	1183	S	10276	                    Swinepox virus
0.01825216943097008	2585	S	2740746	        Fadolivirus 1
0.07009962789581083	9928	S1	224399	              Adoxophyes honmai nucleopolyhedrovirus
0.10765602604023243	15247	S1	2602116	                      Niukluk phantom virus
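
For anyone consuming the new report downstream, here is a minimal parsing sketch. It assumes the five columns are tab-separated and that the leading spaces in Name are only indentation. `ReportRow` and `parse_recalculated_report` are hypothetical names used for illustration; they are not part of the task.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    """One line of the recalculated report (hypothetical helper type)."""
    percent: float
    num_basepairs: int
    rank: str
    taxon_id: int
    name: str


def parse_recalculated_report(text: str) -> list[ReportRow]:
    """Parse headerless, tab-separated report text into typed rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most five fields so the name stays in one piece.
        percent, num_bp, rank, taxon_id, name = line.split("\t", 4)
        rows.append(
            ReportRow(float(percent), int(num_bp), rank, int(taxon_id), name.strip())
        )
    return rows


# Three rows copied from the example report above.
sample = (
    "98.38943139373141\t13934600\tU\t0\tunclassified\n"
    "0.08279494729112387\t11726\tD\t10239\t  Viruses\n"
    "0.04975040070043142\t7046\tC\t2731619\t          Caudoviricetes\n"
)
rows = parse_recalculated_report(sample)
for row in rows:
    print(row.taxon_id, row.name, row.num_basepairs)
```

The indentation is stripped here for convenience; keep `name` unstripped if the hierarchy depth matters to you.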

Additionally, a previously undetected syntax error in the kraken2_theiacov tasks has been fixed. It caused the tasks to fail whenever a target organism was passed.

Impacted Workflows/Tasks

A new workflow has been added:

  • Kraken2_ONT_PHB

The following workflow has been adjusted to include the new abundance recalculation step:

  • TheiaCoV_ONT_PHB

🧠 Context and Rationale

An assessment was performed to evaluate the performance of Kraken2 on long, error-prone Oxford Nanopore reads. In this assessment, recalculating abundances from the number of basepairs (instead of Kraken2's default behaviour of calculating abundances from the number of fragments) produced the expected results.
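
To illustrate why the two definitions can diverge on reads of very different lengths, here is a toy comparison. The read lengths and labels are invented for illustration and are not taken from the assessment.

```python
from collections import Counter

# Invented toy data: (taxon label, read length in bp).
# Three short reads from one taxon and one long read from another.
reads = [
    ("human", 500),
    ("human", 500),
    ("human", 500),
    ("virus", 8500),
]


def percent_by_fragments(reads):
    """Abundance as the share of reads (fragments) per taxon."""
    counts = Counter(taxon for taxon, _ in reads)
    total = sum(counts.values())
    return {taxon: 100 * n / total for taxon, n in counts.items()}


def percent_by_basepairs(reads):
    """Abundance as the share of sequenced basepairs per taxon."""
    basepairs = Counter()
    for taxon, length in reads:
        basepairs[taxon] += length
    total = sum(basepairs.values())
    return {taxon: 100 * n / total for taxon, n in basepairs.items()}


print(percent_by_fragments(reads))  # {'human': 75.0, 'virus': 25.0}
print(percent_by_basepairs(reads))  # {'human': 15.0, 'virus': 85.0}
```

With read lengths this uneven, the single long read dominates once abundance is weighted by basepairs.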

📋 Workflow/Task Steps

For Kraken2_ONT_PHB, the following steps are taken:

  • Kraken2 task is run with the provided database on the input ONT reads
  • The classified reads file and report file are passed to the new kraken2_parse_classified task where abundances are recalculated
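
The recalculation step can be sketched roughly as follows. This is not the task's actual code. It assumes the standard Kraken2 per-read output (classification flag, read ID, taxon ID, length, LCA mapping), a standard six-column Kraken2 report, single-end reads, and numeric taxon IDs (no `--use-names`). The function names and the sort order are hypothetical.

```python
from collections import defaultdict


def basepairs_per_taxon(classified_lines):
    """Sum read lengths (column 4) per taxon ID (column 3)."""
    totals = defaultdict(int)
    for line in classified_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        totals[int(fields[2])] += int(fields[3])
    return dict(totals)


def ranks_and_names(report_lines):
    """Map taxon ID to (rank code, indented name) from a Kraken2 report."""
    info = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        info[int(fields[4])] = (fields[3], fields[5])
    return info


def recalculate(classified_lines, report_lines):
    """Build (percent, basepairs, rank, taxon_id, name) rows."""
    basepairs = basepairs_per_taxon(classified_lines)
    info = ranks_and_names(report_lines)
    total = sum(basepairs.values())
    if total == 0:
        return []
    rows = []
    # Sorting by basepairs is an assumption made for readability only.
    for taxon_id, n in sorted(basepairs.items(), key=lambda kv: kv[1], reverse=True):
        rank, name = info.get(taxon_id, ("NA", "NA"))
        rows.append((100 * n / total, n, rank, taxon_id, name))
    return rows


# Invented toy inputs in the assumed formats.
classified = [
    "U\tread1\t0\t9000\t0:8966",
    "C\tread2\t10239\t600\t10239:566",
    "C\tread3\t10239\t400\t10239:366",
]
report = [
    " 33.33\t1\t1\tU\t0\tunclassified",
    " 66.67\t2\t2\tD\t10239\t  Viruses",
]
rows = recalculate(classified, report)
for row in rows:
    print("\t".join(str(value) for value in row))
```

The example rows in the description are consistent with the percentage being each taxon's basepairs divided by the total basepairs across all reads, which is what this sketch computes.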

For TheiaCoV_ONT_PHB, the following steps were added:

  • In the read_QC_trim_ont subworkflow, the classified reads file and report file are passed to the new kraken2_parse_classified task where abundances are recalculated for both raw and dehosted reads
  • The Kraken2 outputs were replaced by those obtained from the new task

Inputs

For Kraken2_ONT_PHB:

  • String samplename (mandatory)
  • File read1 (mandatory)
  • File kraken2_db (mandatory)

For TheiaCoV_ONT_PHB:

  • No inputs or outputs have been altered

Outputs

For Kraken2_ONT_PHB:

    # PHB Version Captures
    String kraken2_se_wf_version = version_capture.phb_version
    String kraken2_se_wf_analysis_date = version_capture.date
    # Kraken2
    String kraken2_version = kraken2_se.kraken2_version
    String kraken2_docker = kraken2_se.kraken2_docker
    File kraken2_report = kraken2_recalculate_abundances.kraken_report
    File kraken2_classified_report = kraken2_se.kraken2_classified_report
    File kraken2_unclassified_read1 = kraken2_se.kraken2_unclassified_read1
    File kraken2_classified_read1 = kraken2_se.kraken2_classified_read1

For TheiaCoV_ONT_PHB:

  • No inputs or outputs have been altered

Impacted Outputs

For TheiaCoV_ONT_PHB:

    # Read QC - kraken outputs raw
    Float? kraken_human = read_qc_trim.kraken_human
    Float? kraken_sc2 = read_qc_trim.kraken_sc2
    String? kraken_target_org = read_qc_trim.kraken_target_org
    File? kraken_report = read_qc_trim.kraken_report
    # Read QC - kraken outputs dehosted
    Float? kraken_human_dehosted = read_qc_trim.kraken_human_dehosted
    Float? kraken_sc2_dehosted = read_qc_trim.kraken_sc2_dehosted
    String? kraken_target_org_dehosted = read_qc_trim.kraken_target_org_dehosted
    File? kraken_report_dehosted = read_qc_trim.kraken_report_dehosted

🧪 Testing

Locally

miniwdl run --task kraken2_parse_classified /home/ines_mendes/Git/public_health_bioinformatics/tasks/taxon_id/task_kraken2.wdl kraken2_classified_report=20231107_110946_kraken2_standalone/out/kraken2_classified_report/ERR3772599.classifiedreads.txt.gz samplename=ERR3772599 kraken2_report=20231107_110946_kraken2_standalone/out/kraken2_report/ERR3772599.report.txt

Terra

Underway

Kraken2_ONT_PHB
9 in silico samples of human sequences mixed with target viral sequences: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/b795db74-97fa-49f4-a851-1cbdad2b21e8

Scenarios for Reviewer to Test

  • The fix to the kraken2_theiacov target organism calculation works correctly and outputs the result for the input target_org
  • Test Kraken2_ONT_PHB workflow on known abundance samples
  • Test TheiaCoV_ONT_PHB on samples with known abundance and assert the new results

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

@cimendes cimendes marked this pull request as ready for review December 20, 2023 13:27

@sage-wright sage-wright left a comment

tasks/taxon_id/task_kraken2.wdl (outdated review thread, resolved)
@sage-wright

Testing Kraken2_ONT here and TheiaCoV_ONT here. Will merge upon successful completion.

@sage-wright sage-wright left a comment

Results look good!

@sage-wright sage-wright merged commit 4032d22 into main Feb 20, 2024
13 checks passed
@sage-wright sage-wright deleted the im-kraken2-ont branch February 20, 2024 17:38
Successfully merging this pull request may close these issues.

Taxonomic assignment of reads in TheiaProk/TheiaCoV ONT