Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Clair3_Variants] Clair3_Variants_ONT_PHB workflow #708

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

Michal-Babins
Copy link
Contributor

@Michal-Babins Michal-Babins commented Dec 31, 2024

This PR closes #707

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

Here we introduce the Clair3_Variants_ONT_PHB MVP workflow designed to work with ONT data and various chemistries. The workflow aligns ONT reads to a reference genome using minimap2 with long-read specific settings. The aligned reads are then processed into a sorted BAM file with it's index, which are required for variant calling with clair3. We also perform an indexing operation using samtools faidx to index the reference, also required by Clair3. Clair3 performs the variant calling with settings defaulted for haploid organisms, though these can be adjusted for diploid analysis (rare). The workflow produces multiple VCF files including a final merged VCF containing the high-confidence variant calls, along with intermediate files from both pileup and full-alignment approaches.

⚡ Impacted Workflows/Tasks

Workflows:

  • Clair3_Variants_ONT (new)
  • TheiaMeta_Illumina_PHB
  • Freyja_Fastq_PHB
    Task:
  • task_clair3_variants (new)
  • samtools_faidx (new)
  • task_minimap2

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

  • Adds new workflow Clair3_Variants_ONT to standalone workflows
  • Adds new task clair3_variants
  • Creates new task samtools_faidx for indexing fasta
  • Adds long_read_flags Boolean to minimap to handle long read specific formatting in sam / bam formats. If set to true the following flags are set -L --cs --MD.
  • Updates Freyja_Fastq and ThieaMeta_Illumina minimap2 tasks and sets long_read_flags = false` so those options do not get exposed for the tasks.
  • Add new documentation for Clair3_Variants
  • Adds new docker image to theiagen builds: us-docker.pkg.dev/general-theiagen/theiagen/clair3-extra-models:1.0.10

NOTE: Docker build uses staphb docker as base, but adds models from here deemed latest. Does not download deprecated model configurations.

⚙️ Algorithm

Clair3_variants_ont takes the following approach:

  1. Reads are aligned to a reference genome using minimap2 in ont mode with specific long-read flags (-L --cs --MD) to handle long CIGAR strings and provide detailed alignment information.
  2. The resulting SAM file is converted to BAM format, sorted, and indexed using samtools, while simultaneously the reference genome is indexed with samtools faidx.
  3. Clair3 performs variant calling using these prepared files, with default settings optimized for haploid genomes (disable_phasing=true, enable_haploid_precise=true, include_all_contigs=true).
  4. Variants are called using two approaches: a pileup-based approach for high-confidence regions and a full-alignment approach for more complex regions, which are then merged into a final VCF.
  5. The workflow generates multiple output VCF files (pileup, full-alignment, and merged final), with an optional gvcf output that includes non-variant sites, while maintaining version tracking throughout all steps.

➡️ Inputs

File read1
File reference_genome_file
String samplename
String? clair3_model
String? clair3_docker
Int? clair3_memory
Int? clair3_cpu
Int? clair3_disk_size
Int? clair3_variant_quality
Boolean? clair3_include_all_contigs
Boolean? clair3_enable_haploid_precise
Boolean? clair3_disable_phasing
Boolean? clair3_enable_gvcf
Boolean? clair3_enable_long_indel

⬅️ Outputs

String samtools_version = sam_to_sorted_bam.samtools_version
File aligned_bam = sam_to_sorted_bam.bam
File aligned_bai = sam_to_sorted_bam.bai
# Data handling - Reference genome - samtools version
File aligned_fai = samtools_faidx.fai
#Clair3 variant calling
String clair3_variants_wf_version = version_capture.phb_version
String clair3_version = clair3_variants.clair3_version
File clair3_variants_final_vcf = clair3_variants.clair3_variants_final_vcf
File clair3_variants_pileup_vcf = clair3_variants.clair3_variants_pileup_vcf
File clair3_variants_full_alignment_vcf = clair3_variants.clair3_variants_full_alignment_vcf
File? clair3_variants_gvcf = clair3_variants.clair3_variants_gvcf
String clair3_docker_image = clair3_variants.clair3_variants_docker_image
String clair3_model_used = clair3_variants.clair3_model_used

🧪 Testing

Tested against dataset from this paper minus 4 who didn't have a reference. Test found here
Test against a few A. fumagatis ont
Test with clair3 model set to ont
Test with clair3 model set to ont_guppy2
Test with clair3 model set to ont_guppy5
Test with clair3 model set to r941_prom_sup_g5014
Test with clair3 model set to r941_prom_hac_g360+g422
Test with clair3 model set to r941_prom_hac_g238
Test with clair3 model set to r1041_e82_400bps_sup_v500
Test with clair3 model set to r1041_e82_400bps_hac_v500
Test with clair3 model set to r1041_e82_400bps_sup_v410
Test with clair3 model set to r1041_e82_400bps_hac_v410
Test with diploid settings, note that diploid settings will only work with reference that has chromosomes specified in the header, see test expected to fail. Non-human should always use --include-all-ctgs flag.
Test with gvcf enabled
Test with long indels enabled
Test TheiaMeta_Illumina_PE to make sure minimap2 update works as expected.
Test Freyja_Fastq with ont set to true to make sure minimap2 update works as expected.

Suggested Scenarios for Reviewer to Test

Test with the clair3_combined datadable, not all but some. Any ont data with a reference is valid to test. Test different parameters. Just note that for haploids, all haploid settings should always be true. Diploid get's a little trickier because it expects chromosomes in the reference which really just exists for human.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the latest version for any affected worklows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

…Clair3 workflow to use the correct reference genome file
…and reference files with expected names for compatibility
…meters for haploid analysis and model configurations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New Workflow] Clair3_variants
2 participants