Pipeline purpose and official outputs. #22
-
@alkaZeltser do you have any numbers (or ball-park estimates) comparing the time and resources of running this off-target pipeline vs. the extra time to just run call-gSNP on the entire genome?

@alkaZeltser @yashpatel6 @tyamaguchi-ucla @bethneilsen I would appreciate a meeting to discuss this and the SQC-DNA and CQC-DNA pipelines. I'm supposed to work on these, but am unclear what to start on and where I can help.
-
Interesting. I guess that makes sense: the landscape of errors (and maybe true variants) is different outside of genes and with more variable coverage, so mixing the two decreases the predictive power. I hope you don't mind me continuing to spew out thoughts that you've probably already considered: how about just calling variants separately on the entire complement of the targeted regions? It seems to me that GATK would quickly skip over the lower-coverage and lower-quality regions (and this could be tweaked for stringency).
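To make the suggestion concrete, here is a minimal sketch of what "the complement of the targeted regions" means in coordinate terms (the function name and the toy chromosome size are illustrative assumptions, not part of the pipeline; in practice this is what `bedtools complement` computes from a target BED and a genome file):

```python
def complement_intervals(targets, chrom_size):
    """Return intervals of a chromosome NOT covered by any target.

    Assumes half-open BED-style coordinates and that `targets` is
    sorted and non-overlapping (e.g. already merged).
    """
    complement = []
    pos = 0
    for start, end in targets:
        if start > pos:
            # Gap between the previous target (or chromosome start) and this one
            complement.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_size:
        # Tail of the chromosome after the last target
        complement.append((pos, chrom_size))
    return complement

# Toy example: two targets on a 1000 bp "chromosome"
print(complement_intervals([(100, 200), (500, 600)], 1000))
# [(0, 100), (200, 500), (600, 1000)]
```

Variant calling restricted to these complement intervals would then cover exactly the off-target space, and GATK's own filters would decide how much of the low-coverage territory is worth emitting calls for.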
-
This pipeline began as a way to streamline a workflow for extracting and formatting per-base read depth from a targeted sequencing experiment, with the resulting file used downstream to plot various coverage statistics. Over time, I (and others, shout-out to @bethneilsen) have added additional features and, arguably, entirely separate workflows. These are all very useful things and still fit squarely into the context of coverage-based targeted sequencing analyses. But the time has come to begin integrating this pipeline with other lab infrastructure (i.e., metapipeline), and due to its increased complexity, I need some outside input on what the main focus of this pipeline should be. This will inform how various aspects of the pipeline should be parameterized. For example, it is not entirely clear which outputs should be considered defaults, which should be considered QC, which should be optional, and which should be part of a different pipeline entirely.
I have assembled a flow chart of all the current functionality (excluding a few nitty-gritty input handling details). I perceive the pipeline to be split between two categories of tasks: target-focused processes and off-target-focused processes. The target-focused workflow consists of the original per-base read depth calculation within targeted regions, plus the addition of the Picard CollectHsMetrics calculation. The off-target workflow consists of calculating coverage at all dbSNP loci specifically, then outputting variations of that information (filtered in different ways) that may be useful in different scenarios.

One of these outputs is designed to be used as an input to pipeline-call-gSNP, so for now I have marked this file as the primary output of the pipeline and classified the rest as QC features, with the implication that many of them could be treated as optional and should only be output if specified by the user. To clarify, this primary output is a BED coordinate file consisting of the intervals provided in the original Target File input, merged with intervals that represent off-target known polymorphic sites (from dbSNP) enriched in coverage. This file can then be used as an input to call-gSNP to direct variant calling to the targeted and enriched off-target regions of a targeted sequencing experiment (WXS is under this umbrella).
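For anyone skimming, the merging step behind that primary output BED amounts to the following sketch (function name and toy coordinates are hypothetical; the real pipeline presumably does this with BED tooling rather than hand-rolled Python):

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Assumes half-open BED-style coordinates; input need not be sorted.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or abuts) the previous interval: extend it
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Combine original target intervals with coverage-enriched
# off-target dbSNP sites into one interval list for call-gSNP.
targets = [("chr1", 100, 200), ("chr1", 500, 600)]
enriched_offtarget = [("chr1", 150, 250), ("chr2", 10, 20)]
print(merge_intervals(targets + enriched_offtarget))
# [('chr1', 100, 250), ('chr1', 500, 600), ('chr2', 10, 20)]
```

The point of the merge is that call-gSNP then receives a single non-redundant set of intervals, rather than targets and enriched off-target sites as two overlapping lists.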
There have also been discussions of using this pipeline as the basis for the sample-level QC pipeline (pipeline-SQC-DNA). In that case, I could see all the output classifications being flipped, with QC files being treated as primary outputs instead.
@yashpatel6 @tyamaguchi-ucla @sorelfitzgibbon @bethneilsen @pboutros I would appreciate any suggestions you may have on what direction to take with organizing the outputs of this pipeline.