Pipeline purpose and official outputs. #22
-
@alkaZeltser do you have any numbers (or ball-park estimates) comparing the time and resources of running this off-target pipeline vs. the extra time to just run call-gSNP on the entire genome?

@alkaZeltser @yashpatel6 @tyamaguchi-ucla @bethneilsen I would appreciate a meeting to discuss this and the SQC-DNA and CQC-DNA pipelines. I'm supposed to work on these, but am unclear what to start on and where I can help.
-
Interesting. I guess that makes sense: the landscape of errors (and maybe true variants) is different outside of genes and with more variable coverage, so mixing the two decreases the predictive power. I hope you don't mind me continuing to spew out thoughts that you've probably already considered: how about just calling variants separately on the entire complement of the targeted regions? It seems to me that GATK would quickly skip over the lower-coverage and lower-quality regions (and this could be tweaked for stringency).
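To make the suggestion concrete, here is a minimal sketch of what "the complement of the targeted regions" means in coordinate terms (the function name and the toy chromosome size are illustrative assumptions, not part of the pipeline; in practice this is what `bedtools complement` computes from a target BED and a genome file):

```python
def complement_intervals(targets, chrom_size):
    """Return intervals of a chromosome NOT covered by any target.

    Assumes half-open BED-style coordinates and that `targets` is
    sorted and non-overlapping (e.g. already merged).
    """
    complement = []
    pos = 0
    for start, end in targets:
        if start > pos:
            # Gap between the previous target (or chromosome start) and this one
            complement.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_size:
        # Tail of the chromosome after the last target
        complement.append((pos, chrom_size))
    return complement

# Toy example: two targets on a 1000 bp "chromosome"
print(complement_intervals([(100, 200), (500, 600)], 1000))
# [(0, 100), (200, 500), (600, 1000)]
```

Variant calling restricted to these complement intervals would then cover exactly the off-target space, and GATK's own filters would decide how much of the low-coverage territory is worth emitting calls for.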
-
This pipeline began as a way to streamline a workflow for extracting and formatting per-base read depth from a targeted sequencing experiment, with the resulting file used downstream to plot various coverage statistics. Over time, I (and others, shout-out to @bethneilsen) have added additional features and, arguably, entirely separate workflows. These are all very useful things and still fit squarely into the context of coverage-based targeted sequencing analyses. But the time has come to begin integrating this pipeline with other lab infrastructure (i.e., metapipeline), and due to its increased complexity, I need some outside input on what the main focus of this pipeline should be. This will inform how various aspects of the pipeline should be parameterized. For example, it is not entirely clear which outputs should be considered defaults, which should be considered QC, which should be optional, and which should be part of a different pipeline entirely.
I have assembled a flow chart of all the current functionality (excluding a few nitty-gritty input handling details). I perceive the pipeline to be split between two categories of tasks: target-focused processes and off-target-focused processes. The target-focused workflow consists of the original per-base read depth calculation within targeted regions, plus the addition of the Picard CollectHsMetrics calculation. The off-target workflow consists of calculating coverage at all dbSNP loci specifically, then outputting variations of that information (filtered in different ways) that may be useful in different scenarios.

One of these outputs is designed to be used as an input to pipeline-call-gSNP, so for now I have marked this file as the primary output of the pipeline and classified the rest as QC features, with the implication that many of them could be treated as optional and should only be output if specified by the user. To clarify, this primary output is a BED coordinate file consisting of the intervals provided in the original Target File input, merged with intervals that represent off-target known polymorphic sites (from dbSNP) enriched in coverage. This file can then be used as an input to call-gSNP to direct variant calling to the targeted and enriched off-target regions of a targeted sequencing experiment (WXS is under this umbrella).
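For anyone skimming, the merging step behind that primary output BED amounts to the following sketch (function name and toy coordinates are hypothetical; the real pipeline presumably does this with BED tooling rather than hand-rolled Python):

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Assumes half-open BED-style coordinates; input need not be sorted.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or abuts) the previous interval: extend it
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Combine original target intervals with coverage-enriched
# off-target dbSNP sites into one interval list for call-gSNP.
targets = [("chr1", 100, 200), ("chr1", 500, 600)]
enriched_offtarget = [("chr1", 150, 250), ("chr2", 10, 20)]
print(merge_intervals(targets + enriched_offtarget))
# [('chr1', 100, 250), ('chr1', 500, 600), ('chr2', 10, 20)]
```

The point of the merge is that call-gSNP then receives a single non-redundant set of intervals, rather than targets and enriched off-target sites as two overlapping lists.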
There have also been discussions of using this pipeline as the basis for the sample-level QC pipeline (pipeline-SQC-DNA). In that case, I could see all the output classifications being flipped, with QC files being treated as primary outputs instead.
@yashpatel6 @tyamaguchi-ucla @sorelfitzgibbon @bethneilsen @pboutros I would appreciate any suggestions you may have on what direction to take with organizing the outputs of this pipeline.