Skip to content

Commit

Permalink
Update to version 0.1.2
Browse files Browse the repository at this point in the history
  • Loading branch information
tmokveld committed Feb 14, 2024
1 parent f5e162f commit 76b43c7
Show file tree
Hide file tree
Showing 24 changed files with 1,872 additions and 452 deletions.
22 changes: 17 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

<h1 align="center">TRGT-denovo</h1>

<h3 align="center">Calling <em>de novo</em> tandem repeats in trios</h3>
<h3 align="center">Calling <em>de novo</em> tandem repeat mutations in trios</h3>

***

Expand All @@ -18,12 +18,13 @@ Please note that TRGT-denovo is still in early development and is subject to sig

* [Latest release with binary](https://github.com/PacificBiosciences/trgt-denovo/releases/latest)

To build TRGT-denovo you need a working C compiler. It was tested on Linux with Clang 13.0.0 & GCC 11.3.0 and on Mac OSX (M1) with Clang 15.0.7 & GCC 14.0.0.
To build TRGT-denovo you need a working C compiler: it was tested on Linux with Clang 13.0.0 & GCC 11.3.0 and on Mac OSX (M1) with Clang 15.0.7 & GCC 14.0.0.

## Documentation

* [Example run](docs/example.md)
* [Output](docs/output.md)
* [Getting started](docs/example.md)
* [Output format](docs/output.md)
* [Interpretation](docs/interpretation.md)
* [Command-line interface](docs/cli.md)

## Need help?
Expand All @@ -38,5 +39,16 @@ As TRGT-denovo is not covered by any service level agreement or the like, please
Please report all issues through GitHub instead.
We make no warranty that any such issue will be addressed, to any extent or within any time frame.

## Changelog

- 0.1.2
- With the recent changes to TRGT, within-sample partitioning is now alignment-free in TRGT-denovo
- Homozygous alleles are no longer collapsed: de novo evidence will now always gathered and be specific to a single allele only

- 0.1.1
- Add cli parameter to set aligner penalties
- Document the codebase
- Update documentation to include interpretation of generated outpput

### DISCLAIMER
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
3 changes: 2 additions & 1 deletion docs/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,5 @@ Basic:
Advanced:
- `--flank-len <FLANK_LEN>` Number of flanking nucleotides added to a target region during realignment, default = 50
- `--no-clip-aln` Score alignments without stripping the flanks
- `--parental-quantile <QUANTILE>` Quantile of alignment scores to determine the parental threshold, default is strict and takes only the top scoring alignment, default = 1.0
- `--parental-quantile <QUANTILE>` Quantile of alignment scores to determine the parental threshold, default is strict and takes only the top scoring alignment, default = 1.0
- `--aln-scoring` Scoring function for 2-piece gap affine alignment (non-negative values): mismatch,gap_opening1,gap_extension1,gap_opening2,gap_extension2, default = "8,4,2,24,1", see [here](https://github.com/smarco/WFA2-lib/) for more details on parametrization
52 changes: 15 additions & 37 deletions docs/example.md
Original file line number Diff line number Diff line change
@@ -1,26 +1,30 @@
# Introduction

A brief overview of the steps necessary to perform *de novo* tandem repeat mutation calling from TRGT output using TRGT-denovo
A brief overview of the steps needed to call *de novo* tandem repeat mutations using TRGT-denovo given TRGT output.

## Prerequisites

- [TRGT binary](https://github.com/PacificBiosciences/trgt/releases/latest)
- [TRGT-denovo binary](https://github.com/PacificBiosciences/trgt-denovo/releases/latest)
- Aligned HiFi data of a family trio (father, mother, child)
- The reference genome used for the alignments
- A BED file with repeat expansion definitions (always same or a subset of those with TRGT)
- Repeat definition files are available in [this Zenodo repository](https://zenodo.org/record/8329210)
and definitions of known pathogenic repeats are [also available here](https://github.com/PacificBiosciences/trgt/tree/main/repeats/) (always the same sites or a subset of those used with TRGT)
- Aligned HiFi data of a family trio (father, mother, and child)
- The reference genome used for read alignment

## Calling *de novo* tandem repeats



## Calling *de novo* tandem repeat mutations

Given the following data:

- reference genome `reference.fasta`
- aligned sequencing data of the family (father, mother, and son respectively) `sample_F.bam`, `sample_M.bam`, `sample_S.bam`,
- repeat definition file `repeat.bed`.
- Reference genome `reference.fasta`
- Repeat definition file `repeat.bed`.
- Aligned sequencing data of the family (father, mother, and son respectively) `sample_F.bam`, `sample_M.bam`, and `sample_S.bam`.

### Data pre-processing

Prior to *de novo* calling all data must first be genotyped by TRGT:
All data must first be genotyped by TRGT:

```
./trgt --genome reference.fasta \
Expand Down Expand Up @@ -64,7 +68,7 @@ Such that you end up with `sample_F.sorted.vcf.gz`, `sample_F.spanning.sorted.ba

### Running TRGT-denovo

With all preprocessing completed it we can call *de novo* repeat expansion mutations using TRGT-denovo from the sample data. Note that family members are supplied by their common prefix of `spanning.sorted.bam` and `sorted.vcf.gz`, i.e., `sample_F`, `sample_M`, and `sample_S` and path if not running TRGT-denovo in the same directory as the data:
With all preprocessing completed, we can call *de novo* repeat expansion mutations using TRGT-denovo from the sample data. Note that family members are supplied by their common prefix of `spanning.sorted.bam` and `sorted.vcf.gz`, i.e., `sample_F`, `sample_M`, and `sample_S` and path if not running TRGT-denovo in the same directory as the data:

```
./TRGT-denovo trio --reference reference.fasta \
Expand All @@ -75,30 +79,4 @@ With all preprocessing completed it we can call *de novo* repeat expansion mutat
--out out.csv
```

## Example

Below output of HG002 is shown for two candidate *de novo* tandem repeat mutation sites:
```
trid genotype denovo_coverage allele_coverage allele_ratio child_coverage child_ratio mean_diff_father mean_diff_mother father_dropout_prob mother_dropout_prob allele_origin denovo_status per_allele_reads_father per_allele_reads_mother per_allele_reads_child index father_motif_counts mother_motif_counts child_motif_counts maxlh
chr1_47268728_47268830_ATAA 1 19 37 0.5135 37 0.5135 6.7368 6.7368 0.0000 0.0000 M:1 Y:= 43 26 37 0 25 25 25 0.7152
chr1_7862944_7863157_TATTG 1 0 21 0.0000 37 0.0000 0.0000 19.2000 0.0000 0.0000 F:2 X 18,17 16,19 21,16 0 27,29 27,63 29,60 0.9820
chr1_7862944_7863157_TATTG 2 16 16 1.0000 37 0.4324 171.8750 22.8750 0.0000 0.0000 M:2 Y:- 18,17 16,19 21,16 1 27,29 27,63 29,60 1.0000
```

### Site 1

The first site is homozygous in the child, hence only one call is made. It has a *de novo* coverage (DNC) of 19, i.e., there are 19 reads that support a candidate *de novo* allele relative to the parental read alignments. The DNC should always be put into context of the total coverage, to ascertain that:

1. There is sufficient coverage
2. The ratio between the two is close to 0.5

The DNC at this site is high and the ratio makes it likely that this is a confident call. However, the score difference with respect to either parents is low, such that the expected event size is small. Additionally, the score difference is equivalent in both parents such that parental origin may not be assessed. Generally the parent with the smallest score difference is the one from which is inherited:

![Example site 1](figures/example_site_1.png)

### Site 2

The second site is heterozygous in the child, hence both child alleles are tested. The second allele is a potential *de novo* call. The score difference with respect to the maternal alleles is the smallest.

![Example site 2](figures/example_site_2.png)

For further interpretation of TRGT-denovo output see [here](interpretation.md)
Binary file added docs/figures/dropout.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/figures/example_site_1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/figures/example_site_2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/figures/imbalance.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/figures/match_mix_dist.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
75 changes: 75 additions & 0 deletions docs/interpretation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# Interpreting TRGT-denovo output

TRGT-denovo records all callable repeat sites into a tab-separated file. Sites that could not be called (often due to issues such as missing genotyping in one of the family members) are logged.

While it is simpler to report only a subset of the repeat definitions that actually show *de novo* signal, reporting all sites allows for more in-depth interpretation and quality control. Certain data characteristics can inflate the number of *de novo* candidates, which often turn out to be false positives. For example, when a sample, particularly from one of the parents, has significantly lower coverage than the other family members, this significantly increases the likelihood of allele dropout. In this context, allele dropout refers to scenarios where one of the alleles has minimal or no sequencing reads:

<img style="display: block; margin: auto;"
src="figures/dropout.png">
<p style="text-align: center;">
Low or absent coverage complicates the accurate identification of de novo mutations.
</p>

In cases of dropout, there's a risk of misinterpreting a heterozygous allele as homozygous, potentially leading to erroneous categorization of a child's allele as *de novo*. Therefore, reporting all sites in the repeat catalog provides insights into the distribution of potential *de novo* calls. For instance, analyzing the *de novo* coverage metric (the number of child reads differing from parental data), and its distribution offers useful insights:

<img style="display: block; margin: auto;"
src="figures/match_mix_dist.png">
<p style="text-align: center;">
Distributions of the TRGT-denovo de novo coverage metric in a matched and permuted trio.
</p>

The figure above shows the *de novo* coverage metric distributions in both matched and permuted trios. A very small number of *de novo* loci is expected. The *de novo* coverage is a useful measure in assessing confidence, factoring in local coverage. As seen in the permuted trio, a substantial number of putative *de novo* mutations are found, aligning with expectations. Each site with increased *de novo* coverage suggests significant deviation in the child's data from the parental, indicating a potential *de novo* mutation, or in the case of the permuted sample, a possible sample mix-up. In matched trios, this metric is equally useful, clearly demonstrating the impact of coverage imbalance:

<img style="display: block; margin: auto;"
src="figures/imbalance.png">
<p style="text-align: center;">
Distributions of the TRGT-denovo de novo coverage metric in coverage balanced and imbalanced trios.
</p>

The distributions of the TRGT-denovo *de novo* coverage metric in coverage-balanced and imbalanced trios highlight an important point: as coverage decreases, the probability of insufficient allele coverage increases, leading to an increase in false positive *de novo* calls. These distributions are helpful in establishing thresholds to filter out false positives and refine the list of candidate mutations.

## HG002 example

Below TRGT-denovo output of HG002 is shown for two candidate *de novo* tandem repeat mutation sites:
```
trid genotype denovo_coverage allele_coverage allele_ratio child_coverage child_ratio mean_diff_father mean_diff_mother father_dropout_prob mother_dropout_prob allele_origin denovo_status per_allele_reads_father per_allele_reads_mother per_allele_reads_child index father_motif_counts mother_motif_counts child_motif_counts
chr1_47268728_47268830_ATAA 1 19 37 0.5135 37 0.5135 6.7368 6.7368 0.0000 0.0000 M:1 Y:= 43 26 37 0 25 25 25
chr1_7862944_7863157_TATTG 1 0 21 0.0000 37 0.0000 0.0000 19.2000 0.0000 0.0000 F:2 X 18,17 16,19 21,16 0 27,29 27,63 29,60
chr1_7862944_7863157_TATTG 2 16 16 1.0000 37 0.4324 171.8750 22.8750 0.0000 0.0000 M:2 Y:- 18,17 16,19 21,16 1 27,29 27,63 29,60
```

### Site 1

The first site is homozygous in the child, hence only one call is made:

```
chr1_47268728_47268830_ATAA 1 19 37 0.5135 37 0.5135 6.7368 6.7368 0.0000 0.0000 M:1 Y:= 43 26 37 0 25 25 25
```

It has a *de novo* coverage of 19, i.e., there are 19 reads that support a candidate *de novo* allele relative to the parental read alignments. The *de novo* coverage should always be put into context of the total coverage, to ascertain that:

1. There is sufficient coverage
2. The ratio between the two is close to 0.5

The *de novo* coverage at this site is high and the ratio makes it likely that this is a confident call. However, the score difference with respect to either parents is low, such that the expected event size is small. Additionally, the score difference is equivalent in both parents such that parental origin may not be assessed. Generally the parent with the smallest score difference is the one from which is inherited:

<img style="display: block; margin: auto;"
src="figures/example_site_1.png">
<p style="text-align: center;">
</p>

### Site 2

The second site is heterozygous in the child, hence both child alleles are tested:

```
chr1_7862944_7863157_TATTG 1 0 21 0.0000 37 0.0000 0.0000 19.2000 0.0000 0.0000 F:2 X 18,17 16,19 21,16 0 27,29 27,63 29,60
chr1_7862944_7863157_TATTG 2 16 16 1.0000 37 0.4324 171.8750 22.8750 0.0000 0.0000 M:2 Y:- 18,17 16,19 21,16 1 27,29 27,63 29,60
```

The second allele is a potential *de novo* call (note that the long allele in TRGT is always the second allele). The score difference with respect to the maternal alleles is the smallest.

<img style="display: block; margin: auto;"
src="figures/example_site_2.png">
<p style="text-align: center;">
</p>
6 changes: 3 additions & 3 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,10 @@ TRGT-denovo scores and then outputs target sites in a tab-separated format, with
- `per_allele_reads_father` Number of reads partitioned per allele in the father (allele1, allele2)
- `per_allele_reads_mother` Number of reads partitioned per allele in the mother (allele1, allele2)
- `per_allele_reads_child` Number of reads partitioned per allele in the child (allele1, allele2)
- `father_dropout` Coverage cut-off dropout detection in father; possible values: Full dropout (`FD`), Haplotype dropout (`HD`), Not (`N`)
- `mother_dropout` Coverage cut-off dropout detection in mother; possible values: Full dropout (`FD`), Haplotype dropout (`HD`), Not (`N`)
- `child_dropout` Coverage cut-off dropout detection in child; possible values: Full dropout (`FD`), Haplotype dropout (`HD`), Not (`N`)
- `index` Index of this allele in the TRGT VCF, used for linking to `child_motif_counts`
- `father_motif_counts` TRGT VCF motif counts for this locus in the father
- `mother_motif_counts` TRGT VCF motif counts for this locus in the mother
- `child_motif_counts` TRGT VCF motif counts for this locus in the child
- `maxlh` Maximum likelihood score of child allele data relative to parental alleles (unstable)

TRGT-denovo does calling based on the genotyping and spanning reads generated by TRGT and will test and output at most two alleles in the event of a heterozygous site.
Loading

0 comments on commit 76b43c7

Please sign in to comment.