bump readme with new elife link, AMSD instead of IHD
tomsasani committed Jan 4, 2024
1 parent 36fffe4 commit 1193aeb
Showing 1 changed file with 17 additions and 17 deletions.
34 changes: 17 additions & 17 deletions README.md
@@ -1,4 +1,4 @@
# Epistasis between mutator alleles contributes to germline mutation rate variability in laboratory mice
# Epistasis between mutator alleles contributes to germline mutation spectrum variability in laboratory mice

[![docs](https://img.shields.io/badge/docs-latest-blue.svg)](https://quinlan-lab.github.io/proj-mutator-mapping/reference/)
![pytest](https://github.com/quinlan-lab/proj-mutator-mapping/actions/workflows/tests.yaml/badge.svg)
@@ -8,15 +8,15 @@

This repository includes:

1. Python code underlying the inter-haplotype distance (IHD) method described in our [latest manuscript](https://www.biorxiv.org/content/10.1101/2023.04.25.537217v1).
1. Python code underlying the aggregate mutation spectrum distance (AMSD) method described in our [latest manuscript](https://elifesciences.org/reviewed-preprints/89096).

2. A [`snakemake`](https://snakemake.readthedocs.io/en/stable/index.html) pipeline that can be used to reproduce all figures and analyses from the manuscript in a single command.

### Overview of inter-haplotype distance (IHD) method
### Overview of aggregate mutation spectrum distance (AMSD) method

![](img/fig-distance-method.png)

> **Overview of inter-haplotype distance method.**
> **Overview of aggregate mutation spectrum distance method.**
> **a)** A population of four haplotypes has been genotyped at three informative markers ($g_1$ through $g_3$); each haplotype also harbors unique *de novo* germline mutations.
In practice, *de novo* mutations are partitioned by $k$-mer context; for simplicity in this toy example, *de novo* mutations are simply classified into two possible mutation types (grey squares represent C>(A/T/G) mutations, while grey triangles represent A>(C/T/G) mutations). **b)** At each informative marker $g_n$, we calculate the total number of each mutation type observed on haplotypes that carry either parental allele (i.e., the aggregate mutation spectrum) using all genome-wide *de novo* mutations. For example, haplotypes with *A* (orange) genotypes at $g_1$ carry a total of three "triangle" mutations and five "square" mutations, and haplotypes with *B* (green) genotypes carry a total of six triangle and two square mutations. We then calculate the cosine distance between the two aggregate mutation spectra, which we call the "inter-haplotype distance." Cosine distance can be defined as $1 - \cos(\theta)$, where $\theta$ is the angle between two vectors; in this case, the two vectors are the two aggregate spectra. We repeat this process for every informative marker $g_n$. **c)** To assess the significance of any distance peaks in b), we perform permutation tests. In each of $N$ permutations, we shuffle the haplotype labels associated with the *de novo* mutation data, run a genome-wide distance scan, and record the maximum cosine distance encountered at any locus in the scan. Finally, we calculate the $1 - p$ percentile of the distribution of those maximum distances to obtain a genome-wide cosine distance threshold at the specified value of $p$.
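
The distance calculation in panel **b)** can be sketched in a few lines of Python. This is a minimal illustration, not the code in `ihd/utils.py`: the function names are hypothetical, and the per-haplotype counts are made up so that they sum to the aggregate totals given in the caption (three triangle and five square mutations on *A* haplotypes, six triangle and two square mutations on *B* haplotypes).

```python
import math


def aggregate_spectrum(spectra, genotypes, allele):
    """Sum per-haplotype mutation counts over haplotypes carrying `allele`."""
    total = [0] * len(spectra[0])
    for spectrum, genotype in zip(spectra, genotypes):
        if genotype == allele:
            total = [t + c for t, c in zip(total, spectrum)]
    return total


def cosine_distance(a, b):
    """Return 1 - cos(theta), where theta is the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical (triangle, square) counts for the four haplotypes; only the
# per-allele totals come from the caption.
haplotype_spectra = [[1, 2], [2, 3], [4, 1], [2, 1]]
genotypes_at_g1 = ["A", "A", "B", "B"]

spectrum_a = aggregate_spectrum(haplotype_spectra, genotypes_at_g1, "A")  # [3, 5]
spectrum_b = aggregate_spectrum(haplotype_spectra, genotypes_at_g1, "B")  # [6, 2]

distance = cosine_distance(spectrum_a, spectrum_b)
print(round(distance, 4))  # 0.2407
```

Repeating the last three statements with the genotypes at each marker $g_n$ gives one distance per marker, which is the genome-wide scan described in the caption.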
@@ -54,13 +54,13 @@ If desired, the `-j` parameter can be used to set the number of jobs that shoul

> IMPORTANT: you'll need to have `tabix`, `bcftools`, and `bedtools` in your system path to reproduce the figures. You can get the former two tools [here](http://www.htslib.org/download/), and the latter [here](https://github.com/arq5x/bedtools2/releases).
## Running IHD
## Running AMSD

If you want to use the inter-haplotype distance (IHD) method on your own data, you can follow the instructions below.
If you want to use the aggregate mutation spectrum distance (AMSD) method on your own data, you can follow the instructions below.

### Description of input files

Before running an IHD scan, you'll need to prepare a
Before running an AMSD scan, you'll need to prepare a
small number of input files.

1. ***De novo* germline mutation data**
@@ -96,7 +96,7 @@ small number of input files.
3. **Marker information (optional)**

If you wish to generate Manhattan-esque plots that summarize the results
of an IHD scan, you'll need to provide a final CSV that links marker IDs with
of an AMSD scan, you'll need to provide a final CSV that links marker IDs with
either physical or genetic map positions (or both). This file should contain a column called `marker`, a column called `chromosome`, and a column specifying one or both of `cM` or `Mb`.

| marker | chromosome | cM | Mb |
@@ -124,13 +124,13 @@ small number of input files.
**Notes:**
> The `genotypes` dictionary should map the observed genotypes in file #2 to integer values that will be used during the IHD scan.
> The `genotypes` dictionary should map the observed genotypes in file #2 to integer values that will be used during the AMSD scan.
> The two parental alleles *must* be mapped to values of 0 and 2, respectively. Heterozygous and unknown genotypes *must* be mapped to values of 1.
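
The encoding rule in the notes above can be illustrated with a short Python sketch. The genotype labels (`A`, `B`, `H`, `U`) and both helper functions are hypothetical and are not taken from the repository; only the required integer values (0 and 2 for the two parental alleles, 1 for heterozygous and unknown genotypes) come from the notes.

```python
# Hypothetical `genotypes` dictionary: "A" and "B" stand in for the two
# parental alleles, "H" for heterozygous calls, "U" for unknown calls.
genotype_codes = {"A": 0, "B": 2, "H": 1, "U": 1}


def validate_genotype_codes(codes):
    """Check that the mapping follows the 0 / 1 / 2 rule described above."""
    values = list(codes.values())
    if any(v not in (0, 1, 2) for v in values):
        raise ValueError("genotypes must map to 0, 1, or 2")
    if values.count(0) != 1 or values.count(2) != 1:
        raise ValueError("exactly one parental allele must map to 0 and one to 2")


def encode_genotypes(observed, codes):
    """Translate observed genotype labels into their integer codes."""
    return [codes[g] for g in observed]


validate_genotype_codes(genotype_codes)
encoded = encode_genotypes(["A", "B", "H", "U", "B"], genotype_codes)
print(encoded)  # [0, 2, 1, 1, 2]
```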
### Running an inter-haplotype distance scan
### Running an aggregate mutation spectrum distance scan
A single IHD scan can be performed as follows:
A single AMSD scan can be performed as follows:
```
python scripts/run_ihd_scan.py \
@@ -143,13 +143,13 @@ There are a small number of optional arguments:
* `-k` sets the kmer size to use for the mutation types (k = 1 will compute distances between aggregate 1-mer mutation spectra, k = 3 will compute distances between aggregate 3-mer mutation spectra). Default value is 1.
* `-permutations` sets the number of permutations to use when calculating significance thresholds for the IHD scan. Default value is 1,000.
* `-permutations` sets the number of permutations to use when calculating significance thresholds for the AMSD scan. Default value is 1,000.
* `-distance_method` specifies the distance method to use when comparing aggregate mutation spectra. By default, the method is cosine distance (`-distance_method cosine`), but can also be a chi-square statistic (`distance_method chisquare`).
* `-threads` specifies the number of threads to use during the permutation testing step. IHD used `numba` for multi-threading. Default value is 1.
* `-threads` specifies the number of threads to use during the permutation testing step. AMSD used `numba` for multi-threading. Default value is 1.
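
To make the role of `-permutations` concrete, here is a rough pure-Python sketch of the permutation procedure the README describes: shuffle the haplotype labels attached to the mutation data, rerun the scan, keep the maximum distance, and take the $1 - p$ percentile of those maxima. It is not the repository's `numba` implementation. All function names, the toy data, the choice to skip genotypes coded as 1, and the simple rank-based percentile are assumptions made for illustration.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cos(theta) between two aggregate mutation spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def aggregate(spectra, genotypes, allele):
    """Sum spectra over haplotypes whose genotype code equals `allele`."""
    total = [0] * len(spectra[0])
    for spectrum, genotype in zip(spectra, genotypes):
        if genotype == allele:  # codes of 1 (het/unknown) are skipped here
            total = [t + c for t, c in zip(total, spectrum)]
    return total


def scan(spectra, markers):
    """Compute one distance per marker; `markers` holds per-marker genotype codes."""
    return [
        cosine_distance(aggregate(spectra, g, 0), aggregate(spectra, g, 2))
        for g in markers
    ]


def permutation_threshold(spectra, markers, n_permutations=1000, p=0.05, seed=0):
    """Return the (1 - p) percentile of the per-permutation maximum distances."""
    rng = random.Random(seed)
    shuffled = list(spectra)
    max_distances = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between haplotypes and mutations
        max_distances.append(max(scan(shuffled, markers)))
    max_distances.sort()
    # Simple rank-based percentile, with no interpolation (an assumption).
    rank = math.ceil((1 - p) * n_permutations) - 1
    return max_distances[max(0, min(rank, n_permutations - 1))]


# Toy data: four haplotypes, two mutation types, two markers coded 0 or 2.
spectra = [[1, 2], [2, 3], [4, 1], [2, 1]]
markers = [[0, 0, 2, 2], [0, 2, 0, 2]]

observed = scan(spectra, markers)
threshold = permutation_threshold(spectra, markers, n_permutations=200, p=0.05)
significant = [d > threshold for d in observed]
```

Markers whose observed distance exceeds `threshold` would be flagged at the chosen value of $p$. With only four haplotypes the toy example is far too small to be meaningful; it only shows the order of operations.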
### Plotting the results of an IHD scan
### Plotting the results of an AMSD scan
```
python scripts/plot_ihd_results.py \
@@ -176,9 +176,9 @@ These tests are run automatically via GitHub actions (for Python versions 3.8, 3
## Project layout
ihd/ # code for running the IHD method
ihd/ # code for running the AMSD method
utils.py # bulk of utility functions
run_ihd_scan.py # wrapper that calls utilities for computing IHD
run_ihd_scan.py # wrapper that calls utilities for computing AMSD
plot_ihd_results.py # script for plotting Manhattan-esque results
schema.py # pandera schema used to validate input/output dataframes
run_ihd_power_simulations.py # code for running power simulations
@@ -191,7 +191,7 @@ These tests are run automatically via GitHub actions (for Python versions 3.8, 3
data/
genotypes/ # contains formatted `.geno` files for the BXDs that contain sample genotypes at every tested marker
json/ # contains JSON configuration files for IHD scans using the BXDs
json/ # contains JSON configuration files for AMSD scans using the BXDs
mutations/ # contains per-sample *de novo* mutation data in the BXDs
exclude/ # contains an mm10 file containing problematic regions of the genome to avoid
Rqtl_data/ # contains Rqtl data for QTL scans using BXD data