Release 2
Full set of updates for round 2 of the Neale Lab GWAS (squashed commit here, see pull #3 for full commit history from branch `v2`).
rkwalters authored Oct 19, 2019
1 parent c01d8c5 commit 2949f97
Showing 4,613 changed files with 1,329,363 additions and 329,005 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[submodule "UKBB_ldsc_scripts"]
path = UKBB_ldsc_scripts
url = [email protected]:Nealelab/UKBB_ldsc_scripts.git
168 changes: 25 additions & 143 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,154 +1,36 @@
# UKBB ldsc scripts
# UKBB ldsc site

Scripts used for running LD score regression ([LDSC](https://github.com/bulik/ldsc)) to estimate SNP-heritability in UK Biobank. See the [UK Biobank GWAS announcement](http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank), the [heritability analysis post](http://www.nealelab.is/blog/2017/9/15/heritability-of-2000-traits-and-disorders-in-the-uk-biobank), and the [results site](https://nealelab.github.io/UKBB_ldsc/).
Scripts used for generating the [results site](https://nealelab.github.io/UKBB_ldsc/) for LD score regression ([LDSC](https://github.com/bulik/ldsc)) analysis of the [UK Biobank GWAS](https://github.com/Nealelab/UK_Biobank_GWAS) produced by the [Neale Lab](http://www.nealelab.is/uk-biobank). You can read more about that GWAS effort [here](http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank) and read the [heritability analysis post](http://www.nealelab.is/blog/2017/9/15/heritability-of-2000-traits-and-disorders-in-the-uk-biobank) for our initial thoughts on these results.

These scripts are mostly intended for documentation purposes, and may take some work to get running or adapt to other applications. All of the scripts described below are in the `./scripts/` folder. You can also find the code used to generate the [results site](https://nealelab.github.io/UKBB_ldsc/) in the `./site/` folder.
Contents of the site include:

# Table of Contents
* [Results browser](https://nealelab.github.io/UKBB_ldsc/h2_browser.html)
* [Interactive plots](https://nealelab.github.io/UKBB_ldsc/viz_h2.html)
* [Extended methods descriptions](https://nealelab.github.io/UKBB_ldsc/confidence.html)
* [Detailed results pages per phenotype](https://nealelab.github.io/UKBB_ldsc/h2_summary_50_irnt.html)
* [Downloads page](https://nealelab.github.io/UKBB_ldsc/downloads.html)

* [Usage](#usage)
* [Requirements](#requirements)
* [Setup (create\_hm3\_keytable\.py)](#setup-create_hm3_keytablepy)
* [Generated reference files](#generated-reference-files)
* [Munged sumstats files](#munged-sumstats-files)
* [Running ldsc (ldsc\_h2\_parallel\_batch\.py)](#running-ldsc-ldsc_h2_parallel_batchpy)
* [Implementation Note](#implementation-note)
* [Settings](#settings)
* [Submission script (ldsc\_h2\_parallel\_batch\_submit\.sh)](#submission-script-ldsc_h2_parallel_batch_submitsh)
* [Aggregating results (agg\_ldsc\.sh)](#aggregate-results-agg_ldscsh)
* [Results](#results)
Scripts used for the LD Score regression analysis itself are in the [UKBB_ldsc_scripts](https://github.com/Nealelab/UKBB_ldsc_scripts) dependency. They've been separated to make it easier for you to pull/fork/otherwise use those scripts without ending up with a download of the whole results site.

TOC created with [gh-md-toc](https://github.com/ekalinin/github-markdown-toc.go)
## Version History

* **September 2017:** first release
* **October 2019:** round2 release

# Usage
## Main contents of this repo

## Requirements
* `./site_build/build_site.R`: main driver script to build the static website using [R Markdown](https://rmarkdown.rstudio.com/). Copies the necessary source files into this directory and updates `_site.yml` to allow a clean build. Includes options to only generate a subset of the site for easier testing.
* `./docs/`: The full contents of the [results site](https://nealelab.github.io/UKBB_ldsc/) hosted on Github Pages.
* `./UKBB_ldsc_scripts/`: code for the LDSC analysis, including the Rmd files for results processing that are included in the methods section of this site. See the [UKBB_ldsc_scripts repo](https://github.com/Nealelab/UKBB_ldsc_scripts) for a full description.
* `./site_source/`: source code for building the site, including:
* `./rmd_source/`: R markdown source for each page of the site. `_h2_part_template.html` is used to generate each of the individual phenotype pages, adapting to the results for that phenotype.
* `./yml_source/`: yaml code defining the structure of the site and the parameters passed when rendering each page. Split into pieces to allow test-rendering subsets of the site. The full site yaml is stitched together by `./site_build/build_site.R` at run time.
* `analytics_header.html`: page header code to count hits and downloads
* `./sandflat` and `dt_styles.css`: custom CSS based on [Bootswatch themes](https://bootswatch.com/3/) and styling options for [Datatables](https://datatables.net/)
* `make_favicon.R` and `./icon/`: favicon images for browser tabs on various platforms

Broadly, the core scripts here depend on:
## Credits

* [Google Cloud Platform](https://cloud.google.com/), for storage and compute
* [cloudtools](https://github.com/Nealelab/cloudtools), for interaction with google dataproc
* [Hail](https://hail.is/), for cloud compute support
* [MTAG](https://github.com/omeed-maghzian/mtag), for its modified implementation of [LDSC](https://github.com/bulik/ldsc)

See the respective scripts for more info.

## Setup (`create_hm3_keytable.py`)

Creates a Hail keytable ([docs](https://hail.is/docs/stable/hail.KeyTable.html)) of HapMap3 SNPs passing QC in UKBB that can be used for LD score regression. This takes the place of the filtering normally performed in `munge_sumstats.py` ([see code from ldsc](https://github.com/bulik/ldsc/blob/master/munge_sumstats.py)). Variants in the resulting list are:

* Sites in HapMap release 3, version 3
* Autosomal, biallelic SNPs
* Present in UKBB, based on matching chr:pos:ref:alt
* Passing basic QC in the UKBB GWAS sample, including restriction to HRC sites (see [UKBB_Hail](https://github.com/Nealelab/UKBB_Hail))
* Additionally passing LDSC QC thresholds of: INFO > 0.9, MAF > 0.01
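A rough illustration of applying the last of those filters (a minimal sketch; `info` and `af` are illustrative field names, not the actual keytable schema):

```python
# Hypothetical sketch of the final LDSC thresholds (INFO > 0.9, MAF > 0.01);
# field names are illustrative, not the real Hail keytable schema.
def passes_ldsc_qc(variant):
    """True if the variant clears the INFO and minor allele frequency cuts."""
    maf = min(variant["af"], 1 - variant["af"])
    return variant["info"] > 0.9 and maf > 0.01

variants = [
    {"id": "rs1", "info": 0.99, "af": 0.30},   # passes
    {"id": "rs2", "info": 0.85, "af": 0.30},   # fails INFO
    {"id": "rs3", "info": 0.99, "af": 0.995},  # fails MAF (minor allele at 0.005)
]
kept = [v["id"] for v in variants if passes_ldsc_qc(v)]
```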

The resulting keytable is stored at `gs://ukbb_association/ldsc/ld_ref_panel/hm3.r3.hg19.auto_bi_af.ukbb_gwas_qcpos.kt` and has the following schema:
```
Struct{v_hm3:Variant,HM3_rsid:String,v_ukb:Variant,UKBB_rsid:String}
```
The two variant encodings manage the differences in contig labels between the HM3 VCF and UKBB (i.e. a leading "chr" and a leading 0 on chromosomes 1-9).
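For example, a normalization along these lines reconciles the two labeling schemes (illustrative only; the actual harmonization happens inside the Hail pipeline when the two variant encodings are built):

```python
def normalize_contig(contig):
    """Strip a leading 'chr' and any leading zeros so 'chr01' and '1' match.

    Illustrative sketch only; the real matching is done via the keytable's
    two variant encodings.
    """
    if contig.lower().startswith("chr"):
        contig = contig[3:]
    return contig.lstrip("0") or contig

# 'chr01' (HM3 VCF style) and '1' (UKBB style) now key to the same contig
assert normalize_contig("chr01") == normalize_contig("1")
```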

### Generated reference files

In addition to the final keytable of HM3 variants passing QC to use for LDSC, several interim files are generated that may be of more general interest:


* Cloud copy of the HapMap3 sites VCF
```
HapMap3 vcf (from ftp.broadinstitute.org/bundle/hg19/):
gs://ukbb_association/ldsc/ld_ref_panel/hapmap_3.3_hg19_pop_stratified_af.vcf.gz
```

* HapMap3 sites file in VDS format
```
HapMap3 full vds:
gs://ukbb_association/ldsc/ld_ref_panel/hm3.r3.hg19.vds
```

* Keytable of autosomal, biallelic HapMap3 SNPs, including allele frequencies by population
```
HapMap3 autosomal, biallelic SNPs with freqs keytable:
gs://ukbb_association/ldsc/ld_ref_panel/hm3.r3.hg19.auto_bi_af.kt
```

* Keytable of HapMap3 SNPs passing QC in the full UKBB data (i.e. not filtered to unrelated, European ancestry, etc)
```
HapMap3 SNPs passing UKBB full cohort QC:
gs://ukbb_association/ldsc/ld_ref_panel/hm3.r3.hg19.auto_bi_af.ukbb_full_qcpos.kt
```


### Munged sumstats files

This final QC+ keytable (`hm3.r3.hg19.auto_bi_af.ukbb_gwas_qcpos.kt`) is then used to generate ldsc sumstats files (format described [here](https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)) as part of the GWAS output from UKBB. In the few instances where rsids for the same chr:pos:ref:alt differ between HM3 and UKBB, the output sumstats files use the rsid reported by UKBB.
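For reference, the ldsc sumstats format is a whitespace-delimited table with columns like SNP, A1, A2, N, and Z. A minimal sketch of writing one (the real files come out of the Hail GWAS pipeline; the example row here is made up):

```python
import csv
import io

# Made-up example row; real sumstats are produced by the Hail GWAS pipeline.
rows = [{"SNP": "rs12345", "A1": "A", "A2": "G", "N": 337000, "Z": 1.73}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["SNP", "A1", "A2", "N", "Z"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
sumstats_text = buf.getvalue()  # would be gzipped to <pheno>.sumstats.gz
```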

These files are presently stored in:
```
gs://ukbb-gwas-results/ldsc/
```


## Running ldsc (`ldsc_h2*_parallel_batch.py`)

These scripts run [ldsc](https://github.com/bulik/ldsc) to estimate SNP-heritability for a batch of phenotypes using [MTAG](https://github.com/omeed-maghzian/mtag). We use the version of ldsc provided with MTAG to take advantage of its interface for calling ldsc from within python rather than via the command line.

The script `ldsc_h2_parallel_batch.py` runs univariate LD score regression with the standard default `eur_w_ld_chr` pre-computed European population LD scores. Results are parsed into a gzipped tsv with columns:

```
phenotype, mean_chi2, lambdaGC, intercept, intercept_se, intercept_z, intercept_p, ratio, ratio_se, h2_observed, h2_observed_se, h2_liability, h2_liability_se, h2_z, h2_p
```

The script `ldsc_h2part_parallel_batch.py` similarly runs partitioned LD score regression, using the `baselineLD_v1.1` set of precomputed LD scores from 1000 Genomes Phase 3 European populations, available [here](https://data.broadinstitute.org/alkesgroup/LDSCORE/) and described by [Gazal et al. (2017)](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3954.html). The full set of results for the annotations (e.g. corresponding to the `--print-coefficients` output file) is converted to a flat row of results per phenotype.

Conversion to liability-scale h2 for dichotomous traits is done assuming that the population prevalence is equal to the prevalence in the UKBB GWAS sample.
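This is the standard observed-to-liability transformation (commonly attributed to Lee et al. 2011); with population prevalence K equal to sample prevalence P it reduces to a single multiplicative factor. A sketch:

```python
from statistics import NormalDist

def liability_h2(h2_obs, pop_prev, samp_prev):
    """Convert observed-scale h2 to the liability scale.

    Per the assumption in these scripts, pop_prev is set equal to samp_prev
    (the prevalence in the UKBB GWAS sample).
    """
    K, P = pop_prev, samp_prev
    t = NormalDist().inv_cdf(1 - K)   # liability threshold for prevalence K
    z = NormalDist().pdf(t)           # normal density at the threshold
    return h2_obs * K**2 * (1 - K)**2 / (P * (1 - P) * z**2)
```

With K = P this simplifies to `h2_obs * K * (1 - K) / z**2`, so liability-scale estimates grow as traits become rarer.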


### Implementation Note

This is almost certainly not the ideal way to structure this analysis. Making a human manage the splitting/batching here somewhat defeats the purpose of having flexible cloud compute. We're doing it this way currently for expediency while we investigate better long-term alternatives.

With the current settings in the submission script, running univariate h2 for ~1500 traits takes a little over 10 node hours (~160 CPU hours) split over 10 minimal `n1-highcpu-16` VMs, each running 8 traits in parallel at a time. Partitioned heritability is slower, taking just under 20 hours on 10 `n1-standard-16` VMs running 6 traits at a time for the same traits. Attempts to scale this up to more traits per machine (with 32- or 64-core VMs) have seen poor performance, likely due to being I/O bound for reading reference and sumstats files.


### Settings

Using this script involves both setting job information in the script and passing arguments for parallelization via the command line. The following settings are set by editing the python script (`ldsc_h2_parallel_batch.py` or `ldsc_h2part_parallel_batch.py`):

```
ld_ref_panel = '/home/mtag/mtag-master/ld_ref_panel/eur_w_ld_chr/' # local path
phen_summary = 'gs://ukbb-gwas-results/ukb1859/ukb1859_phenosummary_final.tsv' # in cloud
num_phens = 1564
ss_bucket = 'gs://ukbb-gwas-results/ldsc' # 'gs://ukbb_association/ldsc/sumstats' # bucket with sumstats.gz files
out_bucket = 'gs://ukbb-gwas-results/ldsc_results' # output google bucket location
num_proc = 8 # number of processes to run
```

From the command line, this script expects:
* `--parsplit INT`, the number of parallel batches to split phenotypes into
* `--paridx INT`, the index of the current batch among the `parsplit` batches
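One way the split could work (an assumption for illustration, not necessarily the script's exact indexing) is for batch `paridx` to take every `parsplit`-th phenotype, so the batches partition all `num_phens` phenotypes:

```python
# Hypothetical round-robin assignment of phenotypes to batches; the real
# script's indexing may differ.
def batch_indices(num_phens, parsplit, paridx):
    """0-based phenotype indices handled by batch paridx (1-based)."""
    return list(range(paridx - 1, num_phens, parsplit))
```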

It's assumed that there are `num_phens` phenotypes in the `phen_summary` file and that they all have ldsc sumstats files in the specified bucket. The mapping between the phenotype names and the filenames of the sumstats files is hardcoded in the script (see `ss_name`, currently around line 176).

Each instance of `ldsc_h2_parallel_batch.py` estimates h2 for `num_proc` phenotypes at a time in parallel, continuing until its share of the `num_phens` phenotypes is finished.
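The worker loop can be sketched as follows; `run_h2` is a hypothetical placeholder for the MTAG/ldsc call, and a thread pool stands in for the process pool so the sketch stays self-contained:

```python
from multiprocessing.pool import ThreadPool

num_proc = 8  # matches the setting edited into the script

def run_h2(phen):
    """Hypothetical placeholder for invoking ldsc (via MTAG) on one phenotype."""
    return (phen, "h2_result")

my_phens = [f"pheno_{i}" for i in range(20)]  # this batch's share of phenotypes
with ThreadPool(num_proc) as pool:            # real script: a process pool
    results = pool.map(run_h2, my_phens)      # num_proc phenotypes at a time
```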


### Submission script (`ldsc_h2*_parallel_batch_submit.sh`)

**WARNING: THIS SCRIPT WILL CREATE NEW CLUSTERS BUT WILL NOT SHUT THEM DOWN**

This is a simple bash script to loop job submission of `ldsc_h2*_parallel_batch.py` for each batch of phenotypes, assuming parallelization to `$maxi` batches. A new cluster is spun up and a job submitted to each cluster using [cloudtools](https://github.com/Nealelab/cloudtools). If using this script, please monitor the jobs and **shut down each cluster when it completes**.


## Aggregating results (`agg_ldsc.sh`)

Downloads the results for each ldsc batch, combines them into a single file, uploads that file back to the cloud, and creates a local Rdata object for use in R markdown (see next). Descriptive information is also incorporated from the appropriate phenosummary files.
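The combine step amounts to concatenating the per-batch TSVs while keeping a single header line; a minimal sketch (file download/upload and the phenosummary merge are omitted, and inputs are simply each batch file's text):

```python
# Sketch of merging per-batch result tables; paths/IO omitted for brevity.
def combine_batches(batch_texts):
    """Concatenate per-batch TSV contents, keeping only the first header line."""
    header = batch_texts[0].splitlines()[0]
    body = [line for text in batch_texts for line in text.splitlines()[1:]]
    return "\n".join([header] + body)
```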


# Results

* [Results browser](https://nealelab.github.io/UKBB_ldsc/index.html)
* [Full downloads](https://www.dropbox.com/sh/g28ee03vy4hvqw5/AADAkDbSFHsYE8bme1Jjeekca/ldsc_results?dl=0)
Site generated by Raymond Walters (i.e. all mistakes are his), with helpful improvements from Duncan Palmer and support from the full Neale Lab UKB team. See the site for [full credits](https://nealelab.github.io/UKBB_ldsc/credits.html) for this project.

For any questions/comments/feedback, email **[email protected]**.
1 change: 1 addition & 0 deletions UKBB_ldsc_scripts
Submodule UKBB_ldsc_scripts added at 182a6f
20 changes: 0 additions & 20 deletions docs/code.txt

This file was deleted.
