Skip to content

Pipeline for GWAS summary statistic dataset processing

Notifications You must be signed in to change notification settings

twillis209/GWAS_tools

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GWAS_tools, a pipeline for processing GWAS summary statistics into a uniform format

This is a fork of GWAS_tools, restructuring what was a bash program into a snakemake pipeline depending on a series of R scripts.

All but one dependency is managed using the accompanying docker image. snakemake will pull this from Dockerhub for you, but for the sake of transparency I've also included the Dockerfile at docker/Dockerfile if you would like to examine its contents. I've not yet found a stable download URL for the liftOver executable and I think copying my own instance may violate some sort of copyright, so for now you'll have to put it on your own path (download here).

This pipeline also depends on the 1kGP_pipeline, also hosted here on GitHub. This workflow downloads and processes the 1000 Genomes Project Phase 3 data and its use should be entirely transparent as snakemake will download it and make use of it through the module statement.

A version without containers

At present I'm not able to run this pipeline in my current HPC environment due to some issues with apptainer permissions, so have had to roll back the containerisation in the dedicated conda branch (master contains the docker version). This environment does not container liftOver and all dependencies are listed in workflow/envs/global.yaml.

Importing this workflow in another snakemake workflow using module (and the role of the config)

I wrote this workflow so it could be plugged into others with use of the module statement. I do this like so in the igad_paper_pipeline worfklow:

module gwas_pipeline:
    snakefile: github("twillis209/GWAS_tools", path = "workflow/Snakefile", branch = "conda")
    config: config["GWAS_tools"]

Overwriting the workflow's config with the config statement is necessary to customise the GWAS_tools workflow's handling of the data sets in your importing workflow, typically to tell it which column contains the ref allele and which the alt allele, e.g.

 scz:
    ref: "^A1$"
    alt: "^A2$"

You can also list extraneous_columns which you would like GWAS_tools to drop and snps_to_retain which do not appear in the 1kG reference panel used to prune SNPs in the later part of the pipeline. See the configfile for igad_paper_pipeline for an example of customising the workflow's behaviour in this way.

Testing the pipeline

The workflow/rules/test_rules.smk file contains a target which can be used to run the pipeline on a rheumatoid arthritis data set from Okada et al..

From the root directory of the repository, run:

snakemake test_pipeline

You'll want to pass your own profile using the --profile CLI argument, too.

About

Pipeline for GWAS summary statistic dataset processing

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 60.6%
  • Python 37.2%
  • Dockerfile 2.2%