For next release

pgxcentre · Jun 22, 2016 · 795c7db
2 parents 5221a86 + 28c24bd
commit 795c7db

Showing 63 changed files with 6,225 additions and 2,965 deletions.
diff --git a/.coveragerc b/.coveragerc
@@ -1,6 +1,6 @@
 [run]
 branch = True
-include = genipe*
+source = genipe
 
 [report]
 exclude_lines = 

diff --git a/.travis.yml b/.travis.yml
@@ -1,9 +1,9 @@
 language: python
 python:
-  - "3.3"
   - "3.4"
+  - "3.5"
 before_install:
-  - "wget http://repo.continuum.io/miniconda/Miniconda3-3.4.2-Linux-x86_64.sh -O miniconda.sh"
+  - "wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh"
   - "bash miniconda.sh -b -p $HOME/miniconda"
   - "export PATH=$HOME/miniconda/bin:$PATH"
   - "hash -r"
@@ -15,10 +15,12 @@ before_install:
   - "conda info -a"
   - "python --version"
 install:
+  - "conda install -q nomkl"
   - "conda install -q jinja2"
   - "conda install -q numpy"
   - "conda install -q pandas"
   - "conda install -q scipy"
+  - "conda install -q patsy"
   - "conda install -q statsmodels"
   - "pip install --no-deps pyfaidx"
   - "pip install --no-deps lifelines"

diff --git a/README.mkd b/README.mkd
@@ -18,8 +18,8 @@ Full documentation is available at
 
 ## Installation
 
-We recommend installing the package in a Python 3 virtual environment. There
-are two ways to install: `pip` or `conda`.
+We recommend installing the package in a Python 3.4 (or later) virtual
+environment. There are two ways to install: `pip` or `conda`.
 
 ```bash
 # Using pip
@@ -41,12 +41,12 @@ The complete installation procedure is available in the
 
 ### Dependencies
 
-The tool requires a standard [Python](http://python.org/) 3 installation with
-the following modules:
+The tool requires a standard [Python](http://python.org/) 3.4 (or later)
+installation with the following modules:
 
 * `numpy` version 1.8.2 and latest
 * `Jinja2` version 2.7.3 and latest
-* `pandas` version 0.15.2 and latest
+* `pandas` version 0.17.0 and latest
 * `setuptools` version 12.0.5 and latest
 
 The tool requires the binaries for
@@ -62,6 +62,7 @@ and Cox's regressions), `genipe` requires the following Python modules:
 
 * `Matplotlib` version 1.4.2 or latest
 * `scipy` version 0.15.1 or latest
+* `patsy` version 0.4.1 or latest
 * `statsmodels` version 0.6.1 or latest
 * `lifelines` version 0.7.0 or latest
 * `Biopython` version 1.65 or latest
@@ -99,20 +100,27 @@ analysis.
 ```console
 $ genipe-launcher --help
 usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
-                       [--reference FILE] [--output-dir DIR] [--bgzip]
-                       [--use-drmaa] [--drmaa-config FILE] [--preamble FILE]
+                       [--reference FILE] [--chrom CHROM [CHROM ...]]
+                       [--output-dir DIR] [--bgzip] [--use-drmaa]
+                       [--drmaa-config FILE] [--preamble FILE]
                        [--shapeit-bin BINARY] [--shapeit-thread INT]
-                       [--plink-bin BINARY] [--impute2-bin BINARY]
-                       [--segment-length BP] --hap-template TEMPLATE
-                       --legend-template TEMPLATE --map-template TEMPLATE
-                       --sample-file FILE [--filtering-rules RULE [RULE ...]]
-                       [--probability FLOAT] [--completion FLOAT]
-                       [--info FLOAT] [--report-number NB]
-                       [--report-title TITLE] [--report-author AUTHOR]
+                       [--shapeit-extra OPTIONS] [--plink-bin BINARY]
+                       [--hap-template TEMPLATE] [--legend-template TEMPLATE]
+                       [--map-template TEMPLATE] --sample-file FILE
+                       [--hap-nonPAR FILE] [--hap-PAR1 FILE] [--hap-PAR2 FILE]
+                       [--legend-nonPAR FILE] [--legend-PAR1 FILE]
+                       [--legend-PAR2 FILE] [--map-nonPAR FILE]
+                       [--map-PAR1 FILE] [--map-PAR2 FILE]
+                       [--impute2-bin BINARY] [--segment-length BP]
+                       [--filtering-rules RULE [RULE ...]]
+                       [--impute2-extra OPTIONS] [--probability FLOAT]
+                       [--completion FLOAT] [--info FLOAT]
+                       [--report-number NB] [--report-title TITLE]
+                       [--report-author AUTHOR]
                        [--report-background BACKGROUND]
 
 Execute the genome-wide imputation pipeline. This script is part of the
-'genipe' package, version 1.2.3.
+'genipe' package, version 1.3.0.
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -127,6 +135,8 @@ Input Options:
                         reference files) (optional).
 
 Output Options:
+  --chrom CHROM [CHROM ...]
+                        The chromosomes to process.
   --output-dir DIR      The name of the output directory. [genipe]
   --bgzip               Use bgzip to compress the impute2 files.
 
@@ -144,13 +154,15 @@ HPC Options:
 SHAPEIT Options:
   --shapeit-bin BINARY  The SHAPEIT binary if it's not in the path.
   --shapeit-thread INT  The number of thread for phasing. [1]
+  --shapeit-extra OPTIONS
+                        SHAPEIT extra parameters. Put extra parameters between
+                        single or normal quotes (e.g. --shapeit-extra '--
+                        states 100 --window 2').
 
 Plink Options:
   --plink-bin BINARY    The Plink binary if it's not in the path.
 
-IMPUTE2 Options:
-  --impute2-bin BINARY  The IMPUTE2 binary if it's not in the path.
-  --segment-length BP   The length of a single segment for imputation. [5e+06]
+IMPUTE2 Autosomal Reference:
   --hap-template TEMPLATE
                         The template for IMPUTE2's haplotype files (replace
                         the chromosome number by '{chrom}', e.g.
@@ -164,8 +176,36 @@ IMPUTE2 Options:
                         chromosome number by '{chrom}', e.g.
                         'genetic_map_chr{chrom}_combined_b37.txt').
   --sample-file FILE    The name of IMPUTE2's sample file.
+
+IMPUTE2 Chromosome X Reference:
+  --hap-nonPAR FILE     The IMPUTE2's haplotype file for the non-
+                        pseudoautosomal region of chromosome 23.
+  --hap-PAR1 FILE       The IMPUTE2's haplotype file for the first
+                        pseudoautosomal region of chromosome 23.
+  --hap-PAR2 FILE       The IMPUTE2's haplotype file for the second
+                        pseudoautosomal region of chromosome 23.
+  --legend-nonPAR FILE  The IMPUTE2's legend file for the non-pseudoautosomal
+                        region of chromosome 23.
+  --legend-PAR1 FILE    The IMPUTE2's legend file for the first
+                        pseudoautosomal region of chromosome 23.
+  --legend-PAR2 FILE    The IMPUTE2's legend file for the second
+                        pseudoautosomal region of chromosome 23.
+  --map-nonPAR FILE     The IMPUTE2's map file for the non-pseudoautosomal
+                        region of chromosome 23.
+  --map-PAR1 FILE       The IMPUTE2's map file for the first pseudoautosomal
+                        region of chromosome 23.
+  --map-PAR2 FILE       The IMPUTE2's map file for the second pseudoautosomal
+                        region of chromosome 23.
+
+IMPUTE2 Options:
+  --impute2-bin BINARY  The IMPUTE2 binary if it's not in the path.
+  --segment-length BP   The length of a single segment for imputation. [5e+06]
   --filtering-rules RULE [RULE ...]
                         IMPUTE2 filtering rules (optional).
+  --impute2-extra OPTIONS
+                        IMPUTE2 extra parameters. Put the extra parameters
+                        between single or normal quotes (e.g. --impute2-extra
+                        '-buffer 250 -Ne 20000').
 
 IMPUTE2 Merger Options:
   --probability FLOAT   The probability threshold for no calls. [<0.9]
@@ -223,7 +263,7 @@ usage: imputed-stats [-h] [-v] {cox,linear,logistic,mixedlm,skat} ...
 
 Performs statistical analysis on imputed data (either SKAT analysis, or
 linear, logistic or survival regression). This script is part of the 'genipe'
-package, version 1.2.3).
+package, version 1.3.0.
 
 optional arguments:
   -h, --help            show this help message and exit
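
Review note: the help text above says that the reference templates replace the chromosome number with `{chrom}`, and that extra SHAPEIT/IMPUTE2 parameters are passed as one quoted string. A minimal Python sketch of how such values could be expanded follows. The helper names `expand_template` and `split_extra_options` are hypothetical and are not taken from genipe's source; the sketch only assumes `str.format`-style placeholders and shell-like splitting of the quoted string.

```python
import shlex


def expand_template(template, chromosomes):
    """Return a {chromosome: file name} mapping for a '{chrom}' template."""
    return {chrom: template.format(chrom=chrom) for chrom in chromosomes}


def split_extra_options(extra):
    """Split one quoted option string into an argument list (shell-like rules)."""
    return shlex.split(extra)


if __name__ == "__main__":
    # Autosomes 1 to 22, using the template quoted in the help text.
    maps = expand_template("genetic_map_chr{chrom}_combined_b37.txt", range(1, 23))
    print(maps[1])
    print(split_extra_options("-buffer 250 -Ne 20000"))
```

Splitting with `shlex` rather than `str.split` keeps any quoted value containing spaces together as a single argument, which is the safer choice when the list is later handed to a subprocess call.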

diff --git a/conda_build.sh b/conda_build.sh
@@ -0,0 +1,79 @@
+#!/usr/bin/env bash
+
+# Getting genipe's version to build
+genipe_version=$1
+if [ -z "$genipe_version" ]
+then
+    echo "usage: $0 VERSION" 1>&2
+    exit 1
+fi
+
+# Creating a directory for the build module
+mkdir -p conda_dist
+
+# Creating a directory for the skeleton
+mkdir -p skeleton
+pushd skeleton
+
+# Creating the skeleton
+conda skeleton pypi genipe --version $genipe_version
+
+# Checking that fetching genipe was successful
+if [ $? -ne 0 ]
+then
+    echo "Error when creating skeleton for genipe version $genipe_version" 1>&2
+    exit 1
+fi
+
+# The different python versions and platforms
+python_versions="3.4 3.5"
+platforms="linux-32 linux-64 osx-64"
+
+# Building
+for python_version in $python_versions
+do
+    # Building
+    conda build --python $python_version genipe &> log.txt
+
+    # Checking the build was completed
+    if [ $? -ne 0 ]
+    then
+        cat log.txt
+        echo "Error when building genipe $genipe_version (python" \
+             "$python_version)" 1>&2
+        exit 1
+    fi
+
+    # Fetching the file name of the build
+    filename=$(grep -E "^# [$] anaconda upload \S+$" log.txt | cut -d " " -f 5)
+
+    # Checking the file exists
+    if [ -z "$filename" ] || [ ! -e "$filename" ]
+    then
+        echo "Problem fetching file $filename" 1>&2
+        exit 1
+    fi
+
+    # Converting to the different platforms
+    for platform in $platforms
+    do
+        conda convert -p "$platform" "$filename" -o ../conda_dist
+
+        # Checking the conversion was completed
+        if [ $? -ne 0 ]
+        then
+            echo "Problem converting genipe $genipe_version (python" \
+                 "$python_version) to $platform" 1>&2
+            exit 1
+        fi
+
+    done
+done
+
+popd
+rm -rf skeleton
+
+# Indexing
+pushd conda_dist
+conda index *
+popd
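
Review note: the script recovers the built package path by matching the `# $ anaconda upload <path>` line in `log.txt` and cutting the fifth space-separated field. The same extraction can be sketched in Python, which may be easier to test than the `egrep | cut` pipeline. The function name `find_built_package` and the sample path are hypothetical; the log line format is assumed only from the pattern used in the script.

```python
import re

# Same pattern as the script: a whole line of the form
# '# $ anaconda upload <path-without-spaces>'.
UPLOAD_LINE = re.compile(r"^# \$ anaconda upload (\S+)$", re.MULTILINE)


def find_built_package(log_text):
    """Return the path on the first matching line, or None if there is none."""
    match = UPLOAD_LINE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = "BUILD START\n# $ anaconda upload /tmp/pkgs/genipe-py35_0.tar.bz2\nDONE\n"
    print(find_built_package(sample))
```

Unlike the shell pipeline, this sketch returns only the first match, so a log with several upload lines would not yield a multi-line value.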
diff --git a/docs/_static/images/Linear_Walltime.png b/docs/_static/images/Linear_Walltime.png
diff --git a/docs/_static/images/Linear_Walltime_Plink.png b/docs/_static/images/Linear_Walltime_Plink.png
diff --git a/docs/_static/images/Logistic_Walltime.png b/docs/_static/images/Logistic_Walltime.png
diff --git a/docs/_static/images/Logistic_Walltime_Plink.png b/docs/_static/images/Logistic_Walltime_Plink.png
diff --git a/docs/_static/images/MixedLM_TS_Diff.png b/docs/_static/images/MixedLM_TS_Diff.png
diff --git a/docs/_static/images/MixedLM_Walltime.png b/docs/_static/images/MixedLM_Walltime.png
diff --git a/docs/_static/images/Survival_Walltime.png b/docs/_static/images/Survival_Walltime.png
diff --git a/docs/_static/images/execution_time.png b/docs/_static/images/execution_time.png
diff --git a/docs/_static/tutorial/phenotypes_mixedlm.txt.bz2 b/docs/_static/tutorial/phenotypes_mixedlm.txt.bz2
diff --git a/docs/execution_time.rst b/docs/execution_time.rst
@@ -0,0 +1,45 @@
+
+.. _stats-exec-time:
+
+Statistical Analysis Execution Time
+====================================
+
+GWAS analysis of imputed markers is computationally intensive. While it is
+feasible to run such analyses on some simple models like linear and logistic
+regression, more complex models like Cox regression and mixed linear models
+require more computing power or specialized implementations.
+
+We have optimized the mixed linear model analysis to significantly decrease
+computation time. Using a two-step approach (as described by Sikorska *et al.*,
+2015 [doi: `10.1038/ejhg.2015.1
+<http://www.nature.com/ejhg/journal/v23/n10/abs/ejhg20151a.html>`_]), the
+execution time is comparable to that of a simple linear regression. Prior to
+optimization, the analysis of chromosome 2 took 53 hours, split into 33
+sub-analyses of 6 threads each (198 threads in total).
+
+The following figure shows the execution time for a typical imputation analysis
+of chromosome 2, imputed for 5,045 samples. Chromosome 2 comprised a total of
+1,170,797 loci, of which 961,019 were of sufficient quality and 528,932 had a
+MAF higher than 1%. The black dashed line is the execution time for Plink.
+
+.. figure:: _static/images/execution_time.png
+   :align: center
+   :width: 70%
+   :alt: Statistical analysis execution time.
+
+.. note::
+
+   On some installations, when executing the analysis with *n* threads,
+   *OPENBLAS* automatically uses all the CPUs for each thread, such that the
+   load quickly increases to *n* times the number of CPUs. Such high load slows
+   down the analysis considerably.
+
+   To avoid this, always export the following environment variable and specify
+   the total number of threads using the ``--nb-process`` option.
+
+   .. code-block:: bash
+
+      export OPENBLAS_NUM_THREADS=1
+
+We are planning to optimize Cox's proportional hazards regression in the
+near future.
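
Review note: the note in `execution_time.rst` recommends exporting `OPENBLAS_NUM_THREADS=1` in the shell. For callers who drive the analysis from Python, the same limit can be set programmatically, as in the sketch below. The helper name `pin_blas_threads` is hypothetical and is not part of genipe. The one assumption that matters is ordering: OpenBLAS reads the variable when the library is loaded, so it has to be set before NumPy is first imported.

```python
import os


def pin_blas_threads(n_threads=1):
    """Limit OpenBLAS to n_threads; call this before NumPy is first imported."""
    os.environ["OPENBLAS_NUM_THREADS"] = str(n_threads)
    return os.environ["OPENBLAS_NUM_THREADS"]


if __name__ == "__main__":
    pin_blas_threads(1)
    # Import NumPy (or anything that imports it) only after this point, e.g.:
    # import numpy as np
    print(os.environ["OPENBLAS_NUM_THREADS"])
```

Child processes started afterwards inherit the environment, so each worker stays at one BLAS thread and the overall parallelism is governed by the `--nb-process` value alone.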