Merge branch 'dev'
lemieuxl committed Nov 3, 2015
2 parents 60c8f55 + 76f3162 commit 7bbb72e
Showing 8 changed files with 1,480 additions and 192 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -12,4 +12,6 @@ pyplink/version.py
.coveragerc
htmlcov

.ipynb_checkpoints

build
1 change: 1 addition & 0 deletions .travis.yml
@@ -3,4 +3,5 @@ python:
- "2.7"
- "3.3"
- "3.4"
- "3.5"
script: "python setup.py test"
162 changes: 15 additions & 147 deletions README.mkd
@@ -2,9 +2,9 @@
[![PyPI version](https://badge.fury.io/py/pyplink.svg)](http://badge.fury.io/py/pyplink)


# pyplink - Module to read binary files from Plink
# pyplink - Module to process Plink's binary files

`PyPlink` is a Python module to read binary Plink files.
`PyPlink` is a Python module to read and write Plink's binary files.


## Dependencies
@@ -13,7 +13,8 @@ The tool requires a standard [Python](http://python.org/) installation (2.7 or
3.4) with the following modules:

1. [numpy](http://www.numpy.org/) version 1.8.2 or later
1. [pandas](http://pandas.pydata.org/) version 0.14.1 or later
2. [pandas](http://pandas.pydata.org/) version 0.14.1 or later
3. [six](https://pythonhosted.org/six/) version 1.9.0 or later

The tool has been tested on *Linux* only, but should work on *MacOS* and
*Windows* operating systems as well.
@@ -34,8 +35,8 @@ conda install pyplink -c http://statgen.org/wp-content/uploads/Softwares/pyplink
```

It is possible to add the channel to conda's configuration, so that the
`-c http://statgen.org/...` can be omitted to update or install. Only perform
the following command:
`-c http://statgen.org/...` can be omitted to update or install the package.
To add the channel, perform the following command:

```bash
conda config --add channels http://statgen.org/wp-content/uploads/Softwares/pyplink
@@ -61,156 +62,23 @@ conda update pyplink -c http://statgen.org/wp-content/uploads/Softwares/pyplink
```


## Example

This example describes how to work with the `pyplink` module.


### Data description

```python
>>> from pyplink import PyPlink
>>> pedfile = PyPlink("prefix")
>>> pedfile.get_nb_samples()
10656
>>> pedfile.get_nb_markers()
141701
>>> samples = pedfile.get_fam()
>>> samples.head()
fid iid father mother gender status
0 Sample_1 Sample_1 0 0 2 -9
1 Sample_2 Sample_2 0 0 1 -9
2 Sample_3 Sample_3 0 0 2 -9
3 Sample_4 Sample_4 0 0 1 -9
4 Sample_5 Sample_5 0 0 2 -9
>>> all_markers = pedfile.get_bim()
>>> all_markers.head()
chrom pos a1 a2
snp
exm2268640 1 762320 A G
exm47 1 865628 A G
exm53 1 865665 A G
exm55 1 865694 A G
exm56 1 865700 A G
```


### Iterating over all markers

Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is
unknown, `0` is homozygous (major allele), `1` is heterozygous and `2` is
homozygous (minor allele).

```python
>>> for marker_id, genotypes in pedfile:
... print(marker_id, genotypes)
... break
...
exm2268640 [0 0 0 ..., 0 0 0]
>>> for marker_id, genotypes in pedfile.iter_geno():
... print(marker_id, genotypes)
... break
...
exm2268640 [0 0 0 ..., 0 0 0]
```

Cycling through genotypes as `A`, `C`, `G`, `T` values.

```python
>>> for marker_id, genotypes in pedfile.iter_acgt_geno():
... print(marker_id, genotypes)
... break
...
exm2268640 ['GG' 'GG' 'GG' ..., 'GG' 'GG' 'GG']
```
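
The relation between the two encodings can be sketched with a small lookup
table. This is only an illustration, not the module's own code: the helper name
`additive_to_acgt` is hypothetical, the mapping (`0` gives `a2a2`, `1` gives
`a1a2`, `2` gives `a1a1`) is inferred from the `exm47` output above (`a1 = A`,
`a2 = G`, genotype `0` shown as `GG`), and the `"00"` placeholder for unknown
genotypes is an assumption.

```python
import numpy as np


def additive_to_acgt(genotypes, a1, a2, missing="00"):
    """Translate additive genotypes (-1, 0, 1, 2) into two-letter strings.

    Hypothetical helper: the mapping is inferred from the README output and
    the `missing` placeholder is an assumption.
    """
    # The table is indexed by genotype + 1, so that -1 (unknown) hits index 0.
    lookup = np.array([missing, a2 + a2, a1 + a2, a1 + a1])
    return lookup[np.asarray(genotypes) + 1]


print(additive_to_acgt([0, 1, 2, -1], a1="A", a2="G"))
```

Indexing a NumPy array with an integer array translates every sample in one
step, which avoids a Python-level loop when there are thousands of samples.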


### Iterating over selected markers

Cycling through genotypes as `-1`, `0`, `1` and `2` values.

```python
>>> markers = ["exm47", "exm2253575", "exm269"]
>>> for marker_id, genotypes in pedfile.iter_geno_marker(markers):
... print(marker_id, genotypes)
...
exm47 [0 0 0 ..., 0 0 0]
exm2253575 [1 1 1 ..., 0 0 2]
exm269 [0 0 0 ..., 0 1 0]
```

Cycling through genotypes as `A`, `C`, `G`, `T` values.

```python
>>> for marker_id, genotypes in pedfile.iter_acgt_geno_marker(markers):
... print(marker_id, genotypes)
...
exm47 ['GG' 'GG' 'GG' ..., 'GG' 'GG' 'GG']
exm2253575 ['GA' 'GA' 'GA' ..., 'AA' 'AA' 'GG']
exm269 ['GG' 'GG' 'GG' ..., 'GG' 'AG' 'GG']
```


### Getting a single marker

To get the genotypes (as `-1`, `0`, `1` and `2` values) of a single marker:
```python
>>> pedfile.get_geno_marker("exm47")
[0 0 0 ..., 0 0 0]
```

To get the genotypes (as `A`, `C`, `G`, `T` values) of a single marker:
```python
>>> pedfile.get_acgt_geno_marker("exm47")
['GG' 'GG' 'GG' ..., 'GG' 'GG' 'GG']
```


### Misc example

To get all markers on the Y chromosomes for the males.

```python
>>> y_markers = all_markers[all_markers.chrom == 23].index.values
>>> males = samples.gender == 1
>>> for marker_id, genotypes in pedfile.iter_geno_marker(y_markers):
... male_genotypes = genotypes[males.values]
... print("{:,d} total genotypes".format(len(genotypes)))
... print("{:,d} genotypes for {:,d} "
... "males".format(len(male_genotypes), males.sum()))
... break
...
10,656 total genotypes
6,297 genotypes for 6,297 males
```

To count the minor allele frequency.

```python
>>> from collections import Counter
>>> markers = ["exm47", "exm2253575", "exm269"]
>>> for marker_id, genotypes in pedfile.iter_geno_marker(markers):
... geno_counter = Counter(genotypes)
... nb_alleles = sum(geno_counter[i] * 2 for i in range(3))
... maf = (geno_counter[2] * 2 + geno_counter[1]) / nb_alleles
... print(marker_id, maf)
...
exm47 0.0070389488503050214
exm2253575 0.3461719116956318
exm269 0.05987799155326138
```
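
The same computation can be sketched with NumPy on a plain array, without a
`PyPlink` object. The function name `coded_allele_freq` is hypothetical. Two
assumptions are made explicit here: unknown genotypes (`-1`) are dropped before
counting, and the division is forced to floating point so that the result is
the same on Python 2 and Python 3. Strictly speaking, the value is the
frequency of the allele counted by the `1` and `2` codes, which matches the
minor allele frequency only when that allele is the rarer one.

```python
import numpy as np


def coded_allele_freq(genotypes):
    """Frequency of the allele counted by the 0/1/2 coding (hypothetical helper)."""
    genotypes = np.asarray(genotypes)

    # Unknown genotypes (-1) carry no allele information, so they are excluded.
    called = genotypes[genotypes != -1]
    if called.size == 0:
        return float("nan")

    # Each called sample contributes two alleles to the denominator.
    return float(called.sum()) / (2 * called.size)


print(coded_allele_freq([0, 1, 2, -1]))
```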


## Testing

To test the module, just perform the following command:

```python
>>> import pyplink
>>> pyplink.test()
.................
.......................
----------------------------------------------------------------------
Ran 17 tests in 0.060s
Ran 23 tests in 0.149s

OK
```


## Example

The following
[notebook](http://nbviewer.ipython.org/github/lemieuxl/pyplink/blob/binary_write/demo/PyPlink%20Demo.ipynb)
contains a demonstration (for both Python 2 and 3) of the `PyPlink` module.
34 changes: 34 additions & 0 deletions conda_build.sh
@@ -0,0 +1,34 @@
#!/usr/bin/env bash

# Creating a directory for the skeleton
mkdir -p skeleton
pushd skeleton

# Creating the skeleton
conda skeleton pypi pyplink

# The different python versions and platforms
python_versions="2.7 3.3 3.4 3.5"
platforms="linux-32 linux-64 osx-64 win-32 win-64"

# Building
for python_version in $python_versions
do
# Building
conda build --python $python_version pyplink &> log.txt
filename=$(egrep "^# [$] binstar upload \S+$" log.txt | cut -d " " -f 5)

# Converting
for platform in $platforms
do
conda convert -p $platform $filename -o ../conda_dist
done
done

popd
rm -rf skeleton

# Indexing
pushd conda_dist
conda index *
popd
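
The `egrep ... | cut` step in the script above pulls the built package path out
of the `conda build` log. A rough Python equivalent is sketched below for
readers who want to check the pattern in isolation. The function name
`find_package_path` and the sample log contents (including the path) are
made up for illustration and do not come from a real build.

```python
import re

# Same pattern as the egrep call: a line made only of
# "# $ binstar upload <path>", where <path> contains no whitespace.
UPLOAD_LINE = re.compile(r"^# \$ binstar upload (\S+)$", re.MULTILINE)


def find_package_path(log_text):
    """Return the path from the first matching line, or None (hypothetical helper)."""
    match = UPLOAD_LINE.search(log_text)
    return match.group(1) if match else None


# Hypothetical log excerpt; the path below is invented for this example.
sample_log = (
    "BUILD START: example\n"
    "# If you want to upload this package to binstar.org later, type:\n"
    "# $ binstar upload /hypothetical/conda-bld/linux-64/example.tar.bz2\n"
)

print(find_package_path(sample_log))
```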