This repository has been archived by the owner on Dec 6, 2018. It is now read-only.

Commit

Merge pull request #32 from tesera/release-1.0.0
Release 1.0.0
jo-tham authored Sep 21, 2016
2 parents 015950b + 6cd151a commit 27f93a1
Showing 14 changed files with 225 additions and 83 deletions.
6 changes: 3 additions & 3 deletions Dockerfile
@@ -1,4 +1,3 @@

FROM r-base:latest

MAINTAINER Tesera Systems Inc.
@@ -17,8 +16,9 @@ RUN apt-get update && apt-get install -y \
bats \
&& rm -rf /var/lib/apt/lists/*

ENV PYLEARN_REF master
ENV RLEARN_REF master
ENV PYLEARN_REF v1.0.1
ENV PRELURN_REF v1.0.0
ENV RLEARN_REF v1.0.1

ENV WD /opt/learn
ENV HISTFILE $WD/.bash_history
72 changes: 39 additions & 33 deletions README.md
@@ -2,24 +2,18 @@

[ ![Codeship Status for tesera/learn-cli](https://codeship.com/projects/f2a31230-b7e8-0133-9192-1269d3e58a72/status?branch=master)](https://codeship.com/projects/134949)

learn-cli performs variable selection, model development and target dataset processing. It uses [pylearn](https://github.com/tesera/pylearn) and [rlearn](https://github.com/tesera/rlearn) libraries. The cli invokes rlearn function via rpy2.

Although the cli is docker ready you can choose to run the cli locally the old fashion way. Running the cli in Docker will simplify the efforts tremendously but Docker is not required.

### Prerequisites

* R
* Python 2.7
* rlearn `library('devtools'); install_github(repo='tesera/rlearn', dependencies=TRUE, ref='master');`
* AWS Access Keys (optional: for using S3 data location)

### Install

```console
$ pip install git+https://github.com/tesera/learn-cli.git
```

### Usage
learn-cli performs machine learning tasks, including variable selection, model
development and target dataset processing. It uses
[pylearn](https://github.com/tesera/pylearn),
[prelurn](https://github.com/tesera/prelurn), and
[rlearn](https://github.com/tesera/rlearn) libraries. The CLI invokes rlearn
functions via rpy2.

Support for developing and using the CLI is only provided if you are using
Docker, as the CLI has a fairly complex set of requirements (packages,
runtimes, etc.).

## Usage
```console
$ learn --help
Usage:
@@ -54,17 +48,13 @@ Examples:
learn discrat --xy-data s3://bucket/xy_reference.csv --x-data s3://bucket/x_filtered.csv --dfunct s3://bucket/dfunct.csv --idf s3://bucket/idf.csv --varset 18 --output s3://bucket/varsel
```

### Testing

`bats ./tests/intergration`

### Docker
## Setup with Docker

If you are using docker-machine make sure you have a machine running and that you have evaluated the machine environment.

#### Creating a Docker Machine Host VM
### Creating a Docker Machine Host VM

#####Windows Powershell
#### Windows Powershell
```console
$ docker-machine create --driver virtualbox --virtualbox-host-dns-resolver default
$ docker-machine env --shell powershell default | Invoke-Expression
@@ -76,19 +66,19 @@ $ docker-machine create --driver virtualbox default
$ eval "$(docker-machine env default)"
```

#### Running the container
### Running the container

```console
$ docker build -t learn .
$ docker run learn /bin/bash
root@1e36bb3275b5:/opt/learn# learn --help
```

#### Development
### Development

During development you will want to bring in the codebase with you in the container. You can simply use the Docker Compose command bellow. Once in the container run the `install-dependencies.sh` script passing in the `--dev` flag to make the project editable. This wil install all the Python dependencies in the project root under the `pysite folder and the R dependencies under the rlibs folder. You will only need to run this once unless you dependencies change.
During development you will want to bring the codebase with you into the container. You can simply use the Docker Compose command below. Once in the container, run the `install-dependencies.sh` script, passing in the `--dev` flag to make the project editable. This will install all the Python dependencies in the project root under the `pysite` folder and the R dependencies under the `rlibs` folder. You will only need to run this once unless your dependencies change.

You will need to add a `dev.env` file with at least `PYLEARN_REF` and `RLEARN_REF` variables set to the Github ref/version of the respective libraries. Optionaly you can also add you AWS Access Keys and region in order to use S3 as a data location.
You will need to add a `dev.env` file with at least the `PYLEARN_REF`, `RLEARN_REF` and `PRELURN_REF` variables set to the GitHub ref (branch or tag) of the respective libraries. Optionally, you can also add your AWS access keys and region in order to use S3 as a data location.

```console
$ cat dev.env
@@ -107,11 +97,27 @@ root@1e36bb3275b5:/opt/learn# bash ./install-dependencies --dev
root@1e36bb3275b5:/opt/learn# learn --help
```

#### Testing
### Testing

You can run the tests, which are written with bats, using the following docker compose task:

`docker-compose run tests`

### Contributing
You can also enter the container and run specific tests as follows:

```
> dc run dev
root@4d3df46d52c7:/opt/learn# bats tests/integration/
.DS_Store output/ test_cli_discrat.bats test_cli_varsel.bats
input/ test_cli_describe.bats test_cli_lda.bats
root@4d3df46d52c7:/opt/learn# bats tests/integration/test_cli_describe.bats
✓ describe runs and output expected files
1 test, 0 failures
```

## Contributing

- [Python Style Guide](https://www.python.org/dev/peps/pep-0008/)
- [R Style Guide](http://adv-r.had.co.nz/Style.html)
Refer to the [pylearn](https://github.com/tesera/pylearn#contribution-guidelines) and
[rlearn](https://github.com/tesera/rlearn#contribution-guidelines) contribution guidelines for how to
contribute.
6 changes: 3 additions & 3 deletions bin/install-dependencies.sh
@@ -6,12 +6,12 @@ mkdir -p {pysite,rlibs}

install2.r -l $R_LIBS_USER devtools

r ./bin/installGithub2.r tesera/rlearn -d TRUE -r ${RLEARN_REF-master}
r ./bin/installGithub2.r tesera/rlearn -d TRUE -r ${RLEARN_REF-v1.0.1}

pip install --user scipy awscli

pip install --user "git+https://github.com/tesera/pylearn.git@${PYLEARN_REF-master}"
pip install --user "git+https://github.com/tesera/prelurn.git@${PRELURN_REF-master}"
pip install --user "git+https://github.com/tesera/pylearn.git@${PYLEARN_REF-v1.0.1}"
pip install --user "git+https://github.com/tesera/prelurn.git@${PRELURN_REF-v1.0.0}"

rm -rf /tmp/downloaded_packages/ /tmp/*.rds
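
The `${RLEARN_REF-v1.0.1}` style expansions above fall back to the pinned tag only when the variable is unset; a variable that is set but empty is passed through unchanged (unlike `${VAR:-default}`). As a rough illustration in Python, `os.environ.get` has the same unset-only fallback. The helper name `resolve_ref` is hypothetical and is not part of learn-cli.

```python
import os


def resolve_ref(name, default, environ=None):
    """Return the git ref for a library, mirroring shell ${NAME-default}.

    The default is used only when NAME is unset; an empty string that was
    set explicitly is returned as-is.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, default)


if __name__ == '__main__':
    # Defaults copied from the install script in this commit.
    for name, default in [('RLEARN_REF', 'v1.0.1'),
                          ('PYLEARN_REF', 'v1.0.1'),
                          ('PRELURN_REF', 'v1.0.0')]:
        print('%s=%s' % (name, resolve_ref(name, default)))
```

Passing a plain dict as `environ` makes the fallback easy to check without touching the real environment.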

4 changes: 3 additions & 1 deletion docker-compose.yml
@@ -8,5 +8,7 @@ dev:
test:
container_name: learn
build: .
volumes:
- ./:/opt/learn
env_file: ./env/dev.env
command: ['bats ./tests/integration']
command: ['bats', './tests/integration']
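
The `command` change above matters because, as far as I understand Compose, a list is passed in exec form, where each element becomes one argv entry. The old single-element list would therefore ask for an executable literally named `bats ./tests/integration`. A small Python illustration of the difference, using `shlex.split` to show how the string should be tokenized (the variable names are only for this sketch):

```python
import shlex

old_command = ['bats ./tests/integration']      # one argv entry, space included
new_command = ['bats', './tests/integration']   # program plus one argument

# Tokenizing the old single string the way a shell would yields the new list.
assert shlex.split(old_command[0]) == new_command

print('old argv[0]: %r' % old_command[0])
print('new argv[0]: %r' % new_command[0])
```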
26 changes: 23 additions & 3 deletions learn/clients/analyze.py
@@ -1,15 +1,15 @@
import os
import logging

import pandas as pd
from rpy2.robjects import pandas2ri
pandas2ri.activate()

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

from schema import Schema, And, Or
from pylearn.lda import cohens_khat, combine_evaluation_datasets

from learn.utils import is_s3_url

logger = logging.getLogger('pylearn')
importr('MASS')
importr('logging')
@@ -35,9 +35,29 @@ def analyze(xy, config, yvar, output):

class Analyze(object):

@staticmethod
def _validate(args):
schema = Schema({
'--xy-data': Or(
os.path.isfile, is_s3_url,
error='<xy_reference_csv> should exist and be readable.'),
'--config': Or(
os.path.isfile, is_s3_url,
error='--config should exist and be readable.'),
'--output': Or(
os.path.exists, is_s3_url,
error='--output should exist and be writable.'),
'--yvar': And(str, len),
}, ignore_extra_keys=True)
args = schema.validate(args)
return args

def run(self, args):
# disable cloudwatch rlearn logging until it is prod ready
rlearn.logger_init(log_toAwslogs=False)

logger.info('Validating args')
args = self._validate(args)
outdir = args['--output']
yvar = args['--yvar']

21 changes: 21 additions & 0 deletions learn/clients/describe.py
@@ -2,7 +2,9 @@
import logging
import pandas as pd
import prelurn
from schema import Schema, And, Or, Optional

from learn.utils import is_s3_url

logger = logging.getLogger('pylearn')

@@ -11,6 +13,25 @@ class Describe(object):
def __init__(self):
pass

@staticmethod
def _validate(args):
schema = Schema({
'--xy-data': Or(
os.path.isfile, is_s3_url,
error='<xy_reference_csv> should exist and be readable.'),
Optional('--quantile-type'): And(
str,
lambda s: s in ('decile', 'quartile'),
error='--quantile-type should be one of: decile, quartile.'),
'--output': Or(
os.path.exists, is_s3_url,
error='--output should exist and be writable.'),
Optional('--format'): And(str, lambda s: s in ('json', 'csv')),
}, ignore_extra_keys=True)
args = schema.validate(args)
return args


def run(self, args):
logger.info('Running describe')

41 changes: 37 additions & 4 deletions learn/clients/discrating.py
@@ -1,9 +1,11 @@
import os
import sys
import logging
import pandas as pd

from schema import Schema, And, Or
from pylearn.discrating import predict

from learn.utils import is_s3_url

logger = logging.getLogger('pylearn')

@@ -13,14 +15,45 @@ class Discrating(object):
def __init__(self):
logger.info("running discriminant ratings...")

@staticmethod
def _validate(args):
schema = Schema({
'--xy-data': Or(
os.path.isfile, is_s3_url,
error='--xy-data should exist and be readable.'),
'--x-data': Or(
os.path.isfile, is_s3_url,
error='--x-data should exist and be readable.'),
'--dfunct': Or(
os.path.isfile, is_s3_url,
error='--dfunct should exist and be readable.'),
'--idf': Or(
os.path.exists, is_s3_url,
error='--idf should exist and be writable.'),
'--output': Or(
os.path.exists, is_s3_url,
error='--output should exist and be writable.'),
'--yvar': And(str, len),
}, ignore_extra_keys=True)
args = schema.validate(args)
return args

def run(self, args):
logger.info("invoking predict with varset: %s", args['--varset'])
varset = int(args['--varset'])
dfunct = pd.read_csv(args['--dfunct'])

# This is a hack to avoid handling the check in pylearn for now. Ideally pylearn
# would raise an exception that we can catch when running predict.
if varset not in dfunct.VARSET3.unique():
msg = "varset '%d' missing from dfunct" %varset
logger.error(msg)
sys.exit(msg)

pargs = {
'xy': pd.read_csv(args['--xy-data']),
'x_filtered': pd.read_csv(args['--x-data']),
'dfunct': pd.read_csv(args['--dfunct']),
'varset': int(args['--varset']),
'dfunct': dfunct,
'varset': varset,
'yvar': args['--yvar'],
'idf': pd.read_csv(args['--idf']),
}
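
The comment in `run` above notes that a missing varset should ideally surface as an exception rather than a `sys.exit` call. A minimal sketch of that alternative follows, using a plain list in place of `dfunct.VARSET3.unique()`. `MissingVarsetError` and `require_varset` are hypothetical names, not pylearn API.

```python
class MissingVarsetError(ValueError):
    """Raised when the requested varset is absent from the dfunct table."""


def require_varset(varset, available):
    """Return varset as an int if it is among the available ids, else raise."""
    varset = int(varset)
    if varset not in set(available):
        raise MissingVarsetError("varset '%d' missing from dfunct" % varset)
    return varset


if __name__ == '__main__':
    print(require_varset('18', [10, 18, 23]))
    try:
        require_varset('99', [10, 18, 23])
    except MissingVarsetError as exc:
        print('error: %s' % exc)
```

Raising a dedicated exception would let the CLI layer decide how to log the problem and which exit code to use, instead of exiting from inside the client class.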
33 changes: 29 additions & 4 deletions learn/clients/varselect.py
@@ -1,12 +1,12 @@
import os
import logging
import pandas as pd
from learn.utils import is_s3_url
from rpy2.robjects import pandas2ri
pandas2ri.activate()

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

from schema import Schema, And, Or, Optional
from pylearn.varselect import (count_xvars, rank_xvars, extract_xvar_combos,
remove_high_corvar)

@@ -20,14 +20,16 @@
def var_select(xy, config, args):
args['--nSolutions'], args['--minNvar'], args['--maxNvar'] = args['--iteration'].split(':')

varselect = rlearn.vs_selectVars(xy=xy, config=config,
varselect = rlearn.vs_selectVars(
xy=xy, config=config,
yName=args['--yvar'],
removeRowValue=-1,
removeRowColName='SORTGRP',
improveCriteriaVarName=args['--criteria'],
minNumVar=int(args['--minNvar']),
maxNumVar=int(args['--maxNvar']),
nSolutions=int(args['--nSolutions']))
nSolutions=int(args['--nSolutions'])
)

return pandas2ri.ri2py(varselect)
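
`var_select` unpacks `--iteration` as `nSolutions:minNvar:maxNvar` and converts each part with `int` at the call site. Below is a small sketch of parsing and checking that value up front. `parse_iteration` is a hypothetical helper, and the min/max ordering check is an assumption about what is sensible rather than a documented rlearn constraint.

```python
def parse_iteration(value):
    """Parse 'nSolutions:minNvar:maxNvar' into a tuple of three ints."""
    parts = value.split(':')
    if len(parts) != 3:
        raise ValueError(
            "--iteration must look like 'nSolutions:minNvar:maxNvar', "
            "got %r" % value)
    # int() raises ValueError itself for non-numeric parts.
    n_solutions, min_nvar, max_nvar = (int(p) for p in parts)
    if min_nvar > max_nvar:
        # Assumed sanity check, not taken from rlearn.
        raise ValueError('minNvar must not exceed maxNvar')
    return n_solutions, min_nvar, max_nvar
```

With a helper like this, a malformed value fails with a readable message before any R code is invoked.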

@@ -60,6 +62,29 @@ def varselect(data_xy, xy_config, args):

class VarSelect(object):

@staticmethod
def _validate(args):
schema = Schema({
'--xy-data': Or(
os.path.isfile, is_s3_url,
error='<xy_reference_csv> should exist and be readable.'),
'--config': Or(
os.path.isfile, is_s3_url,
error='--config should exist and be readable.'),
Optional('--output'): Or(
os.path.exists, is_s3_url,
error='--output should exist and be writable.'),
Optional('--yvar'): And(str, len),
Optional('--iteration'): And(str, len),
Optional('--criteria'): And(
str,
lambda s: s in ('ccr12', 'Wilkes', 'xi2', 'zeta2')
),
}, ignore_extra_keys=True)

args = schema.validate(args)
return args

def run(self, args):
# disable cloudwatch rlearn logging until it is prod ready
rlearn.logger_init(log_toAwslogs=False)
4 changes: 4 additions & 0 deletions learn/utils.py
@@ -0,0 +1,4 @@
from urlparse import urlparse

def is_s3_url(url):
return urlparse(url).scheme == 's3'
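
`is_s3_url` relies on the Python 2 `urlparse` module, which matches the Python 2.7 requirement of this project. As a hedged sketch, the same check can be written to import cleanly on Python 3 as well. `split_s3_url` is a hypothetical companion helper, not part of `learn.utils`, showing how the same parse result yields the bucket and key.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def is_s3_url(url):
    return urlparse(url).scheme == 's3'


def split_s3_url(url):
    """Return (bucket, key) for an s3:// URL. Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != 's3':
        raise ValueError('not an s3 URL: %r' % url)
    return parsed.netloc, parsed.path.lstrip('/')
```

For a URL such as `s3://bucket/varsel/out.csv`, the bucket comes from the network-location part and the key from the path with its leading slash removed.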
5 changes: 1 addition & 4 deletions setup.py
@@ -2,7 +2,7 @@

setup(
name='learn-cli',
version='0.1.1',
version='1.0.0',
description=u"Learn Model Builder",
classifiers=[],
keywords='',
@@ -19,9 +19,6 @@
'boto3',
'rpy2'
],
extras_require={
'test': ['pytest'],
},
entry_points={
'console_scripts': [
'learn=learn.cli:cli'