================================================
      ____  _
     / __ \(_)___  ____ ____  ____  ___  _____
    / / / / / __ \/ __ `/ _ \/ __ \/ _ \/ ___/
   / /_/ / / /_/ / /_/ /  __/ / / /  __(__  )
  /_____/_/\____/\__, /\___/_/ /_/\___/____/
                /____/

================================================

Introduction

Diogenes is a Python library and workflow template for machine learning. Primarily, it wraps scikit-learn, providing enhanced functionality and a simplified interface for frequently used workflows.

Example

%matplotlib inline
import diogenes
import numpy as np

Get data from wine quality data set

data = diogenes.read.open_csv_url(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',
    delimiter=';')

Note that data is a NumPy structured array. We can use it like this:

data.dtype.names

('fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality')

print data.shape

(4898,)

print data['fixed acidity']

[ 7.   6.3  8.1 ...,  6.5  5.5  6. ]
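For readers new to structured arrays, here is a minimal sketch of the access patterns used above. It relies only on NumPy; the array toy and its values are made up for illustration and are not taken from the wine data set.

```python
import numpy as np

# Toy records with two made-up fields; the values are illustrative only.
toy = np.array(
    [(7.0, 6), (6.3, 6), (8.1, 5)],
    dtype=[('fixed acidity', 'f8'), ('quality', 'i8')])

names = toy.dtype.names          # field (column) names, in order
n_rows = toy.shape[0]            # a structured array is 1-D: one entry per row
column = toy['fixed acidity']    # indexing by name returns a whole column
first_row = toy[0]               # indexing by position returns a whole row

print(names)
print(n_rows)
print(column)
```

This is why data.shape above is (4898,) rather than (4898, 12): the columns live in the dtype, not in a second axis.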

We separate our labels from the rest of the data and turn them into binary classes: True for wines whose quality is below average, False otherwise.

labels = data['quality']
labels = labels < np.average(labels)
print labels

[False False False ..., False False False]
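The thresholding step can be checked on a handful of values. The sketch below uses made-up quality scores (not the real data) and shows that a score exactly equal to the average ends up in the False class, because the comparison is strict.

```python
import numpy as np

# Made-up quality scores, for illustration only.
quality = np.array([6, 6, 5, 7, 4, 8])

threshold = np.average(quality)   # mean of the scores
labels = quality < threshold      # True only for strictly below-average scores

print(threshold)
print(labels)
```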

Remove the labels from the rest of our data

M = diogenes.modify.remove_cols(data, 'quality')
print M.dtype.names

('fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol')
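For comparison, a similar result can be obtained with plain NumPy. This is a sketch of an analogue, not a description of how diogenes.modify.remove_cols is implemented; it uses numpy.lib.recfunctions.drop_fields on a made-up array.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up records standing in for the wine data.
toy = np.array(
    [(7.0, 3.0, 6), (6.3, 3.3, 5)],
    dtype=[('fixed acidity', 'f8'), ('pH', 'f8'), ('quality', 'i8')])

# drop_fields returns a new array without the named field;
# usemask=False asks for a plain ndarray instead of a masked array.
features = rfn.drop_fields(toy, 'quality', usemask=False)

print(features.dtype.names)
```

The original array is left untouched, so the label column can still be read from it afterwards.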

Print summary statistics for our features

diogenes.display.pprint_sa(diogenes.display.describe_cols(M))

            Column Name Count            Mean     Standard Dev Minimum Maximum
 0        fixed acidity  4898   6.85478766844   0.843782079126     3.8    14.2
 1     volatile acidity  4898  0.278241118824   0.100784258542    0.08     1.1
 2          citric acid  4898  0.334191506737    0.12100744957     0.0    1.66
 3       residual sugar  4898   6.39141486321    5.07153998933     0.6    65.8
 4            chlorides  4898 0.0457723560637  0.0218457376851   0.009   0.346
 5  free sulfur dioxide  4898   35.3080849326    17.0054011058     2.0   289.0
 6 total sulfur dioxide  4898   138.360657411    42.4937260248     9.0   440.0
 7              density  4898   0.99402737648 0.00299060158215 0.98711 1.03898
 8                   pH  4898   3.18826663944   0.150985184312    2.72    3.82
 9            sulphates  4898  0.489846876276   0.114114183106    0.22    1.08
10              alcohol  4898   10.5142670478    1.23049493654     8.0    14.2
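To make the columns of this table concrete, here is a rough sketch of how such per-column statistics could be computed with NumPy alone. The helper describe is hypothetical, the data is made up, and the standard deviation here is NumPy's default population form (ddof=0); whether describe_cols uses the same convention is an assumption.

```python
import numpy as np

# Made-up records, for illustration only.
toy = np.array(
    [(7.0, 3.0), (6.0, 3.2), (8.0, 3.4)],
    dtype=[('fixed acidity', 'f8'), ('pH', 'f8')])


def describe(sa):
    """Return one (name, count, mean, std, min, max) tuple per field."""
    rows = []
    for name in sa.dtype.names:
        col = sa[name]
        rows.append(
            (name, col.size, col.mean(), col.std(), col.min(), col.max()))
    return rows


summary = describe(toy)
for row in summary:
    print(row)
```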

Plot correlation between features

fig = diogenes.display.plot_correlation_matrix(M)
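The numbers behind such a plot can be sketched with NumPy. The example below assumes Pearson correlation, which is what np.corrcoef computes; the source does not say which measure plot_correlation_matrix uses, and the data is made up.

```python
import numpy as np

# Made-up records: 'b' is exactly twice 'a', 'c' tends to fall as 'a' rises.
toy = np.array(
    [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 6.0, 8.0), (4.0, 8.0, 2.0)],
    dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Stack the named columns into an ordinary 2-D array, one row per feature.
columns = np.vstack([toy[name] for name in toy.dtype.names])

# np.corrcoef treats each row as a variable and returns the matrix of
# pairwise Pearson correlation coefficients.
corr = np.corrcoef(columns)

print(corr)
```

Entry corr[i, j] is the correlation between feature i and feature j, so the matrix is symmetric with ones on the diagonal.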

Arrange an experiment trying different classifiers

exp = diogenes.grid_search.experiment.Experiment(
    M,
    labels,
    clfs=diogenes.grid_search.standard_clfs.std_clfs)

Make a PDF report

exp.make_report(verbose=False)

/Library/Python/2.7/site-packages/sklearn/svm/base.py:204: ConvergenceWarning: Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)
/Library/Python/2.7/site-packages/sklearn/svm/base.py:204: ConvergenceWarning: Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)

'/Users/zar1/dssg/diogenes/report.pdf'

Find the trial with the best score and make an ROC curve

trials_with_score = exp.average_score()
best_trial, best_score = max(trials_with_score.iteritems(), key=lambda trial_and_score: trial_and_score[1])
print best_trial
print best_score

Trial(clf=<class 'sklearn.ensemble.forest.RandomForestClassifier'>, clf_params={'n_estimators': 50, 'max_features': 'sqrt', 'n_jobs': 1, 'max_depth': 7}, subset=<class 'diogenes.grid_search.subset.SubsetNoSubset'>, subset_params={}, cv=<class 'sklearn.cross_validation.KFold'>, cv_params={})
0.756236767007
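The selection idiom above is ordinary Python and can be tried without running an experiment. The sketch below assumes, as the call to iteritems suggests, that average_score returns a mapping from trial to average score; the dictionary, its keys, and its scores are made up.

```python
# Hypothetical stand-in for the mapping returned by exp.average_score():
# keys identify trials, values are their average scores (made-up numbers).
trials_with_score = {
    'trial A': 0.75,
    'trial B': 0.70,
    'trial C': 0.62,
}

# max() compares the (trial, score) pairs by their second element.
# .items() is used here because it works in both Python 2 and Python 3,
# whereas iteritems() exists only in Python 2.
best_trial, best_score = max(
    trials_with_score.items(),
    key=lambda trial_and_score: trial_and_score[1])

print(best_trial)
print(best_score)
```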

fig = best_trial.roc_curve()
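As background on what an ROC curve plots, here is a minimal NumPy sketch that computes the curve's points from true labels and classifier scores. The function roc_points is hypothetical, the data is made up, and the sketch assumes no two samples share the same score; it does not describe how roc_curve is implemented in diogenes.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, starting at (0, 0), one point per sample.

    Assumes all scores are distinct.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)        # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)          # true positives among the top k
    fps = np.cumsum(~y_sorted)         # false positives among the top k
    tpr = np.concatenate([[0.0], tps / float(y_true.sum())])
    fpr = np.concatenate([[0.0], fps / float((~y_true).sum())])
    return fpr, tpr


# Made-up labels and scores, for illustration only.
fpr, tpr = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])

# Area under the curve by the trapezoid rule.
auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

print(fpr)
print(tpr)
print(auc)
```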

Installation

pip install git+https://github.com/dssg/diogenes.git

Required Packages

Python packages

Python 2.7
Numpy 1.10.1
scikit-learn
pandas
SQLAlchemy
pdfkit
matplotlib

Other packages

wkhtmltopdf

Next Steps

Check out the documentation

Provide feedback
