Trevor Bekolay
University of Waterloo
bekolay.org/scipy2013-workflow
- Read literature
- Form hypothesis
- Try stuff out
- Produce a research artifact
- Recreatable
- Simple (the fewer tools the better)
- Fast (able to see the result of changes)
git clone https://github.com/tbekolay/jneurosci2013.git
python run.py <target>
Targets: download_data, convert, plot, combine, all, paper
Tip 1
- When you start making an artifact, make a new directory
Tip 2
- Use `virtualenv` and `virtualenvwrapper`
- Use `--no-site-packages` (the default now)
- `cd /project/dir && setvirtualenvproject`
- `pip install <package>`
- `pip freeze > requirements.txt`
Tip 3
- You can never totally get rid of duplicate code
- Consider making (`pip`-installable) Python packages
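A pip-installable package can be as small as one `setup.py` next to your code. A minimal sketch, assuming a hypothetical package: the name `myanalysis`, its metadata, and the `numpy` dependency are all placeholders, not from the talk.

```python
# setup.py -- minimal packaging sketch (all names are placeholders)
from setuptools import setup, find_packages

setup(
    name="myanalysis",  # hypothetical package name
    version="0.1.0",
    description="Analysis helpers shared across research projects",
    packages=find_packages(),
    install_requires=["numpy"],  # illustrative dependency
)
```

With this in place, `pip install .` (or an upload to PyPI) lets every project reuse the shared code instead of copying it.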
Tip 4
run.py usage
============
download_data -- Downloads data from figshare
convert       -- Converts any CSVs in data/ to HDF5
Requirements
============
- libpng (apt-get install libpng-dev)
- python
- pip
- Packages in requirements.txt
Tip 5
- `data`
- `figures`
- `paper`
- `plots`
- `scripts`
- `requirements.txt`
- `run.py`
- `README`
Tip 6
* Think of analysis as compression
* Going from big raw data to small important data
* If an analysis needs information from two sources, it's a meta-analysis
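The compression idea can be sketched in a few lines. This is a hypothetical example, not code from the talk: `analyze` squeezes a big list of raw samples down to a small summary, and `meta_analyze` is a meta-analysis because it needs information from several of those summaries at once.

```python
import statistics

def analyze(samples):
    """Compress big raw data into a small summary (mean and spread)."""
    return {"mean": statistics.mean(samples),
            "stdev": statistics.pstdev(samples)}

def meta_analyze(results):
    """Combine information from several analyses: a meta-analysis."""
    return statistics.mean(r["mean"] for r in results.values())

results = {
    "data/run1": analyze([1.0, 2.0, 3.0]),
    "data/run2": analyze([4.0, 6.0]),
}
print(meta_analyze(results))  # grand mean of the per-file means
```

Because each `analyze` call touches only one file, the per-file analyses stay decoupled and easy to cache or parallelize; only `meta_analyze` ties them together.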
Tip 7
- Like a makefile for your artifact
- Force yourself to put everything in here
- `subprocess` is your friend
from glob import glob

from scripts import analysis, plots, figures

if __name__ == '__main__':
    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)

    # Meta-analysis
    meta_result = analysis.meta_analyze(results)

    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)
    meta_plot = plots.make_meta_plot(meta_result)

    # Figures
    plot_file = next(iter(plot_files.values()))
    figures.make_figure(plot_file, meta_plot)
Tip 8
* run.py is all you should interact with
* Make command line arguments
for the various things you do with it
Bad!
SAVE_PLOTS = True
...
plot(data, save=SAVE_PLOTS)
> python run.py
> emacs run.py
# Change SAVE_PLOTS
> python run.py
Good!
import sys
SAVE_PLOTS = 'save_plots' in sys.argv
...
plot(data, save=SAVE_PLOTS)
> python run.py
> python run.py save_plots
Bonus tip: try [docopt](http://docopt.org/) for advanced cases
* Less effort each time, once you've added the argument
* If you need complex stuff, try docopt
Tip 9
- Profile first!
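Before parallelizing, confirm which step is actually slow. A minimal sketch using the standard library's `cProfile` and `pstats`; the `analyze_all` function is a hypothetical stand-in for the real per-file analysis loop.

```python
import cProfile
import io
import pstats

def analyze_all():
    # stand-in for the real per-file analysis loop
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
analyze_all()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # top 5 hotspots by cumulative time
```

If the profile shows the per-file analysis dominating, it is a good candidate for the parallel and cached versions below; if the time goes elsewhere, parallelizing the loop won't help.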
# Analysis
results = {}
for path in glob('data/*'):
results[path] = analysis.analyze(path)
> ipcluster start -n 5
from glob import glob

from IPython import parallel

rc = parallel.Client()
lview = rc.load_balanced_view()

results = {}
for path in glob('data/*'):
    asyncresult = lview.apply(analysis.analyze, path)
    results[path] = asyncresult
for path, asyncresult in results.items():
    results[path] = asyncresult.get()
# Plots
plot_files = {}
for path in results:
    result = results[path]
    plot_files[path] = plots.make_plot(result)
import os.path

plot_files = {}
for path in results:
    # data/file1.h5 => plots/file1.svg
    plot_path = ('plots/' +
                 os.path.splitext(os.path.basename(path))[0] + '.svg')
    if os.path.exists(plot_path):
        plot_files[path] = plot_path  # reuse the cached plot
    else:
        res = results[path]
        plot_files[path] = plots.make_plot(res)
Bonus tip: release cached analysis data if raw data is confidential
* Now we're not duplicating that effort
* You may be able to release cached analyses even if raw data is confidential
Tip 10
- Let Github or Bitbucket handle web stuff
- Papers should be changeable and forkable anyway
- Store source and artifacts separately
tbekolay/jneurosci2013 • data (figshare)
- [1:1 projects:directories](#/7)
- [Use `virtualenv`](#/8)
- [Put good stuff on PyPI](#/9)
- [Write a `README`](#/10)
- [Directory structure](#/11)
- [Decouple analysis](#/12)
- [Write a `run.py`](#/13)
- [Command line args](#/14)
- [Parallelize & cache](#/15)
- [Upload code/data](#/16)
* I hope these tips were helpful!
* My JNeurosci paper and this presentation are both on Github
* Please suggest improvements!