An efficient workflow for reproducible science

Trevor Bekolay
University of Waterloo
bekolay.org/scipy2013-workflow

* PhD student at the University of Waterloo
* I want to talk mostly about how I currently work so that my science is reproducible
* No overarching tool, just a set of tips
* What I wish I'd known when I started my PhD

Recreatability + openness

⇒ reproducibility

* Reproducibility is the goal, but in order to do that we first need recreatability
* Reproducible = read a paper; given that text, independently set up and do the experiment
* Recreatable = given access to everything that went into producing a paper, recreate it
* Recreatability should be a given, but it's not; it's hard
* Difficulty with recreatability is understandable, but inexcusable
* My thought: make your own work recreatable, release it in the open, and reproducibility will follow
  1. Read literature
  2. Form hypothesis
  3. Try stuff out
  4. Produce a research artifact
* A scientist does a lot of things
* This talk is focused on this last part, producing an artifact to be consumed by others
* We don't talk about this part enough
* You may completely disagree with me
* That's great, but provide alternatives

Ideal workflow

  1. Recreatable
  2. Simple (the fewer tools the better)
  3. Fast (able to see the result of changes)
* What does the ideal workflow look like?
* Recreatability is number 1
* Sometimes comes at the expense of simplicity or speed
* These three are all in conflict
* How to get all three of these, or at least close?
git clone
  https://github.com/tbekolay/jneurosci2013.git
python run.py
  download_data convert plot combine all paper

* This is what I've done now
* This isn't just a complicated figure, it's a whole 22-page paper with multiple complicated figures
* Here's what I learned in getting to that point

1 project : 1 directory

Tip 1
  • When you start making an artifact, make a new directory
* People consume your research as the artifact
* Only include what you did to make that artifact
* There will be some duplication, but so what
* Also means you can put this in version control
* The sooner the better!

Use virtualenv

Tip 2
and `virtualenvwrapper`
  1. Use --no-site-packages (the default now)
  2. cd /project/dir && setvirtualenvproject
  3. pip install <package>
  4. pip freeze > requirements.txt
* Wish I had more time to talk about virtualenv!
* Trust me: it's worth learning
* Install new packages on a whim
* When you're done, pip freeze to make a requirements.txt

Make packages from duplicate code

Tip 3
  • You can never totally get rid of duplicate code
  • Consider making (pip installable) Python packages
* Give up on having absolutely no duplicate code
* Kind of nice to see your progress anyhow
* If you find yourself repeating a ton of code, you're probably doing something novel
* Put it on PyPI
* PyPI has a lot of crap on it; yours will be fine
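As a rough sketch of what that takes (the package name and metadata below are hypothetical, not from the author's code), a minimal setup.py is usually enough to make shared code pip installable:

    # setup.py -- minimal packaging sketch; name and dependencies are hypothetical
    from setuptools import setup, find_packages

    setup(
        name='mylabtools',
        version='0.1.0',
        packages=find_packages(),
        install_requires=['numpy'],
    )

With that in place, `pip install .` installs it into the current virtualenv, and the same package can later be uploaded to PyPI.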

Put forgettables in a README

Tip 4
run.py usage
============
download_data -- Downloads data from figshare
convert -- Convert any CSVs in data/ to HDF5

Requirements
============
- libpng (apt-get install libpng-dev)
- python
- pip
- Packages in requirements.txt
* README should contain anything you're worried about forgetting
* Write it for yourself

Directory structure

Tip 5
  • `data`
  • `figures`
  • `paper`
  • `plots`
  • `scripts`
  • `requirements.txt`
  • `run.py`
  • `README`
* This is (roughly) how a paper gets made
* Our directory structure should reflect this
* Subdirectories should be clear

Decouple analysis

Tip 6
* Think of analysis as compression
* Going from big raw data to small important data
* If an analysis needs information from two sources, it's a meta-analysis
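A minimal sketch of that shape (the file layout and dataset name are hypothetical): each analysis touches exactly one raw file and returns a small summary, and only the meta-analysis ever sees more than one result.

    # scripts/analysis.py -- sketch of decoupled analysis (names hypothetical)
    import h5py
    import numpy as np

    def analyze(path):
        """Compress one big raw data file into a small summary."""
        with h5py.File(path, 'r') as f:
            raw = f['data'][:]  # 'data' is a hypothetical dataset name
        return {'mean': np.mean(raw), 'std': np.std(raw)}

    def meta_analyze(results):
        """The only place where information from several files meets."""
        return np.mean([r['mean'] for r in results.values()])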

Do everything with run.py

Tip 7
  • Like a makefile for your artifact
  • Force yourself to put everything in here
    • subprocess is your friend
* run.py contains the logic to do everything
* I mean everything!!
* Force yourself to put everything in there (even non-Python steps; see the subprocess sketch after the skeleton below)
* It's easy to forget which terminal command you used when you need to do paper revisions
import os
from glob import glob

from scripts import analysis, plots, figures
if __name__ == '__main__':

    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)

    # Meta-analysis
    meta_result = analysis.meta_analyze(results)
* Skeleton example
* scripts/ has analysis.py, plots.py, figures.py
    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)

    meta_plot = plots.make_meta_plot(meta_result)

    # Figures
    plot_file = plot_files.values()[0]
    figures.make_figure(plot_file, meta_plot)
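Because "everything" includes steps that aren't Python, run.py can shell out too. Here is a hedged sketch of what a paper-building step might look like (the pdflatex command and paths are illustrative, not taken from the actual run.py):

    # Paper -- drive non-Python tools from run.py as well
    import subprocess

    subprocess.check_call(['pdflatex', '-output-directory', 'paper',
                           'paper/paper.tex'])

check_call raises an exception if the command fails, so a broken build step can't silently slip by.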

Use command line arguments

Tip 8
* run.py is all you should interact with
* Make command line arguments for the various things you do with it

Bad!

    SAVE_PLOTS = True
    ...
    plot(data, save=SAVE_PLOTS)
> python run.py
> emacs run.py
# Change SAVE_PLOTS
> python run.py
* This was something I used to do a lot
* Every time you open up an editor, you're expending mental energy

Good!

    SAVE_PLOTS = 'save_plots' in sys.argv
    ...
    plot(data, save=SAVE_PLOTS)
> python run.py
> python run.py save_plots
Bonus tip: try [docopt](http://docopt.org/) for advanced cases
* Less energy, after you make the argument
* If you need complex stuff, try docopt
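For illustration, a hedged sketch of what run.py's interface could look like with docopt (the usage string and flag name are made up for this example):

    """Usage: run.py [<target>...] [--save-plots]

    Options:
      --save-plots  Save plots instead of showing them.
    """
    from docopt import docopt

    args = docopt(__doc__)             # parses sys.argv against the docstring
    SAVE_PLOTS = args['--save-plots']  # True if the flag was passed
    targets = args['<target>']         # e.g. ['convert', 'plot']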

Parallelize & cache

Tip 9
  • Profile first!
* You may not actually have expensive steps
* But if you do, you can speed them up easily
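One way to check is the standard library's cProfile; a quick sketch (the analyze call and file name are hypothetical):

    import cProfile
    import pstats

    # Profile a single analysis run and show the 10 most expensive calls
    cProfile.run("analysis.analyze('data/file1.h5')", 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)

`python -m cProfile run.py` works for profiling the whole pipeline at once.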
    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)
* Here's our analysis snippet from before
> ipcluster start -n 5
    from IPython import parallel

    rc = parallel.Client()
    lview = rc.load_balanced_view()

    results = {}
    for path in glob('data/*'):
        asyncresult = lview.apply(analysis.analyze, path)
        results[path] = asyncresult

    for path, asyncresult in results.iteritems():
        results[path] = asyncresult.get()
* In just a handful of extra lines, this is now done in parallel (with IPython.parallel)
    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)
* Here's our plot snippet from before
    plot_files = {}
    for path in results:
        # data/file1.h5 => plots/file1.svg
        plot_path = 'plots/' + os.path.splitext(
            os.path.basename(path))[0] + ".svg"

        if os.path.exists(plot_path):
            plot_files[path] = plot_path
        else:
            res = results[path]
            plot_files[path] = plots.make_plot(res)
Bonus tip: release cached analysis data if raw data is confidential
* Now we're not duplicating that effort
* You may be able to release cached analyses even if the raw data is confidential

Put it all online

Tip 10
  • Let Github or Bitbucket handle web stuff
    • Papers should be changeable and forkable anyway
  • Store source and artifacts separately
* Online repositories are more reliable than your computer
* You can keep this private, but please consider making it public

tbekolay/jneurosci2013data (figshare)


  1. [1:1 projects:directories](#/7)
  2. [Use `virtualenv`](#/8)
  3. [Put good stuff on PyPI](#/9)
  4. [Write a `README`](#/10)
  5. [Directory structure](#/11)
  6. [Decouple analysis](#/12)
  7. [Write a `run.py`](#/13)
  8. [Command line args](#/14)
  9. [Parallelize & cache](#/15)
  10. [Upload code/data](#/16)

bekolay.org/scipy2013-workflow
[email protected]

* I hope these tips were helpful!
* My JNeuroscience paper and this presentation are both on Github
* Please suggest improvements!