Trevor Bekolay
University of Waterloo
bekolay.org/scipy2013-workflow
- Read literature
- Form hypothesis
- Try stuff out
- Produce a research artifact
- Recreatable
- Simple (the fewer tools the better)
- Fast (able to see the result of changes)
git clone https://github.com/tbekolay/jneurosci2013.git
python run.py <target>
Targets: download_data, convert, plot, combine, all, paper
Tip 1
- When you start making an artifact, make a new directory
Tip 2
- Use `virtualenv` and `virtualenvwrapper`
- Use `--no-site-packages` (the default now)
- `cd /project/dir && setvirtualenvproject`
- `pip install <package>`
- `pip freeze > requirements.txt`
Tip 3
- You can never totally get rid of duplicate code
- Consider making (`pip`-installable) Python packages
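A pip-installable package can be as small as one `setup.py` next to your code. A minimal sketch, assuming a hypothetical package: the name `myanalysis`, its metadata, and the `numpy` dependency are all placeholders, not from the talk.

```python
# setup.py -- minimal packaging sketch (all names are placeholders)
from setuptools import setup, find_packages

setup(
    name="myanalysis",  # hypothetical package name
    version="0.1.0",
    description="Analysis helpers shared across research projects",
    packages=find_packages(),
    install_requires=["numpy"],  # illustrative dependency
)
```

With this in place, `pip install .` (or an upload to PyPI) lets every project reuse the shared code instead of copying it.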
Tip 4
run.py usage
============
download_data -- Downloads data from figshare
convert       -- Converts any CSVs in data/ to HDF5
Requirements
============
- libpng (apt-get install libpng-dev)
- python
- pip
- Packages in requirements.txt
Tip 5
- `data`
- `figures`
- `paper`
- `plots`
- `scripts`
- `requirements.txt`
- `run.py`
- `README`
Tip 6
* Think of analysis as compression
* Going from big raw data to small important data
* If an analysis needs information from two sources, it's a meta-analysis
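The compression idea can be sketched in a few lines. This is a hypothetical example, not code from the talk: `analyze` squeezes a big list of raw samples down to a small summary, and `meta_analyze` is a meta-analysis because it needs information from several of those summaries at once.

```python
import statistics

def analyze(samples):
    """Compress big raw data into a small summary (mean and spread)."""
    return {"mean": statistics.mean(samples),
            "stdev": statistics.pstdev(samples)}

def meta_analyze(results):
    """Combine information from several analyses: a meta-analysis."""
    return statistics.mean(r["mean"] for r in results.values())

results = {
    "data/run1": analyze([1.0, 2.0, 3.0]),
    "data/run2": analyze([4.0, 6.0]),
}
print(meta_analyze(results))  # grand mean of the per-file means
```

Because each `analyze` call touches only one file, the per-file analyses stay decoupled and easy to cache or parallelize; only `meta_analyze` ties them together.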
Tip 7
- Like a makefile for your artifact
- Force yourself to put everything in here
- `subprocess` is your friend
from glob import glob

from scripts import analysis, plots, figures

if __name__ == '__main__':
    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)

    # Meta-analysis
    meta_result = analysis.meta_analyze(results)

    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)
    meta_plot = plots.make_meta_plot(meta_result)

    # Figures
    plot_file = next(iter(plot_files.values()))
    figures.make_figure(plot_file, meta_plot)
Tip 8
* run.py is all you should interact with
* Make command line arguments
for the various things you do with it
Bad!
SAVE_PLOTS = True
...
plot(data, save=SAVE_PLOTS)
> python run.py
> emacs run.py
# Change SAVE_PLOTS
> python run.py
Good!
import sys
SAVE_PLOTS = 'save_plots' in sys.argv
...
plot(data, save=SAVE_PLOTS)
> python run.py
> python run.py save_plots
Bonus tip: try [docopt](http://docopt.org/) for advanced cases
* Less effort each time, once you've added the argument
* If you need complex stuff, try docopt
Tip 9
- Profile first!
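Before parallelizing, confirm which step is actually slow. A minimal sketch using the standard library's `cProfile` and `pstats`; the `analyze_all` function is a hypothetical stand-in for the real per-file analysis loop.

```python
import cProfile
import io
import pstats

def analyze_all():
    # stand-in for the real per-file analysis loop
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
analyze_all()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # top 5 hotspots by cumulative time
```

If the profile shows the per-file analysis dominating, it is a good candidate for the parallel and cached versions below; if the time goes elsewhere, parallelizing the loop won't help.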
# Analysis
results = {}
for path in glob('data/*'):
results[path] = analysis.analyze(path)
> ipcluster start -n 5
from glob import glob

from IPython import parallel

rc = parallel.Client()
lview = rc.load_balanced_view()

results = {}
for path in glob('data/*'):
    asyncresult = lview.apply(analysis.analyze, path)
    results[path] = asyncresult
for path, asyncresult in results.items():
    results[path] = asyncresult.get()
# Plots
plot_files = {}
for path in results:
    result = results[path]
    plot_files[path] = plots.make_plot(result)
import os.path

plot_files = {}
for path in results:
    # data/file1.h5 => plots/file1.svg
    plot_path = ('plots/' +
                 os.path.splitext(os.path.basename(path))[0] + '.svg')
    if os.path.exists(plot_path):
        plot_files[path] = plot_path  # reuse the cached plot
    else:
        res = results[path]
        plot_files[path] = plots.make_plot(res)
Bonus tip: release cached analysis data if raw data is confidential
* Now we're not duplicating that effort
* You may be able to release cached analyses even if raw data is confidential
Tip 10
- Let Github or Bitbucket handle web stuff
- Papers should be changeable and forkable anyway
- Store source and artifacts separately
tbekolay/jneurosci2013 • data (figshare)
- [1:1 projects:directories](#/7)
- [Use `virtualenv`](#/8)
- [Put good stuff on PyPI](#/9)
- [Write a `README`](#/10)
- [Directory structure](#/11)
- [Decouple analysis](#/12)
- [Write a `run.py`](#/13)
- [Command line args](#/14)
- [Parallelize & cache](#/15)
- [Upload code/data](#/16)
* I hope these tips were helpful!
* My JNeurosci paper and this presentation are both on Github
* Please suggest improvements!