A flexible (but opinionated) toolkit for doing and sharing reproducible data science.
EasyData started life as an experimental fork of cookiecutter-data-science where we could try out ideas before proposing them as fixes to the upstream project. It has grown into its own toolkit for implementing a reproducible data science workflow, and is the basis of our Bus Number tutorial on Reproducible Data Science.
For a tutorial on making use of a previous version of this framework (available via the `bus_number` branch), visit https://github.com/hackalog/bus_number/.
Requirements:

- anaconda (or miniconda)
- python3.6+ (we use f-strings. So should you.)
- Cookiecutter Python package >= 1.4.0: this can be installed with pip or conda, depending on how you manage your Python packages:

```bash
$ pip install cookiecutter
```

or

```bash
$ conda config --add channels conda-forge
$ conda install cookiecutter
```
To start a new project, run:

```bash
cookiecutter https://github.com/hackalog/cookiecutter-easydata
```

Cookiecutter will then prompt you for a handful of project settings.
The directory structure of your new project looks like this:

```
├── LICENSE
├── Makefile            <- Top-level makefile. Type `make` for a list of valid commands.
├── README.md           <- This file.
├── catalog             <- Data catalog, where config information such as data sources
│                          and data transformations is saved.
├── data                <- Data directory. Often symlinked to a filesystem with lots of space.
│   ├── raw             <- Raw (immutable) hash-verified downloads. (See the hash-verification
│                          sketch below.)
│   ├── interim         <- Extracted and interim data representations.
│   └── processed       <- The final, canonical data sets for modeling.
├── docs                <- A default Sphinx project; see sphinx-doc.org for details.
├── models              <- Trained and serialized models, model predictions, or model summaries.
│   ├── trained         <- Trained models.
│   └── output          <- Predictions and transformations from the trained models.
├── notebooks           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                          the creator's initials, and a short `-`-delimited description,
│                          e.g. `1.0-jqp-initial-data-exploration`.
├── references          <- Data dictionaries, manuals, and all other explanatory materials.
├── reports             <- Generated analysis as HTML, PDF, LaTeX, etc.
│   ├── figures         <- Generated graphics and figures to be used in reporting.
│   ├── tables          <- Generated data tables to be used in reporting.
│   └── summary         <- Generated summary information to be used in reporting.
├── environment.yml     <- (If using conda) The YAML file for reproducing the analysis
│                          environment. (Example sketch below.)
├── setup.py            <- Turns the contents of MODULE_NAME into a pip-installable Python
│                          module (`pip install -e .`) so it can be imported in Python code.
│                          (Minimal sketch below.)
├── MODULE_NAME         <- Source code for use in this project.
│   ├── __init__.py     <- Makes MODULE_NAME a Python module.
│   ├── data            <- Scripts to fetch or generate data. In particular:
│   │   └── make_dataset.py  <- Run with `python -m MODULE_NAME.data.make_dataset fetch`
│                               or `python -m MODULE_NAME.data.make_dataset process`.
│                               (Simplified sketch below.)
│   ├── analysis        <- Scripts to turn datasets into output products.
│   └── models          <- Scripts to train models and then use trained models to make
│                          predictions, e.g. `predict_model.py`, `train_model.py`.
└── tox.ini             <- Tox file with settings for running tox; see tox.testrun.org.
```
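A note on those hash-verified downloads in `data/raw`: the idea is that every raw file's checksum is compared against a recorded value before the file is trusted. EasyData handles this for you when it fetches datasets; the snippet below is only a minimal sketch of the concept in plain Python, with a hypothetical file path and a placeholder digest.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical raw download and its recorded digest (placeholder value)
raw_file = Path("data/raw/example-dataset.csv")
expected_digest = "0123abc..."  # placeholder; a real digest is 64 hex characters

if sha256_of(raw_file) != expected_digest:
    raise ValueError(f"Hash mismatch for {raw_file}: download may be corrupt")
```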
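If you're wondering what `environment.yml` looks like, it follows the standard conda format. The example below is hypothetical and minimal (the project name and dependency list are placeholders); the file generated for your project will carry its own contents.

```yaml
name: my_project        # hypothetical project name
channels:
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - pip:
      - -e .            # install MODULE_NAME itself as an editable package
```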
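Likewise, `setup.py` is what makes `pip install -e .` work. At its core it is something like this minimal setuptools sketch, with a placeholder name and version (the generated file is more complete):

```python
from setuptools import setup, find_packages

setup(
    name="MODULE_NAME",   # replaced with your module name at project creation
    version="0.1.0",      # placeholder version
    packages=find_packages(),
)
```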
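Finally, the `fetch`/`process` interface of `make_dataset.py`: the generated script does the real work, but its command-line shape is roughly the simplified, hypothetical sketch below.

```python
import argparse

def fetch():
    """Download raw data into data/raw (placeholder)."""
    print("Fetching raw data...")

def process():
    """Turn raw data into processed datasets (placeholder)."""
    print("Processing data...")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Fetch or process project data")
    parser.add_argument("action", choices=["fetch", "process"])
    args = parser.parse_args()
    fetch() if args.action == "fetch" else process()
```

With a sketch like this, `python -m MODULE_NAME.data.make_dataset fetch` triggers the download step and `python -m MODULE_NAME.data.make_dataset process` the transformation step, matching the commands in the directory listing above.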
Once your project is created, use the Makefile to set up and manage its conda environment.

The first time:

```bash
make create_environment
git init
git add .
git commit -m "initial import"
git branch easydata   # tag for future easydata upgrades
```

Subsequent updates:

```bash
make update_environment
```

In case you need to delete the environment later:

```bash
conda deactivate
make delete_environment
```