Python-based workflows for small-to-medium sized data: what works, what doesn't, and what can be improved
Leonardo Uieda¹, Santiago Soler²,³
¹Department of Earth, Ocean and Ecological Sciences, School of Environmental Sciences, University of Liverpool, UK
²CONICET, Argentina
³Instituto Geofísico Sismológico Volponi, UNSJ, Argentina
This is an invited talk in the first part of this session, which consists of 10-minute talks followed by a panel discussion.
| Information | |
|---|---|
| Abstract | U51B-03 |
| Session | U51B - Open Science in Action |
| When | Friday 17 December 2021, 14:00-15:00 UTC |
| Slides | compgeolab.org/agu2021 |
For over a decade, much attention has been devoted to "big data" and the challenges that arise from it. However, many scientists (ourselves included) still mostly deal with data of small to medium size. For the sake of this presentation, we will define this as "data that can fit in the memory of a modest computer" (roughly on the order of tens of gigabytes in 2021). Though perhaps not as exciting, there are still challenges to building reproducible research pipelines for analysing and modelling data of this scale:
- Binary data or files larger than a few tens of megabytes are difficult to manage in version control systems, which are built for code (i.e., plain text).
- Python tools tend to evolve at a fast pace, making analysis pipelines difficult to reuse and replicate without significant effort even one or two years later.
- Jupyter notebooks are extraordinary for interactive exploration but still sacrifice ease of collaborative development and workflow automation (though tools like nbflow and jupytext offer a glimpse of a better future; see the sketch after this list).
- Publicly available geophysical data and models are relatively common on the internet but often don't have open and permissive licenses or don't clearly indicate that they do.
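To illustrate the notebook point above, jupytext can pair a notebook with a plain-text script that is far easier to diff and review under version control. This is only a sketch with placeholder file names, not a workflow taken from the talk itself:

```python
# Sketch: convert a notebook to a percent-format Python script with jupytext's
# Python API (file names here are placeholders).
import jupytext

# Read the notebook and write an equivalent plain-text script that can be
# diffed, reviewed, and merged like regular code.
notebook = jupytext.read("analysis.ipynb")
jupytext.write(notebook, "analysis.py", fmt="py:percent")
```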
In this presentation, we will demonstrate the workflow that we have been establishing at the Computer-Oriented Geoscience Lab for building "repro-packs" for our papers and projects. We use a combination of virtual environments, data download and caching tools, notebooks, Makefiles, and data repositories to provide others with the means to reproduce and build upon our work. We will also share some of the unsolved challenges that we have encountered and our dreams for an ideal workflow.
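As one possible ingredient of such a repro-pack, the data download step could look like the sketch below, which uses Pooch (a Python download-and-caching tool) to fetch and cache a public dataset. The URL, file name, and hash are placeholders, not data from any of our projects:

```python
# Sketch of a download-and-cache step for a repro-pack, using Pooch.
# The URL, file name, and hash below are placeholders.
import pooch

path = pooch.retrieve(
    url="https://example.org/gravity-survey.csv",  # placeholder public dataset
    known_hash=None,  # replace with an "sha256:..." hash to verify integrity
    fname="gravity-survey.csv",
    path=pooch.os_cache("repro-pack-demo"),  # cached locally so reruns skip the download
)
print(path)  # local path to the cached file, ready to load with pandas, xarray, etc.
```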
This content is licensed under a Creative Commons Attribution 4.0 International License.