Python-based workflows for small-to-medium sized data: what works, what doesn't, and what can be improved
Leonardo Uieda¹, Santiago Soler²,³
¹Department of Earth, Ocean and Ecological Sciences, School of Environmental Sciences, University of Liverpool, UK
²CONICET, Argentina
³Instituto Geofísico Sismológico Volponi, UNSJ, Argentina
This is an invited talk in the first part of this session, which consists of 10-minute talks followed by a panel discussion.
| Information | |
|---|---|
| Abstract | U51B-03 |
| Session | U51B - Open Science in Action |
| When | Friday 17 December 2021, 14:00-15:00 UTC |
| Slides | compgeolab.org/agu2021 |
For over a decade, much attention has been devoted to "big data" and the challenges that arise from it. However, many scientists (ourselves included) still mostly deal with data of small to medium size. For the sake of this presentation, we will define this as "data that can fit in the memory of a modest computer" (roughly on the order of tens of gigabytes in 2021). Though perhaps not as exciting, there are still challenges to building reproducible research pipelines for analysing and modelling data of this scale:
- Binary data or files larger than a few tens of megabytes are difficult to manage in version control systems, which are built for code (i.e., plain text).
- Python tools tend to evolve at a fast pace, making analysis pipelines difficult to reuse and replicate without significant effort even one or two years later.
- Jupyter notebooks are extraordinary for interactive exploration but still sacrifice ease of collaborative development and workflow automation (though tools like nbflow and jupytext offer a glimpse of a better future; see the sketch after this list).
- Publicly available geophysical data and models are relatively common on the internet but often don't have open and permissive licenses or don't clearly indicate that they do.
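To illustrate the notebook point above, jupytext can pair a notebook with a plain-text script that is far easier to diff and review under version control. This is only a sketch with placeholder file names, not a workflow taken from the talk itself:

```python
# Sketch: convert a notebook to a percent-format Python script with jupytext's
# Python API (file names here are placeholders).
import jupytext

# Read the notebook and write an equivalent plain-text script that can be
# diffed, reviewed, and merged like regular code.
notebook = jupytext.read("analysis.ipynb")
jupytext.write(notebook, "analysis.py", fmt="py:percent")
```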
In this presentation, we will demonstrate the workflow that we have been establishing at the Computer-Oriented Geoscience Lab for building "repro-packs" for our papers and projects. We use a combination of virtual environments, data download and caching tools, notebooks, Makefiles, and data repositories to provide others with the means to reproduce and build upon our work. We will also share some of the unsolved challenges that we have encountered and our dreams for an ideal workflow.
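As one possible ingredient of such a repro-pack, the data download step could look like the sketch below, which uses Pooch (a Python download-and-caching tool) to fetch and cache a public dataset. The URL, file name, and hash are placeholders, not data from any of our projects:

```python
# Sketch of a download-and-cache step for a repro-pack, using Pooch.
# The URL, file name, and hash below are placeholders.
import pooch

path = pooch.retrieve(
    url="https://example.org/gravity-survey.csv",  # placeholder public dataset
    known_hash=None,  # replace with an "sha256:..." hash to verify integrity
    fname="gravity-survey.csv",
    path=pooch.os_cache("repro-pack-demo"),  # cached locally so reruns skip the download
)
print(path)  # local path to the cached file, ready to load with pandas, xarray, etc.
```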
This content is licensed under a Creative Commons Attribution 4.0 International License.