Invited presentation for the "Open Science in Action" session at AGU2021

compgeolab/agu2021

Python-based workflows for small-to-medium sized data: what works, what doesn't, and what can be improved

Leonardo Uieda (1), Santiago Soler (2, 3)

(1) Department of Earth, Ocean and Ecological Sciences, School of Environmental Sciences, University of Liverpool, UK
(2) CONICET, Argentina
(3) Instituto Geofísico Sismológico Volponi, UNSJ, Argentina

This is an invited talk for the first part of this session, with 10-minute talks followed by a panel discussion.

Information

Abstract: U51B-03
Session: U51B - Open Science in Action
When: Friday 17 December 2021, 14:00 - 15:00 UTC
Slides: compgeolab.org/agu2021

Abstract

For over a decade, a lot of attention has been devoted to "big data" and the challenges that arise from it. However, many scientists (ourselves included) still mostly deal with data of small to medium size. For the sake of this presentation, we will define this as "data that can fit in the memory of a modest computer" (roughly on the order of tens of gigabytes in 2021). Though perhaps not as exciting, there are still challenges to building reproducible research pipelines for analysing and modelling data of this scale:

  1. Binary data or files larger than a few tens of megabytes are difficult to manage in version control systems, which are built for code (i.e., plain text).
  2. Python tools tend to evolve at a fast pace, making analysis pipelines difficult to reuse and replicate without significant effort even one or two years later.
  3. Jupyter notebooks are extraordinary for interactive exploration but still come at the expense of collaborative development and workflow automation (though tools like nbflow and jupytext offer a glimpse of a better future).
  4. Publicly available geophysical data and models are relatively common on the internet but often don't have open and permissive licenses or don't clearly indicate that they do.

In this presentation, we will demonstrate the workflow that we have been establishing at the Computer-Oriented Geoscience Lab for building "repro-packs" for our papers and projects. We use a combination of virtual environments, data download and caching tools, notebooks, Makefiles, and data repositories to provide others with the means to reproduce and build upon our work. We will also share some of the unsolved challenges that we have encountered and our dreams for an ideal workflow.
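To make the "data download and caching" step concrete, here is a minimal sketch using only the Python standard library. It is not the tooling used in the talk: the function names `sha256_of` and `fetch`, the `.part` temporary suffix, and the parameter names are all hypothetical, chosen for illustration. The idea is that a data file lives outside version control, is downloaded once into a local cache, and is only trusted if its SHA256 checksum matches a value recorded alongside the code.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, fname, known_hash, cache_dir):
    """Return the path to a verified local copy of the file at `url`.

    Hypothetical helper: downloads only when the cache has no copy whose
    SHA256 matches `known_hash`.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / fname
    # Cache hit: a copy exists and its checksum matches, so skip the download.
    if target.exists() and sha256_of(target) == known_hash:
        return target
    # Download to a temporary name so a failed download never looks valid.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    # Refuse corrupted or silently updated data.
    if sha256_of(tmp) != known_hash:
        tmp.unlink()
        raise ValueError(f"SHA256 mismatch for {fname}")
    tmp.replace(target)
    return target
```

A notebook or Makefile target could then call `fetch(...)` at the start of an analysis and read from the returned path, so the same code works on a fresh machine and on one that already has the data cached.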

License

This content is licensed under a Creative Commons Attribution 4.0 International License.
