---
title: CSVConf 2021
theme: solarized
---
Mike Trizna
Smithsonian OCIO Data Science Lab
May 4, 2021 | csv,conf,v6
Yes, there are the museums (19 of them, mostly in Washington, DC), but we also have 21 libraries and archives, 9 research centers ... and a zoo.
Founded in 1846 from the bequest of Englishman James Smithson with the condition:
"under the name of the Smithsonian Institution, an establishment for the increase and diffusion of knowledge."
The Smithsonian has been increasing and diffusing Knowledge since 1846, but what about all of that Data?
All of that data and info that fed into knowledge, insight, and wisdom were dutifully cataloged and stored.
February 25, 2020
Of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections:
- 2.8 million 2-D and 3-D images
- Over 17 million collection metadata objects
Before February 2020, each Smithsonian museum and unit made its data searchable, and sometimes downloadable, through its own individual portal, each with different use agreements.
SI Open Access put all media and metadata in one place, and all Open Access media is CC0.
I will cover 3 different ways to access the data.
All 3 share metadata records in the same deeply-nested JSON structure.
🔗: http://edan.si.edu/openaccess/apidocs/
- API Key needed (but free and painless to register)
- Great for getting a feel for record structure
- Records are extensively indexed, but you can only search indexed fields.
- Row limit of 1000 per API call
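A minimal sketch of a search call against the API described above, using only the standard library. The endpoint URL, parameter names, and the example query term are assumptions based on the public API docs linked above; register for your own key and check the docs for the current details.

```python
import json
import os
import urllib.parse
import urllib.request

# Search endpoint (assumed from the public Open Access API docs;
# verify against http://edan.si.edu/openaccess/apidocs/)
SEARCH_URL = "https://api.si.edu/openaccess/api/v1.0/search"

def build_search_url(query, api_key, rows=10, start=0):
    """Build a search URL; the API caps `rows` at 1000 per call."""
    params = {
        "api_key": api_key,
        "q": query,
        "rows": min(rows, 1000),
        "start": start,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

url = build_search_url("mercury", os.environ.get("SI_API_KEY", ""))
# Only hit the network when a real key is configured
if os.environ.get("SI_API_KEY"):
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
        # Each entry is one deeply nested JSON metadata record
        records = result["response"]["rows"]
```

Clamping `rows` up front avoids surprises from the 1,000-row limit; for larger result sets, page through with the `start` parameter.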
A 2017 paper described building a machine learning model to detect herbarium sheets that had been stained with mercury.
I wanted to train a new model on the same dataset (2017 is ancient history in machine learning).
All of the training images are shared on Figshare, but the photos are resized, and I wanted the original metadata.
Unfortunately the "barcode" term from the supplementary materials is not an indexed field.
- Files are serialized as line-delimited JSON and compressed with bzip2.
- Directories are organized by owning unit and files are distributed by first two characters of content serialization hash.
If I'm looking across all units, that's 9,728 files to process!
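Reading one of these bulk files needs nothing beyond the standard library. This sketch writes a tiny synthetic file so it runs anywhere; the file name, record fields, and directory layout are illustrative, not the real bulk-download contents.

```python
import bz2
import json
from pathlib import Path

def read_oa_file(path):
    """Yield one metadata record per line from a bzip2-compressed
    line-delimited JSON file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo with a synthetic file (real files sit in per-unit directories,
# named by the first two characters of a content serialization hash)
sample = Path("00.txt.bz2")
records = [{"id": "edanmdm-demo-1", "title": "Sample sheet"},
           {"id": "edanmdm-demo-2", "title": "Another sheet"}]
with bz2.open(sample, mode="wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

parsed = list(read_oa_file(sample))
sample.unlink()  # clean up the demo file
```

The generator keeps memory flat no matter how large the file is, which matters when you are looping over thousands of these files.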
Dask lets you set up a mini cluster on your machine ... or on an actual compute cluster
Dask is better known for parallel processing of DataFrames, but it also contains a really useful catch-all "Bag" type.
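The Bag pattern for these files can be sketched as below, assuming `dask` is installed. The glob pattern and the `unitCode` field are assumptions for illustration; here the bag is built from an in-memory sequence so the sketch runs without the bulk download.

```python
import json
import dask.bag as db

# In practice you would read the bulk files directly, e.g.:
#   bag = db.read_text("metadata/*/??.txt.bz2").map(json.loads)
# Here we build a small in-memory bag so the sketch runs anywhere.
lines = [json.dumps({"id": i, "unitCode": code})
         for i, code in enumerate(["NMNHBOTANY", "NMAH", "NMNHBOTANY"])]
bag = db.from_sequence(lines, npartitions=2).map(json.loads)

# Operations are lazy until .compute(); Dask then runs the
# partitions in parallel across the local mini cluster.
botany = bag.filter(lambda r: r["unitCode"] == "NMNHBOTANY")
count = botany.count().compute()  # -> 2
```

Because each partition is processed independently, the same `map`/`filter` chain scales from a laptop to a real cluster without code changes.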
Full interactive notebook (through Binder) available at https://github.com/sidatasciencelab/siopenaccess.
https://github.com/MikeTrizna/CSVConf2021_siopenaccess
<style> .container{ display: flex; align-items: center; justify-content: center; } .col{ flex: 1; } </style>