Proof-of-concept code for using daf_butler – the Rubin/LSST Gen3 Butler – and the PipelineTask framework in SPHEREx pipelines.
Butler organizes datasets (units of stored data) in data repositories, identifying them by a combination of dataset type, data id, and collection. It encapsulates all I/O done by pipeline code.
PipelineTask is a framework for writing and packaging algorithmic code that enables generating a pipeline execution plan, in the form of a directed acyclic graph (DAG). PipelineTask is built on top of Butler.
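As a minimal sketch of what this addressing model looks like in code (the repository path, dataset type, collection, and data id keys are illustrative, borrowed from the examples later in this document):

```python
# Minimal sketch of Butler-mediated I/O; all names are illustrative.
from lsst.daf.butler import Butler

butler = Butler("DATA", collections="rawexpr")
# A dataset is addressed by (dataset type, data id, collection), never by file path.
image = butler.get("raw", dataId={"exposure": 1, "detector": 1})
```

Because all I/O goes through `get`/`put`, algorithmic code never constructs file paths or opens files itself.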
LSST Butler is a Python 3-only package that provides the data access framework for the LSST Data Management team. The main function of such a framework is to organize the discrete entities of stored data to facilitate their search and retrieval.
LSST Butler abstracts:
- where data live in storage
  The data can live in a POSIX datastore, on Amazon S3, or elsewhere. Framework users deal with Python objects rather than stored representations of data entities.
- data format and how to deal with it
  Formatters are used to move between stored and Python representations of data entities. As a result, users deal with Python objects, such as an astropy Table or CCDData, instead of stored formats, such as VOTable, FITS, or HDF5.
- calibrations
  You ask for a calibration for an image, and Butler returns the right file. (Use case: a bias reference image, averaged over a few days. The algorithmic code does not need to be aware of how the bias reference is obtained.)
- Dataset is a discrete entity of stored data, uniquely identified by a Collection and DatasetRef.
- Collection is an entity that contains Datasets.
- DatasetRef is an identifier for a Dataset.
- Registry is a database that holds metadata and provenance for Datasets.
- Dimension is a concept used to organize and label Datasets. A Dimension is analogous to a coordinate axis in a coordinate space; a Dataset can then be viewed as a point in this space, with its position defined by a DataCoordinate (data id). For example, a SPHEREx raw image might be identified by the observation (pointing of the telescope at a particular time) and the detector array that took the image. For this reason, observation and detector might be good Dimensions to describe raw image Datasets.
- DatasetType is a named category of Datasets (e.g., raw image)
Together, DatasetType and DataCoordinate make a unique Dataset identifier; see DatasetRef.
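To tie these concepts together, here is a minimal sketch that registers a DatasetType spanning two Dimensions. It assumes a writeable repository whose dimension configuration defines exposure and detector; all names are illustrative.

```python
# Illustrative registration of a DatasetType built from two Dimensions.
# Assumes the repository's dimension configuration defines "exposure"
# and "detector", and that the storage class name exists in the
# repository's storage-class configuration.
from lsst.daf.butler import Butler, DatasetType

butler = Butler("DATA", writeable=True)
datasetType = DatasetType(
    "raw",                                # named category of Datasets
    dimensions=["exposure", "detector"],  # axes of the data id space
    storageClass="ImageF",                # Python representation
    universe=butler.registry.dimensions,
)
butler.registry.registerDatasetType(datasetType)
```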
- Custom Python representations of stored Datasets
  Verified that we can create a custom Formatter that maps the stored representation of a particular dataset type to its Python representation (using `butler.put()` to store a Python object into a file system datastore and `butler.get()` to retrieve the Python object from the stored file). A sketch of such a Formatter follows this list.
- Custom set of Dimensions
  Verified that we can organize our Datasets around a custom set of Dimensions; see Caveats below.
- File ingest (using `butler.ingest()`)
  Butler allows configuring file templates (`datastore.templates`), which can use the Collection name, the DatasetType, and any of the fields in the Dimensions tables to build the directory structure and file name of a Dataset's stored representation. Verified that data can be ingested into the datastore according to the defined template and transfer type (e.g., copy, symlink).
- Custom butler command
  It is possible to use the butler framework to create butler subcommands. Verified this capability by adding the `ingest-simulated` subcommand.
- Simple task and example pipeline
  Created the SubtractTask pipeline task, which accepts two images and subtracts the second from the first. Created an example pipeline that runs this task; see `pipelines/ExamplePipeline.yaml`.
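Below is a minimal sketch of the kind of custom Formatter referred to in the first item, not the PoC's actual implementation; the read/write/fileDescriptor API shown matches daf_butler as of late 2020 and may differ in other versions.

```python
# Minimal sketch of a custom Formatter (illustrative, not the PoC's code).
# A Formatter converts between a Dataset's stored form and its Python form.
import numpy as np
from lsst.daf.butler import Formatter


class NumpyTxtFormatter(Formatter):
    """Store a numpy array as a plain text file."""

    extension = ".txt"

    def read(self, component=None):
        # fileDescriptor.location points at the stored file.
        return np.loadtxt(self.fileDescriptor.location.path)

    def write(self, inMemoryDataset):
        np.savetxt(self.fileDescriptor.location.path, inMemoryDataset)
```

The mapping from dataset type (or storage class) to Formatter is part of the datastore configuration, so algorithmic code only ever sees the Python object.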
The proof-of-concept is designed around Python unit tests that run in a container on GitHub-hosted machines as part of GitHub's built-in continuous integration service; see `.github/workflows/unit_test.yaml`.
Unfortunately, pipeline tasks cannot be validated with GitHub Actions, because they rely on the `pipe_base` and `ctrl_mpexec` packages, which have deeper-rooted dependencies. Running the example pipeline requires installing the Rubin/LSST environment, where packages are managed with EUPS.
Dependency management is one of the main concerns when using the Rubin/LSST pipeline framework.
Butler allows overriding parts of its configuration; the overriding configuration is merged with the default configuration. As of November 2020, it is possible to completely override dimensions, but not to completely replace formatters and storage classes.
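As an illustration, a seed configuration of roughly the following shape can override the datastore file templates mentioned earlier. The fragment is hypothetical: the exact schema depends on the daf_butler version, and the template fields must name dimensions known to the repository.

```yaml
# Hypothetical butler.yaml seed-config fragment overriding the default
# datastore file template; {exposure} and {detector} must name actual
# dimensions, and ":?" marks a field as optional.
datastore:
  templates:
    default: "{run:/}/{datasetType}/{datasetType}_{exposure:?}_{detector:?}"
```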
There is an implied requirement in the `ctrl_mpexec` package that the `instrument` dimension table must have a reference to an instrument class.
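In the standard dimension configuration this reference is carried as a `class_name` metadata field on the `instrument` element, roughly as in the abridged sketch below (field lists vary between dimensions.yaml versions):

```yaml
# Abridged sketch of the "instrument" element in a dimensions.yaml,
# showing the class_name metadata field that the middleware uses to
# locate the instrument class.
elements:
  instrument:
    keys:
      - name: name
        type: string
        length: 16
    metadata:
      - name: class_name
        type: string
        length: 64
```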
Other known issues:
- Butler relies on `lsst.sphgeom`, a lower-level C++ library, which does not support HEALPix pixelization at the moment
To install the latest pipeline distribution `lsst_distrib` built by the Rubin/LSST project, follow the newinstall recipe:
# from an empty directory
curl -OL https://raw.githubusercontent.com/lsst/lsst/master/scripts/newinstall.sh
curl -OL https://raw.githubusercontent.com/lsst/lsst/master/scripts/newinstall.sh
# continue a previous failed install, if any, in batch mode, and prefer tarballs
bash newinstall.sh -cbt
source loadLSST.bash
# install weekly 46 for 2020
eups distrib install -t w_2020_46 lsst_distrib
# fix shebangs - tarballs have shebangs encoded at build time that need to be fixed at install time
curl -sSL https://raw.githubusercontent.com/lsst/shebangtron/master/shebangtron | python
# use with tag option if other versions installed: setup -t w_2020_46 lsst_distrib
setup lsst_distrib
You only need to rerun newinstall.sh when the conda base environment changes. Check the last modified date of conda-system.
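After setup, a quick sanity check (an illustrative one-liner, not part of the official recipe) confirms that the Butler package is importable:
python -c "from lsst.daf.butler import Butler; print(Butler)"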
If the newest weekly is installed without rerunning newinstall.sh, the previous versions can be removed with this script. The script will remove all packages except those locally set up and those with the given tag. Use the --dry-run option to avoid surprises:
pruneTags w_2020_44 --delete-untagged --dry-run
To run the example pipeline defined in this repository, follow these steps:
- Install the latest `lsst_distrib` (see above)
- Set up the spherex_butler_poc repository with the EUPS package manager:
git clone https://github.com/Caltech-IPAC/spherex_butler_poc.git
cd spherex_butler_poc
# set up the package in the eups stack
setup -r . -t $USER
# review set up packages (optional)
eups list -s
- Create a directory where the butler repository will live:
mkdir ../test_spherex
cd ../test_spherex
- Run the SPHEREx simulator to produce simulated files. The simulated files have the exposure and detector id embedded in the file names.
- Create an empty butler repository (DATA):
butler create --override --seed-config ../spherex_butler_poc/python/spherex/configs/butler.yaml --dimension-config ../spherex_butler_poc/python/spherex/configs/dimensions.yaml DATA
- Ingest simulated images:
butler ingest-simulated DATA /<abspath>/simulator_files
- Ingest simulated dark current images (the group is set to the ingest date, hence ingesting raw and dark images should be done on the same date):
butler ingest-simulated --regex dark_current.fits --ingest-type dark DATA /<abspath>/simulator_files
- Examine the butler database:
sqlite3 DATA/spherex.sqlite3
> .header on
> .tables
> select * from file_datastore_records;
> .exit
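Alternatively, the same records can be inspected through the Registry API rather than raw SQL; a minimal sketch, with the dataset type and collection names borrowed from the surrounding examples:

```python
# Inspect ingested datasets via the Registry API instead of raw SQL.
# Dataset type and collection names are illustrative.
from lsst.daf.butler import Butler

butler = Butler("DATA")
for ref in butler.registry.queryDatasets("raw", collections="rawexpr"):
    print(ref.datasetType.name, ref.dataId)
```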
- Create the pipeline execution plan as a qgraph.dot file:
pipetask qgraph -p ../spherex_butler_poc/pipelines/ExamplePipeline.yaml --qgraph-dot qgraph.dot -b DATA -i rawexpr,darkr -o subtractr
- Convert qgraph.dot into pdf (graphviz required):
dot -Tpdf qgraph.dot -o qgraph.pdf
- Run example pipeline:
pipetask run -p ../spherex_butler_poc/pipelines/ExamplePipeline.yaml -b DATA --register-dataset-types -i rawexpr,darkr -o subtractr
- Optionally: rerun, replacing (--replace-run) and removing (--prune-replaced=purge) the previous run:
pipetask run -p ../spherex_butler_poc/pipelines/ExamplePipeline.yaml -b DATA -o subtractr --replace-run --prune-replaced=purge
- Examine the butler repository in the DATA directory
- Explore the contents of the butler repository using command-line tools:
> butler query-collections DATA
> butler query-collections DATA --collection-type CHAINED
> butler query-collections DATA --flatten-chains subtractr
To exercise the Gen3 middleware end to end, the Rubin/LSST ci_hsc_gen3 integration test can be run in a Docker container:
- make sure you have the test data (a Git LFS repo) and test scripts:
> git clone https://github.com/lsst/testdata_ci_hsc
> git clone https://github.com/lsst/ci_hsc_gen3
- start up container:
> docker run -it -v `pwd`:/home/lsst/mnt docker.io/lsstsqre/centos:7-stack-lsst_distrib-w_latest
- in container:
$ source /opt/lsst/software/stack/loadLSST.bash
$ setup lsst_distrib
$ cd /home/lsst/mnt
$ setup -j -r testdata_ci_hsc
$ setup -j -r ci_hsc_gen3
$ echo $TESTDATA_CI_HSC_DIR; echo $CI_HSC_GEN3_DIR
$ cd ci_hsc_gen3
$ scons
$ sqlite3_analyzer /home/lsst/mnt/ci_hsc_gen3/DATA/gen3.sqlite3