DM-45371: technote describing possible cell-based coadd file layouts #1

Open · wants to merge 14 commits into base: main · Changes from 3 commits
5 changes: 4 additions & 1 deletion README.rst
@@ -10,7 +10,10 @@ File Formats and Layouts for Cell-based Coadds
DMTN-294
========

Rubin's deep coadds will be built on a grid of small cells, in which each cell has an approximately constant PSF.
Cells will have "inner regions" that can be stitched together to form the full coadd, but they will also have outer regions that overlap (neighboring cells will have their own versions of some of the same pixels), in order to allow convolutions and other operations that require padding to be performed rigorously cell by cell.
This creates a problem for how to store a coadd in an on-disk FITS file: we want a layout that can be easily interpreted by third-party readers, but we also need to support compression and efficient subimage reads of at least the inner cell region.
This technical note will summarize various possibilities and their advantages and disadvantages.

**Links:**

144 changes: 142 additions & 2 deletions index.rst
@@ -4,9 +4,149 @@ File Formats and Layouts for Cell-based Coadds

.. abstract::

Rubin's deep coadds will be built on a grid of small cells, in which each cell has an approximately constant PSF.
Cells will have "inner regions" that can be stitched together to form the full coadd, but they will also have outer regions that overlap (neighboring cells will have their own versions of some of the same pixels), in order to allow convolutions and other operations that require padding to be performed rigorously cell by cell.
This creates a problem for how to store a coadd in an on-disk FITS file: we want a layout that can be easily interpreted by third-party readers, but we also need to support compression and efficient subimage reads of at least the inner cell region.
This technical note will summarize various possibilities and their advantages and disadvantages.

Goals and Requirements
======================

Rubin's cell-based coadds will need to store five or more image planes that share a single coordinate system and pixel grid:

- a floating point main data image;
- an integer bitmask;
- a floating point variance plane, with per-pixel variance estimates that include photon noise from sources;
- a floating point "interpolation fraction" image that records fractional missing data in each pixel;
- at least one floating point Monte Carlo noise realization.

It will need to store at least one PSF model image for each cell; these will be smaller than the other per-cell images but the grid will have the same shape.
To account for chromatic PSF effects we may also store images that correspond to the derivatives of the PSF with respect to some proxy for object SED, in at least some bands.
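
For orientation, here is a purely illustrative sketch of the per-cell payload implied by the lists above; the class and field names are hypothetical, not a proposed schema.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Optional

    import numpy as np


    @dataclass
    class CellImages:
        """Hypothetical in-memory view of one cell's image planes."""

        image: np.ndarray                     # float32, outer-cell shape
        mask: np.ndarray                      # integer bitmask, same shape
        variance: np.ndarray                  # float32, includes photon noise from sources
        interpolation_fraction: np.ndarray    # float32, fractional missing data per pixel
        noise_realizations: List[np.ndarray]  # one or more float32 Monte Carlo planes
        psf: np.ndarray                       # smaller float32 PSF model image
        psf_sed_derivative: Optional[np.ndarray] = None  # optional chromatic PSF term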

In addition, we'll need to store considerable structured metadata, including a WCS, information about the grid structure, coadded aperture corrections and wavelength-dependent throughput, the visit and detector IDs of the epochs that contributed to each cell, and additional information about the pixel uncertainty (some measure of typical covariance, and an approximately constant variance that either averages or does not include photon noise from sources).

We will assume throughout this note that we're writing FITS files: regardless of any other format Rubin might support, we have to support FITS as well, so it's just more work to do anything else.

These files will be written one per "patch", which we assume to be an approximately 4k x 4k image, divided into cells with inner regions of approximately 150x150 pixels and 50 pixels of padding on all sides.
We will only consider layouts that save the entire coadd, including the outer cell regions.
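
For scale, a back-of-the-envelope calculation using the approximate numbers above (all values are rough):

.. code-block:: python

    inner, border = 150, 50
    outer = inner + 2 * border                 # 250-pixel outer cell
    cells_per_side = 4000 // inner             # ~26 cells across a ~4k patch
    n_cells = cells_per_side ** 2              # ~676 cells per patch
    plane_mib = n_cells * outer ** 2 * 4 / 2 ** 20
    print(outer, n_cells, round(plane_mib))    # 250 676 161 (MiB per uncompressed float32 plane)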

We need these files to be readable over both POSIX filesystems and S3 and WebDAV object stores, and we need to be able to read subimages efficiently over all of these storage systems (i.e. we cannot afford to read the entire file just to read a subimage).
Writing to just POSIX filesystems is adequate, as there's no problem with writing to local temporary storage and then uploading separately in our pipeline architecture.
We do not expect to need the ability to extend existing files.

We expect to compress all image planes with at least lossless compression, and would like to have the capability to perform lossy compression on some planes as well.
If we do lossy compression, we would quite likely want to do our own quantization (our `afw` library already has code for this that we prefer over CFITSIO's).
DESC has expressed an interest in applying lossy compression to any coadd files they transfer to NERSC, if we have not done so already.

Constraints from FITS
=====================

Because we need to be able to read subimages efficiently, file-level compression is not an option: our only options are FITS image "tile compression" and perhaps the lesser-known binary table compression introduced in FITS 4.0.
Combined with our need to do subimage reads over object stores, this already puts us beyond what third-party FITS libraries can do:

- CFITSIO can do image tile compression, including efficient subimage reads that take advantage of the tile structure, but only on POSIX filesystems (or blocks of contiguous memory that correspond to the on-disk layout).
It can also do subset reads of FITS binary tables, again only on POSIX filesystems.
This imposes the same limitation on its Python bindings in the ``fitsio`` package.

- Astropy can do image tile compression with efficient subimage reads on POSIX filesystems, and it can do uncompressed reads (including efficient subimage reads) against both POSIX and object stores via `fsspec`, but it delegates to vendored CFITSIO code for its tile compression code and does not support the combination of these two.
Astropy cannot do subset reads of FITS binary tables, even on POSIX filesystems.

- I am not aware of any library that can do FITS binary table compression at all.
Comment:

I think in fact CFITSIO (C) and nom.tam.fits (Java) both support it. It appears that the CFITSIO documentation is behind the code.

Note that CFITSIO was only recently formally released as open-source code on GitHub (there were semi-bootleg versions of it previously) and people may not be entirely familiar with the latest version.

Some links:

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/fitsio.h#L2069-L2070

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/imcompress.c#L8704

Comment:

Currently lossy table compression is not supported.

The permitted algorithms for compressing BINTABLE columns are 'RICE_1', 'GZIP_1', and 'GZIP_2' (plus 'NOCOMPRESS'), which are lossless and are described in Sect. 10.4. Lossy compression could be allowed in the future once a process is defined to preserve the details of the compression.

Comment (Member Author):

Fixed.


As a result, I expect us to need to write some low-level code of our own to do subimage reads over object stores, though this is not specific to cell-based coadds: we need this for all of our image data products.
Whether to contribute more general code upstream (probably to Astropy) or focus only on reading the specific files we write ourselves is an important open question; the general solution would be very useful to the community, but its scope is unquestionably larger.
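
For reference, this is roughly the kind of read Astropy can already do today for *uncompressed* images over an object store (the URI and pixel ranges are illustrative); the missing piece is doing the same through a tile-compressed HDU:

.. code-block:: python

    from astropy.io import fits

    # Requires astropy>=5.2 plus fsspec/s3fs; only the bytes needed for the
    # requested section are fetched from the object store.
    with fits.open("s3://some-bucket/coadd.fits", use_fsspec=True,
                   fsspec_kwargs={"anon": True}) as hdul:
        cutout = hdul[1].section[1000:1250, 2000:2250]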

Finally, the WCS of any single cell (or the inner-cell stitched image) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).

Comment:

To be very clear, the WCS of a single cell is simply that of the tract. The layout has only one WCS and each cell is simply some subimage within that WCS. This feature makes the overlap pixels literally the same geometrically, means that the images can be stitched together correctly, etc.

Comment:

That's conceptually true but if the cells are stored independently of each other in the FITS file, they do technically have separate WCSes and will at least need different CRPIXn values in order to participate in the tract-global WCS correctly.

(Of course the overlap pixels don't have the same values, since there's a different input image selection for each cell.)

An additional FITS WCS could be used to describe the position of a single-cell image within a larger grid.

Comment:

This position is simply a pixel location and the dimensions of the image.

Comment:

I think Jim is referring to the Rubin convention of having an "alternative WCS" (we usually use WCS A) that provides the mapping from local FITS 1-based pixel coordinates to tract-level 0-based coordinates.
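
As a concrete illustration of that convention (made-up numbers, and not necessarily the final keyword choices), the alternate "A" WCS for one cell might look like:

.. code-block:: python

    from astropy.io import fits

    x0, y0 = 1500, 750      # hypothetical origin of this cell's outer region in tract pixels
    hdr = fits.Header()
    hdr["CTYPE1A"] = "LINEAR"
    hdr["CTYPE2A"] = "LINEAR"
    hdr["CUNIT1A"] = "PIXEL"
    hdr["CUNIT2A"] = "PIXEL"
    hdr["CRPIX1A"] = 1.0    # local FITS pixel (1, 1)...
    hdr["CRPIX2A"] = 1.0
    hdr["CRVAL1A"] = x0     # ...maps to tract pixel (x0, y0)
    hdr["CRVAL2A"] = y0
    # The primary (celestial) WCS would be the shared tract projection, with
    # CRPIX1/CRPIX2 shifted by the cell's offset so that every cell participates
    # in the same tract-level WCS.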

It is highly desirable that any images we store in our FITS files have both to allow third-party FITS readers to fully interpret them.
Comment:

Add:
"In some cases we might be using corners of the current FITS standard (v4.00) that existing community tools do not yet fully implement. We consider this preferable to using entirely Rubin-unique formats and conventions, and we are prepared to support the community to some extent in filling out these corners (e.g., by adding support to Astropy and Firefly).

We are aware of other projects' interest in FITS representations for large numbers of small-ish images with associated metadata and will keep them informed of what we are planning."

Comment (Member Author):

I'm a little uncomfortable with the parenthetical, as I don't think we'll actually have the effort to add support to at least Astropy on the early-science timescale.


Image Data Layouts
==================

This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows
-------------------------------

Comment:

I think you are aware, but the FITS tile compression code fpack ends up with a binary table of compressed images. In DES, we get pretty good (lossy) compression with the tiling working per cell. We found pretty apparent issues with lossy compression when the tile compression boundaries crossed cell boundaries because the noise levels in the images change.

Comment (Member):

It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoehorn this into FITS.

Comment (Member Author):

The way I see it, we can also do HDF5, but we really can't just do HDF5, according to our requirements. Maybe there's some scheme where we only store the HDF5 and convert to FITS when streaming bytes in various places, but I'm not at all convinced that removes enough constraints from the FITS version to make it easier to implement overall.

Comment:

> It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoehorn this into FITS.

This is an interesting perspective. Given that FITS already provides a solution that generates this kind of data structure internally and has all of the support to read subsets etc., it seems like it'd be shoehorning to put this into HDF5. One would have to build all of the logic to interact with the table oneself when something already exists and is a de facto community standard.

It seems like the only issue with simply using what FITS has is that one cannot use object stores / non-POSIX I/O to read subsets of the file. That seems like the kind of use case that is specific to DM, is solvable with custom code that can be written just for DM's use, and results in reusing essentially everything FITS already offers.

Comment (Member):

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough. cc/ @gpdf

Comment (Member Author):

Cells are one parallelization axis, sure. But they're not the only one; parallelizing over objects is another, and we'll have enough data that we can keep all the machines plenty hot parallelizing only over patches. I'd generally say that parallelizing over cells is best only if you have your own reason to do per-cell detection and hence have no choice but to deal with the problem of resolving duplicate detections later. Otherwise it's a lot cleaner to combine data at the pixel level, especially if you're starting from our main Object table's detections.

Comment from @TallJimbo (Member Author), Jul 23, 2024:

> Do we have a service to do a bulk data transfer?

We don't, and "bulk" has implications that we don't want to get into. But services that transform or augment images as they are downloaded is something we've talked about for a long time.

My main concern with only providing FITS via a service is that I think there's still a lot of work defining the format of the FITS file we return. Stitching together the inner cells is not enough to make a coadd look just like a single-epoch image, at least not when you're thinking about the on-disk data model rather than a Python API. I don't want to put a lot of effort into an HDF5 format for cell-based coadds (especially not if I'm also putting a lot of effort into FITS+JSON serialization of other image data products), only to have the scope of what people want back from the FITS-transformer service creep up to near parity with what's in the HDF5.

Comment:

@timj wrote:

> Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

I think that the binary table model is a reasonable one. It solves one of the biggest problems for interoperable, non-mission-specific interpretation of FITS files with enormous numbers of HDUs: there's nothing in the FITS standard, and precious little in the "registered conventions", for having a "table of contents" extension that tells you what's in the file.

By comparison, putting the subimages in a table solves this problem for free. You can put all sorts of metadata into additional columns in the table, and a reasonable application will give a user the ability to browse by this metadata.

It's true that today if you handed a big Green Bank style table to Firefly or Aladin you might not get acceptable UX, but that's because there's been little pressure on tools in this area to catch up with what's already in the standard.

It's important to recognize that Rubin isn't the only project with this need, and cell-based coadds aren't even the only application for it within Rubin: we also have an equally important requirement to be able to create multi-epoch bulk cutout files, both for cutouts around many objects, and for cutouts around one (possibly moving) object in many epochs.

So we are among friends, as it were, in looking for standards-conformant improvements in tool behavior.

Comment:

> If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

Interesting. Do we have a service to do a bulk data transfer? Such a conversion could happen in that process, since we don't want to store the trimmed/stitched image at USDF and the destination may not have Science Pipelines installed to read our custom data structure.

We have a robust framework for retrieving "related data" in the RSP. When someone queries for coadded images, they'll get a list of images, most naturally, in the way they are chunked in the Butler (e.g., currently, one row per patch in the query result).

The way the image itself is then retrieved actually allows for N different products to be returned associated with that row in the query result. So you can return the full cell-based coadd FITS file verbatim from the backing store, or you can return a stitched-together simple FITS file for the whole area it covers, or you can return a PNG preview of it, or a table of its provenance, or pretty much anything else we can think of.

These additional related products can be things that already exist statically, or things that are created on the fly by services from the data on the backing store.

Comment:

... and modern IVOA-flavored tools like Firefly and Aladin already know how to follow these links and present these options to users.


Since FITS binary tables can have array columns with arbitrary shapes, the most intuitive layout for cell-based image data is probably a single binary table HDU that has each image plane as a separate 2-d array column.
A FITS WCS can be stored with each row, using the "Green Bank" convention to translate header keys to table columns; this may be sufficient for third-party readers to fully interpret our files and even stitch together cells on the fly (in the case of image viewers).

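A minimal sketch of what such a table might look like in Astropy, assuming a hypothetical 250x250 outer cell and Green Bank-style WCS keyword columns (names, shapes, and the tiny cell count are illustrative):

.. code-block:: python

    import numpy as np
    from astropy.io import fits

    n = 250          # outer cell size
    ncells = 4       # a real patch would have ~27x27 cells
    image = np.zeros((ncells, n, n), dtype=np.float32)
    mask = np.zeros((ncells, n, n), dtype=np.int32)
    variance = np.zeros((ncells, n, n), dtype=np.float32)

    cols = [
        fits.Column(name="IMAGE", format=f"{n*n}E", dim=f"({n},{n})", array=image),
        fits.Column(name="MASK", format=f"{n*n}J", dim=f"({n},{n})", array=mask),
        fits.Column(name="VARIANCE", format=f"{n*n}E", dim=f"({n},{n})", array=variance),
        # Green Bank-style columns: WCS header keywords become per-row values.
        fits.Column(name="CRPIX1", format="D", array=np.zeros(ncells)),
        fits.Column(name="CRPIX2", format="D", array=np.zeros(ncells)),
    ]
    cells = fits.BinTableHDU.from_columns(cols, name="CELLS")
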
It's not clear whether any extant FITS viewers could *already* do this in an intuitive way, but by relying on an existing convention it'd be reasonable for them to add it, even if they were otherwise disinclined to add support for observatory-specific file formats.

The Green Bank convention makes no mention of having multiple images as different columns in the same table, however, so we may want to consider putting different image planes in different binary table HDUs, each with corresponding rows, and probably some columns - like the WCS ones - duplicated.
The extra storage cost of another header and some duplicated columns is insignificant, and it may be desirable to put multiple cells from the same plane closer together on disk than different planes of the same cell, especially since we expect that subimage reads of some planes (e.g. the main data image) will be much more popular than others (e.g. noise realizations).

PSF images fit naturally in this layout, as there's nothing wrong with some array columns having a different shape.
There might be some confusion with a Green Bank convention WCS, though, if we go with a single table with different columns for different image planes as well as the PSF; there's no way to mark which image columns the WCS applies to.

The main problem with binary table layouts is compression support.
The latest FITS standard does include (in section 10.3) a discussion of "tiled table compression", which seems like it'd be sufficient, as long as compressing one cell at a time is enough to get a good compression ratio (this is unclear).
Unlike image tile compression, binary table tile compression doesn't support lossy compression algorithms or dithered quantization, but it would still be possible to do our own non-dithered quantization and use ``BZERO`` and ``BSCALE`` to record the transformation from integer back to floating-point.
Comment:

Agreed. This should work.

The FITS standard also explicitly says that the door is open to lossy compression algorithms for tables if a specific technical problem is solved. So we can try to pursue that further as we move along this road.

Comment (Member Author):

Given that BZERO and BSCALE don't provide a way to do subtractive dithering (and after hearing from @beckermr that this was important in practice), I now don't believe this approach would meet our needs. I've left the discussion of the possibility in, but have reframed it as an incomplete solution.

The bigger problem is that there do not appear to be any implementations of it: there is no mention of it in either the CFITSIO or Astropy documentation (and even if an implementation does exist in, say, the Java ecosystem, we wouldn't be in a position to use it).
Comment:

As mentioned above, it does appear that the current version of CFITSIO does support this.

On the Java side, which is important to Firefly, nom.tam.fits also claims to support it.

Comment (Member Author):

Fixed.

Comment (Member Author):

It appears I was wrong about this: as @gpdf discovered, both CFITSIO and at least one major Java FITS implementation (nom.tam.fits) do support table compression (CFITSIO just hasn't updated their docs to reflect this).

But having looked more closely at the table compression spec, I'm less sanguine about introducing our own quantization convention for lossy compression here, for two reasons:

  • Adding BZERO and BSCALE columns as per the Green Bank convention only works if there is only one image column in the binary table, i.e. we put each logical plane in its own binary table HDU (this seems generally true of the Green Bank convention: there's no provision for multiple image columns). This is fine, but I think it removes one of the advantages of the binary table form.

  • There'd be no way to record the "subtractive dithering" quantization (in which one adds a deterministic uniform random number before rounding to int, and subtracts it when decompressing). And while we could invent one (probably a ZDITHER0 column, by analogy with the tiled image compression header key), users would have to use our code to correctly decompress in that case. While other file layouts would require users who don't use our code to do some work, none of them would require them to implement anything as low-level as decompression. @beckermr, do you know if you used subtractive dithering in your lossy-compression configuration? It'd be the ZQUANTIZ header key.

In fact, if we decide to do binary tables with one plane per HDU to work around the first problem, we could just make them the very binary tables used to implement tiled image compression, but add extra columns (which are explicitly permitted) for e.g. a Green Bank convention WCS or other metadata. FITS libraries often provide the option to read a compressed image HDU as an image or as a binary table, and with this format we'd provide all information in a fairly FITS-native way in at least one of those two interpretations. This does mean that the tiled image compression representation would correspond to the exploded image, of course.

Comment from @beckermr, Aug 20, 2024:

We use SUBTRACTIVE_DITHER_2 which preserves zeros exactly. You need dithering to remove biases from the lossy compression and having zeros go through unchanged has lots of nice advantages.
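
A rough numpy sketch of the idea, for illustration only: the real FITS convention uses its own portable random-number generator seeded via ZDITHER0 so that any reader can reproduce the dither, and SUBTRACTIVE_DITHER_2 reserves a specific integer code for exact zeros.

.. code-block:: python

    import numpy as np

    ZERO_SENTINEL = np.int32(-(2**31) + 1)   # reserved code for exact zeros (illustrative)

    def quantize(data, scale, seed):
        rng = np.random.default_rng(seed)    # the seed must be recorded (cf. ZDITHER0)
        dither = rng.uniform(-0.5, 0.5, size=data.shape)
        q = np.round(data / scale + dither).astype(np.int32)
        q[data == 0.0] = ZERO_SENTINEL       # pass exact zeros through unchanged
        return q

    def dequantize(q, scale, seed):
        rng = np.random.default_rng(seed)
        dither = rng.uniform(-0.5, 0.5, size=q.shape)
        out = (q.astype(np.float64) - dither) * scale
        out[q == ZERO_SENTINEL] = 0.0
        return out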

While we've already discussed the fact that we'll probably need to implement some low-level FITS image tile compression code in order to do decompressed subimage reads with object stores anyway, the binary table compression situation is much more problematic:

- tables are much more complicated than images;
- we would have to implement writes ourselves, not just reads;
Comment:

If, as stated elsewhere in the document, we can write with CFITSIO to POSIX files, then all we have to implement at first is reading via Astropy.

Comment (Member Author):

This has become a moot point because I was wrong about tiled table compression support in third-party libraries, and rewrote this section as a result. But at the same time I've realized that we do need some amount of low-level write support work in order to do our own quantization, since Astropy does not support that, and the stuff in afw that does it is a mess (dependencies on undocumented and not-obviously-public CFITSIO calls) and not easy to call outside afw C++ code. I've added some words on that in the latest revision.

- we would not have a reference implementation we could use for testing;
Comment:

It looks like we have two; having a Java version does help with this.

Comment (Member Author):

Fixed.

- if the standard has not seen real use, we stand a good chance of discovering uncovered edge cases or other defects;
- third-party FITS readers would definitely not be able to read our files, at least not without significant work.

In fact, even without compression, the binary table layout would require writing our own code just to solve the problem of subimage reads over object stores, since Astropy cannot do efficient table subset reads and CFITSIO cannot do object store reads.

Per-HDU Cells
-------------

Another simple file layout is to put each image plane for each cell in a completely separate FITS image HDU.
This is entirely compatible with FITS tile compression (though we'd almost certainly compress the entire HDU as one tile) and our goals for using FITS WCS.
Stitching images from different HDUs into a coherent whole is probably a bit more likely for a third-party FITS viewer to support than images from different binary tables, but a flat list of HDUs for all cells and image planes provides a lot less organizational structure than a binary table (especially a single binary table) for third-party tools to interpret.
Comment:

Yes, stitching across HDUs is supported by some FITS viewers that obey the not-very-well-standardized NOAO mosaic-focal-plane header conventions. We would want to include those headers. I'm not sure how those headers would handle the overlap regions the way we want, but that's an answerable question.

I believe that Firefly does not currently do this, but if we were to choose this representation, obviously we'd put that on the to-do list.

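A sketch of this layout in Astropy, with each cell plane written as its own tile-compressed HDU (the extension names, the tiny 2x2 cell grid, and the choice of GZIP_2 are illustrative; ``tile_shape`` needs a recent Astropy, which older versions call ``tile_size``):

.. code-block:: python

    import numpy as np
    from astropy.io import fits

    hdus = [fits.PrimaryHDU()]
    for cy in range(2):                          # a real patch would have ~27x27 cells
        for cx in range(2):
            for plane, dtype in (("IMAGE", np.float32), ("MASK", np.int32),
                                 ("VARIANCE", np.float32)):
                hdus.append(
                    fits.CompImageHDU(
                        data=np.zeros((250, 250), dtype=dtype),
                        name=f"CELL_{cy}_{cx}_{plane}",
                        compression_type="GZIP_2",    # lossless
                        tile_shape=(250, 250),        # compress the whole cell as one tile
                    )
                )
    fits.HDUList(hdus).writeto("cell_coadd_sketch.fits", overwrite=True)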

Each HDU comes with an extra 3-9 KB of overhead (1-2 header blocks, and padding out the full HDU size to a multiple of 2880 bytes) that cannot be compressed, which is not ideal, but probably not intolerable unless we get unexpectedly good compression ratios or shrink the cell size: an uncompressed 250x250 single-precision floating point image is 250KB, so those overheads should be at most 4% or so.
The overheads would be significant for the PSF images, which we expect to be 25-40 pixels on a side (2.5-6 KB uncompressed).
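
The arithmetic behind those percentages, for reference:

.. code-block:: python

    header_overhead = 2 * 2880            # 1-2 header blocks; up to ~9 KB with padding
    cell_bytes = 250 * 250 * 4            # uncompressed float32 outer cell: 250 KB
    psf_bytes = 40 * 40 * 4               # largest PSF stamp considered: ~6 KB
    print(header_overhead / cell_bytes)   # ~0.02: a few percent per cell-plane HDU
    print(header_overhead / psf_bytes)    # ~0.9: comparable to the PSF data itself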

Subimage reads would be similarly non-ideal but perhaps tolerable.
Because each HDU is so small, it'd be plenty efficient to read full HDUs, but only those for the cells that overlap the region of interest.
Seeking to the right HDUs (or requesting the appropriate byte ranges, in the object store case) is easily solved by putting a table of byte offsets in the primary HDU header, though this isn't something third-party FITS readers could leverage.
That would make for a simple solution to the problem of doing subimage reads over object stores (including compression): we could use the address table to read the HDUs we are interested in in their entirety into a client-side memory location that looks like a full in-memory FITS file holding just those HDUs, and then delegate to CFITSIO's "memory file" interfaces to let it do the decompression.
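
A sketch of that scheme, assuming invented ``CELLnOFF``/``CELLnLEN`` header keys for the offset table and a primary header that fits in a single 2880-byte block:

.. code-block:: python

    import io

    import fsspec
    from astropy.io import fits

    def read_cell_hdus(url, cell_indices, block=2880):
        with fsspec.open(url, "rb") as f:
            head = f.read(block)                       # data-less primary HDU with the offset table
            primary = fits.Header.fromstring(head.decode("ascii"))
            parts = [head]                             # reuse the primary HDU bytes as-is
            for i in cell_indices:
                f.seek(primary[f"CELL{i}OFF"])         # one ranged read per requested cell HDU
                parts.append(f.read(primary[f"CELL{i}LEN"]))
        # The concatenation looks like a small in-memory FITS file; decompression
        # is then handled by Astropy (or CFITSIO's memory-file interface).
        return fits.open(io.BytesIO(b"".join(parts)))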

As in the binary table case, it's an open question whether we could get sufficiently good compression ratios if we are limited to compressing one cell at a time.

Data Cubes
----------

TODO

Exploded Images
---------------

TODO

Comment:

The MEDS format is basically exploded images but with them all unraveled into a single 1D set of pixels. We then set the lossy compression tile boundaries to correspond to the exploded cells with one per tile.

Comment:

To be clear, we store one exploded image per data type being stored (e.g., one image HDU, one mask HDU etc.) and then they all share common metadata stored in a binary table HDU in the file.
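
A minimal sketch of the bookkeeping that implies (names are illustrative): each plane is one long 1-d pixel array, and a per-cutout ``start_row`` column records where each cutout begins.

.. code-block:: python

    import numpy as np

    def get_cutout(pixels_1d, start_row, nrow, ncol, index):
        """Return cutout ``index`` from a MEDS-style 1-d pixel array."""
        start = start_row[index]
        return pixels_1d[start:start + nrow * ncol].reshape(nrow, ncol)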

Comment:

I'm not familiar with this format, @beckermr . Do you have a reference?

Comment from @arunkannawadi (Member), Aug 15, 2024:

https://github.com/esheldon/meds/wiki/MEDS-Format

I do not know if this is up to date with what has been done in practice though.


Stitched Images
---------------

TODO

Hybrid Options
--------------

TODO


Metadata Layouts
================

Cell coadd metadata falls into three categories:

- global information common to all cells (identifiers, WCS, grid structure);
- per-cell information with a fixed schema (including coadded aperture corrections and wavelength-dependent throughput);
- visit, detector and other IDs for the observations that contribute to each cell.

While some global information will go into FITS headers (certainly the WCS and some identifiers), we do not want to assume that all global metadata can be neatly represented in the limited confines of a FITS header.
A single-row binary table is another option, but we will likely instead adopt the approach recently proposed for other Rubin image data products on RFC-1030: embedding a JSON document as a byte array in a FITS extension HDU.
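
A sketch of the embedding (the extension name and encoding details here are placeholders rather than the RFC-1030 specifics):

.. code-block:: python

    import json

    import numpy as np
    from astropy.io import fits

    metadata = {"tract": 1234, "patch": 42, "grid": {"inner": [150, 150], "border": 50}}
    payload = np.frombuffer(json.dumps(metadata).encode("utf-8"), dtype=np.uint8).copy()
    json_hdu = fits.ImageHDU(data=payload, name="METADATA_JSON")

    # Reading it back:
    # metadata = json.loads(bytes(hdul["METADATA_JSON"].data).decode("utf-8"))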

A binary table with per-cell rows is a natural fit for the fixed-schema per-cell information, especially if the image data layout already involves a binary table with per-cell rows.
Comment:

I would encourage storing the fixed-schema data as a table even if it is also in the JSON.

But if we're embedding a JSON document in the FITS file anyway, it might make more sense to store this information in JSON as well; this will let us share code, documentation, and serialization for more complex objects with other Rubin image data products, and that includes sharing the machinery for managing schema changes and schema documentation.

Comment:

Have you seen performance issues with JSON in FITS?

We do this very rarely in DES and it is a potentially lossy operation, especially with respect to floating point numbers etc.

Comment (Member Author):

You do have to be very careful with the floating point numbers (NaNs and Infs especially), but the libraries we're using (mostly Pydantic) seem to be pretty good about that.

The JSON-in-FITS stuff is extremely new and we don't have any benchmarks of that specifically. But I don't think the combination raises any red flags, and we've been pretty happy with JSON serialization even in some very big data structures (not YAML, though, avoid YAML like the plague if you care about performance!).

Comment:

My 2 cents: As an end user I would prefer a binary table if the metadata has a "flat" (non-hierarchical) structure and has FITS-compatible data types.

Using binary blobs makes data exploration/browsing more difficult for anyone who is not using DM specific "data loaders" etc.


The table of observations that contribute to each cell is also a natural binary table, though not one with per-cell rows (it's more natural as a cell-visit-detector join table); once again, embedded JSON is an equally viable option.
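
A sketch of what that join table could look like (column names and values are illustrative):

.. code-block:: python

    import numpy as np
    from astropy.io import fits

    inputs = fits.BinTableHDU.from_columns(
        [
            fits.Column(name="CELL_X", format="J", array=np.array([0, 0, 1])),
            fits.Column(name="CELL_Y", format="J", array=np.array([0, 0, 0])),
            fits.Column(name="VISIT", format="K", array=np.array([903334, 903336, 903334])),
            fits.Column(name="DETECTOR", format="J", array=np.array([22, 45, 23])),
        ],
        name="CELL_INPUTS",
    )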


See the `Documenteer documentation <https://documenteer.lsst.io/technotes/index.html>`_ for tips on how to write and configure your new technote.