Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-45371: technote describing possible cell-based coadd file layouts #1

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

TallJimbo
Copy link
Member

No description provided.

index.rst Outdated
As a result, I expect us to need to write some low-level code of our own to do subimage reads over object stores, though this is not specific to cell-based coadds: we need this for all of our image data products.
Whether to contribute more general code upstream (probably to Astropy) or focus only on reading the specific files we write ourselves is an important open question; the general solution would be very useful the community, but its scope is unquestionably larger.

Finally, the WCS of any single cell (or the inner-cell stitched image) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be very clear, the WCS of a single cell is simply that of the tract. The layout has only one WCS and each cell is simply some subimage within that WCS. This feature makes the overlap pixels be literally the same geometrically, means that the images can be stitched together correctly, etc.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's conceptually true but if the cells are stored independently of each other in the FITS file, they do technically have separate WCSes and will at least need different CRPIXn values in order to participate in the tract-global WCS correctly.

(Of course the overlap pixels don't have the same values, since there's a different input image selection for each cell.)

index.rst Outdated
Whether to contribute more general code upstream (probably to Astropy) or focus only on reading the specific files we write ourselves is an important open question; the general solution would be very useful the community, but its scope is unquestionably larger.

Finally, the WCS of any single cell (or the inner-cell stitched image) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).
An additional FITS WCS could be used to describe the position of a single-cell image within a larger grid.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This position is simply a pixel location and the dimension of the image.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Jim is referring to the Rubin convention of having an "alternative WCS" (we usually use WCS A) that provides the mapping from local FITS 1-based pixel coordinates to tract-level 0-based coordinates.

index.rst Show resolved Hide resolved

This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are aware, but the FITS tile compression code fpack ends up with a binary table of compressed images. In DES, we get pretty good (lossy) compression with the tiling working per cell. We found pretty apparent issues with lossy compression when the tile compression boundaries crossed cell boundaries because the noise levels in the images change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoe horn this into FITS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I see it, we can also do HDF5, but we really can't just do HDF5, according to our requirements. Maybe there's some scheme where we only store the HDF5 and convert to FITS when streaming bytes in various places, but I'm not at all convinced that removes enough constraints from the FITS version to make it easier to implement overall.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoe horn this into FITS.

This is an interesting perspective. Given that FITS already provides a solution that generates this kind of data structure internally and has all of the support to read subsets etc., it seems like it'd be shoe horning to put this into HDF5. One would have to build all of the logic to interact with the table yourself when something already exists and is a defacto community standard.

It seems like the only issue with simply using what FITS has is that one cannot use object stores / non-posix I/O to read subsets of the file. That seems like the kind of use case that is specific to DM, is solvable with custom code that can be written just for DM's use, and results in reusing essentially everything FITS already offers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough. cc/ @gpdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cells are one parallelization axis, sure. But they're not the only one; parellelizing over objects is another, and we'll have enough data that we can keep all the machines plenty hot parallelizing only over patches. I'd generally say that parallelizing over cells is best only if you have your own reason to do per-cell detection and hence have no choice but to deal with the problem of resolving duplicate detections later. Otherwise it's a lot cleaner to combine data at the pixel level, especially if you're starting from our main Object table's detections.

Copy link
Member Author

@TallJimbo TallJimbo Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a service to do a bulk data transfer?

We don't, and "bulk" has implications that we don't want to get into. But services that transform or augment images as they are downloaded is something we've talked about for a long time.

My main concern with only providing FITS via a service is that I think there's still a lot of work defining the format of the FITS file we return. Stitching together the inner cells is not enough to make a coadd look just like a single-epoch image, at least not when you're thinking about the on-disk data model rather than a Python API. I don't want to put a lot of effort into an HDF5 format for cell-based coadds (especially not if I'm also putting a lot of effort into FITS+JSON serialization of other image data products), only to have the scope of what people want back from the FITS-transformer service creep up to near parity with what's in the HDF5.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timj wrote:

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

I think that the binary table model is a reasonable one. It solves one of the biggest problems for interoperable, non-mission-specific interpretation of FITS files with enormous numbers of HDUs: there's nothing in the FITS standard, and precious little in the "registered conventions", for having a "table of contents" extension that tells you what's in the file.

By comparison, putting the subimages in a table solves this problem for free. You can put all sorts of metadata into additional columns in the table, and a reasonable application will give a user the ability to browse by this metadata.

It's true that today if you handed a big Green Bank style table to Firefly or Aladin you might not get acceptable UX, but that's because there's been little pressure on tools in this area to catch up with what's already in the standard.

It's important to recognize that Rubin isn't the only project with this need, and cell-based coadds aren't even the only application for it within Rubin: we also have an equally important requirement to be able to create multi-epoch bulk cutout files, both for cutouts around many objects, and for cutouts around one (possibly moving) object in many epochs.

So we are among friends, as it were, in looking for standards-conformant improvements in tool behavior.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

Interesting. Do we have a service to do a bulk data transfer? Such a conversion could happen in the process, since we don't want to store the trimmed/stitched image at USDF and the destination may not have Science Pipelines installed to read our custom datastructure.

We have a robust framework for retrieving "related data" in the RSP. When someone queries for coadded images, they'll get a list of images, most naturally, in the way they are chunked in the Butler (e.g., currently, one row per patch in the query result).

The way the image itself is then retrieved actually allows for N different products to be returned associated with that row in the query result. So you can return the full cell-based coadd FITS file verbatim from the backing store, or you can return a stitched-together simple FITS file for the whole area it covers, or you can return a PNG preview of it, or a table of its provenance, or pretty much anything else we can think of.

These additional related products can be things that already exist statically, or things that are created on the fly by services from the data on the backing store.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and modern IVOA-flavored tools like Firefly and Aladin already know how to follow these links and present these options to users.

index.rst Outdated
Exploded Images
---------------

TODO

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MEDS format is basically exploded images but with them all unraveled into a single 1D set of pixels. We then set the lossy compression tile boundaries to correspond to the exploded cells with one per tile.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, we store one exploded image per data type being stored (e.g., one image HDU, one mask HDU etc.) and then they all share common metadata stored in a binary table HDU in the file.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this format, @beckermr . Do you have a reference?

Copy link
Member

@arunkannawadi arunkannawadi Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/esheldon/meds/wiki/MEDS-Format

I do not know if this is up to date with what has been done in practice though.

index.rst Outdated
Comment on lines 146 to 147
A binary table with per-cell rows is a natural fit for the fixed-schema per-cell information, especially if the image data layout already involves a binary table with per-cell rows.
But if we're embedding a JSON document in the FITS file anyway, it might make more sense to store this information in JSON as well; this will let us share code, documentation, and serialization for more complex objects with other Rubin image data products, and that includes sharing the machinery for managing schema changes schema documentation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen performance issues with JSON in FITS?

We do this very rarely in DES and it is a potentially lossy operation, especially with respect to floating point numbers etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do have to be very careful with the floating point numbers (NaNs and Infs especially), but the libraries we're using (mostly Pydantic) seem to be pretty good about that.

The JSON-in-FITS stuff is extremely new and we don't have any benchmarks of that specifically. But I don't think the combination raises any red flags, and we've been pretty happy with JSON serialization even in some very big data structures (not YAML, though, avoid YAML like the plague if you care about performance!).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents: As an end user I would prefer binary table if the metadata has a "flat" (non hierarchical) structure and has FITS compatible data types.

Using binary blobs makes data exploration/browsing more difficult for anyone who is not using DM specific "data loaders" etc.

@TallJimbo TallJimbo marked this pull request as ready for review July 22, 2024 20:52
index.rst Outdated Show resolved Hide resolved
index.rst Outdated Show resolved Hide resolved
This avoids the problem with per-HDU overheads, and it makes an address table much less important, as there are many fewer HDUs.
It also allows compression tiles that comprise multiple cells.

It does not allow us to represent the on-sky locations of cells using FITS WCS, however, and this is probably enough to rule it out.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the integer offsets that indicate the position of the cell?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, more or less. We could have a WCS that describes one cell, but I don't think there's any way to write a FITS WCS that puts different slices of a data cube in different places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think FITS-WCS can do it. I'm pretty sure AST can with switchMaps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea if gwcs can.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there might be a way to do this with table-lookup WCSes, but they are not uniformly well supported by readers. I would have to read the FITS standard very closely to determine whether this would also require us to do all the spherical projection ourselves, which would require us to be careful about precision issues in the (piecewise linear) table lookups, especially near the coordinate poles.

I am not endorsing this idea! But if this storage scheme were the winner for other reasons, it's something we could research.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This storage scheme definitely isn't winning for other reasons, so I don't think it's worth anyone's time to look into it further.

index.rst Show resolved Hide resolved
Co-authored-by: Arun Kannawadi  <[email protected]>
@gpdf gpdf self-requested a review August 15, 2024 22:56
index.rst Outdated
The main problem with binary table layouts is compression support.
The latest FITS standard does include (in section 10.3) a discussion of "tiled table compression", which seems like it'd be sufficient, as long as compressing one cell at a time is enough to get a good compression ratio (this is unclear).
Unlike image tile compression, binary table tile compression doesn't support lossy compression algorithms or dithered quantization, but it would still be possible to do our own non-dithered quantization and use ``BZERO`` and ``BSCALE`` to record the transformation from integer back to floating-point.
The bigger problem is that there does not appear to be any implementations of it: there is no mention of it in either the CFITSIO or Astropy documentation (and even if an implementation does exist in, say, the Java ecosystem, we wouldn't be in a position to use it).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears I was wrong about this: as @gpdf discovered, both CFITSIO and at least one major Java FITS implementation (nom.tam.fits) do support table compression (CFITSIO just hasn't updated their docs to reflect this).

But having looked more closely at the table compression spec, I'm less sanguine about introducing our own quantization convention for lossy compression here, for two reasons:

  • Adding BZERO and BSCALE columns as per the Green Bank convention only works if there is only one image column in the binary table, i.e. we put each logical plane in its own binary table HDU (this seems generally true of the Green Bank convention: there's no provision for multiple image columns). This is fine, but I think it removes one of the advantages of the binary table form.

  • There'd be no way to record the "subtractive dithering" quantization (in which one adds a deterministic uniform random number before rounding to int, and subtracts it when decompressing). And while we could invent one (probably a ZDITHER0 column, by analogy with the tiled image compression header key), users would have to use our code to correctly decompress in that case. While other file layouts would require users who don't use our code to do some work, none of them would require them to implement anything as low-level as decompression. @beckermr, do you know if you used subtractive dithering in your lossy-compression configuration? It'd be the ZQUANTIZ header key.

In fact, if we decide to do binary tables with one plane per HDU to work around the first problem, we could just make them the very binary tables used to implement tiled image compression, but add extra columns (which are explicitly permitted) for e.g. a Green Bank convention WCS or other metadata. FITS libraries often provide the option to read a compressed image HDU as an image or as a binary table, and with this format we'd provide all information in a fairly FITS-native way in at least one of those two interpretations. This does mean that the tiled image compression representation would correspond to the exploded image, of course.

Copy link

@beckermr beckermr Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use SUBTRACTIVE_DITHER_2 which preserves zeros exactly. You need dithering to remove biases from the lossy compression and having zeros go through unchanged has lots of nice advantages.

Copy link

@gpdf gpdf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great summary of the issues. My main contribution is to note that table-compression implementations are available.

I've made a number of other comments. I'll write separately on the RFC about the proposed decision itself, which isn't addressed in this PR.

index.rst Outdated
- Astropy can do image tile compression with efficient subimage reads on POSIX filesystems, and it can do uncompressed reads (including efficient subimage reads) against both POSIX and object stores via `fsspec`, but it delegates to vendored CFITSIO code for its tile compression code and does not support the combination of these two.
Astropy cannot do subset reads of FITS binary tables, even on POSIX filesystems.

- I am not aware of any library that can do FITS binary table compression at all.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in fact CFITSIO (C) and nom.tam.fits (Java) both support it. It appears that the CFITSIO documentation is behind the code.

Note that CFITSIO was only recently formally released as open-source code on GitHub (there were semi-bootleg versions of it previously) and people may not be entirely familiar with the latest version.

Some links:

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/fitsio.h#L2069-L2070

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/imcompress.c#L8704

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently lossy table compression is not supported.

The permitted algorithms for compressing BINTABLE columns are ’RICE 1’, ’GZIP 1’, and ’GZIP 2’ (plus ’NOCOMPRESS’), which are lossless and are described in Sect. 10.4. Lossy compression could be allowed in the future once a process is defined to preserve the details of the compression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timj wrote:

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

I think that the binary table model is a reasonable one. It solves one of the biggest problems for interoperable, non-mission-specific interpretation of FITS files with enormous numbers of HDUs: there's nothing in the FITS standard, and precious little in the "registered conventions", for having a "table of contents" extension that tells you what's in the file.

By comparison, putting the subimages in a table solves this problem for free. You can put all sorts of metadata into additional columns in the table, and a reasonable application will give a user the ability to browse by this metadata.

It's true that today if you handed a big Green Bank style table to Firefly or Aladin you might not get acceptable UX, but that's because there's been little pressure on tools in this area to catch up with what's already in the standard.

It's important to recognize that Rubin isn't the only project with this need, and cell-based coadds aren't even the only application for it within Rubin: we also have an equally important requirement to be able to create multi-epoch bulk cutout files, both for cutouts around many objects, and for cutouts around one (possibly moving) object in many epochs.

So we are among friends, as it were, in looking for standards-conformant improvements in tool behavior.


This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

Interesting. Do we have a service to do a bulk data transfer? Such a conversion could happen in the process, since we don't want to store the trimmed/stitched image at USDF and the destination may not have Science Pipelines installed to read our custom datastructure.

We have a robust framework for retrieving "related data" in the RSP. When someone queries for coadded images, they'll get a list of images, most naturally, in the way they are chunked in the Butler (e.g., currently, one row per patch in the query result).

The way the image itself is then retrieved actually allows for N different products to be returned associated with that row in the query result. So you can return the full cell-based coadd FITS file verbatim from the backing store, or you can return a stitched-together simple FITS file for the whole area it covers, or you can return a PNG preview of it, or a table of its provenance, or pretty much anything else we can think of.

These additional related products can be things that already exist statically, or things that are created on the fly by services from the data on the backing store.


This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and modern IVOA-flavored tools like Firefly and Aladin already know how to follow these links and present these options to users.

index.rst Show resolved Hide resolved
index.rst Show resolved Hide resolved
index.rst Outdated
Stitched Images
---------------

We expect most accesses to our coadd files to be uninterested in the redundant overlap pixel values - instead, most science users will be interested in stitching together the inner cells to form a mostly-seamless patch-level image.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "many", perhaps, not "most", given the discussions of the value of the images for algorithmic processing vs. human inspection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made that change and added my point on forward-modeling vs. data-convolution analysis needs from the other thread.


Finally, the WCS of any single cell (or the inner-cell stitched image; they differ only in an integer offset) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).
An additional FITS WCS could be used to describe the position of a single-cell image within a larger grid (this is just an integer offset).
It is highly desirable that any images we store in our FITS files have both to allow third-party FITS readers to fully interpret them.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add:
"In some cases we might be using corners of the current FITS standard (v4.00) that existing community tools do not yet fully implement. We consider this preferable to using entirely Rubin-unique formats and conventions, and we are prepared to support the community to some extent in filling out these corners (e.g., by adding support to Astropy and Firefly).

We are aware of other projects' interest in FITS representations for large numbers of small-ish images with associated metadata and will keep them informed of what we are planning."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little uncomfortable with the parenthetical, as I don't think we'll actually have the effort to add support to at least Astropy on the early-science timescale.

index.rst Outdated Show resolved Hide resolved
While some global information will go into FITS headers (certainly the WCS and some identifiers), we do not want to assume that all global metadata can be neatly represented in the limited confines of a FITS header.
A single-row binary table is another option, but we will likely instead adopt the approach recently proposed for other Rubin image data products on RFC-1030: embedding a JSON document as a byte array in a FITS extension HDU.

A binary table with per-cell rows is a natural fit for the fixed-schema per-cell information, especially if the image data layout already involves a binary table with per-cell rows.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would encourage storing the fixed-schema data as a table even if it is also in the JSON.

- tiled table compression is supported by CFITSIO and other readers;

- we do need lossy compression and subtractive dithering for at least
  some image planes.

The overall result is that the big picture is not changed much: binary
tables are still not really viable, because they can't do the kind of
compression we need.
Comment on lines 52 to 53
CFITSIO can only permits customization of the quantization used in image tile compression via undocumented internal interfaces (which is what we use in ``afw``).
This imposes the same limitation on its Python bindings in the ``fitsio`` package.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not totally true, though I agree the documentation is sparse.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just added support for some of the lossy compression seeding options to fitsio: esheldon/fitsio#411.

Excerpts from the docs are here in this github message: esheldon/fitsio#411 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants