DM-45371: technote describing possible cell-based coadd file layouts #1
I think in fact CFITSIO (C) and nom.tam.fits (Java) both support it. It appears that the CFITSIO documentation is behind the code.
Note that CFITSIO was only recently formally released as open-source code on GitHub (there were semi-bootleg versions of it previously) and people may not be entirely familiar with the latest version.
Some links:
https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/fitsio.h#L2069-L2070
https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/imcompress.c#L8704
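For a quick sanity check on what a given writer actually produced, the tiled-table-compression convention marks compressed table HDUs with a ZTABLE keyword. A minimal Astropy sketch, assuming that convention and a hypothetical file name:

```python
from astropy.io import fits

# Sketch: detect tile-compressed binary tables via the ZTABLE marker keyword
# from the tiled-table-compression convention (assumes the writer follows
# that convention; the file name is hypothetical).
with fits.open("coadd_cells.fits") as hdul:
    for hdu in hdul:
        if isinstance(hdu, fits.BinTableHDU) and hdu.header.get("ZTABLE", False):
            print(f"{hdu.name}: tile-compressed table")
```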
Currently lossy table compression is not supported.
Fixed.
To be very clear, the WCS of a single cell is simply that of the tract. The layout has only one WCS, and each cell is simply some subimage within that WCS. This feature makes the overlap pixels literally the same geometrically, means the images can be stitched together correctly, and so on.
That's conceptually true, but if the cells are stored independently of each other in the FITS file, they do technically have separate WCSes and will at least need different CRPIXn values in order to participate in the tract-global WCS correctly.
(Of course the overlap pixels don't have the same values, since there's a different input image selection for each cell.)
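As a concrete illustration of that CRPIXn bookkeeping, a sketch with Astropy (the file name, cell origin, and the assumption of a simple FITS-serializable tract WCS are all placeholders):

```python
from astropy.io import fits
from astropy.wcs import WCS

# Sketch: derive a per-cell WCS from the single tract-level WCS by shifting
# CRPIX1/CRPIX2 by the cell's origin (hypothetical values) within the tract.
tract_wcs = WCS(fits.getheader("tract_coadd.fits"))  # hypothetical file
cell_x0, cell_y0 = 4200, 6300                        # hypothetical cell origin (pixels)

cell_header = tract_wcs.to_header()
cell_header["CRPIX1"] -= cell_x0
cell_header["CRPIX2"] -= cell_y0
# CRVALn, CTYPEn, and the CD/PC matrix are identical for every cell; only
# CRPIXn changes, so overlapping pixels in adjacent cells share the same
# sky geometry.
```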
This position is simply a pixel location plus the dimensions of the image.
I think Jim is referring to the Rubin convention of having an "alternative WCS" (we usually use WCS `A`) that provides the mapping from local FITS 1-based pixel coordinates to tract-level 0-based coordinates.
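For reference, that alternate WCS is readable with standard tools; a sketch assuming the `A` key and a linear mapping (the file name and extension are placeholders):

```python
from astropy.io import fits
from astropy.wcs import WCS

# Sketch: read the 'A' alternate WCS, which maps local FITS pixel
# coordinates to tract-level pixel coordinates (linear, offset-only).
header = fits.getheader("cell.fits", ext=1)   # hypothetical file/extension
wcs_a = WCS(header, key="A")
# With origin=0, the input is the 0-based local pixel index and the output
# is the corresponding 0-based tract-level position.
tract_x, tract_y = wcs_a.wcs_pix2world(0.0, 0.0, 0)
```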
Add:
"In some cases we might be using corners of the current FITS standard (v4.00) that existing community tools do not yet fully implement. We consider this preferable to using entirely Rubin-unique formats and conventions, and we are prepared to support the community to some extent in filling out these corners (e.g., by adding support to Astropy and Firefly).
We are aware of other projects' interest in FITS representations for large numbers of small-ish images with associated metadata and will keep them informed of what we are planning."
I'm a little uncomfortable with the parenthetical, as I don't think we'll actually have the effort to add support to at least Astropy on the early-science timescale.
I think you are aware, but the FITS tile compression code fpack ends up with a binary table of compressed images. In DES, we get pretty good (lossy) compression with the tiling working per cell. We found pretty apparent issues with lossy compression when the tile compression boundaries crossed cell boundaries because the noise levels in the images change.
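Aligning the compression tiles with cells is easy to express in code; a sketch with Astropy (recent versions with `tile_shape`; the cell size and quantization level are placeholders, not DES's actual settings):

```python
import numpy as np
from astropy.io import fits

cell_size = 150                                   # hypothetical cell dimension
coadd = np.zeros((3000, 3000), dtype=np.float32)  # placeholder image data

# Sketch: one compression tile per cell, so the quantization noise model is
# constant within each tile and tile boundaries never cross cell boundaries.
hdu = fits.CompImageHDU(
    data=coadd,
    compression_type="RICE_1",
    tile_shape=(cell_size, cell_size),  # (ny, nx), same order as data.shape
    quantize_level=16.0,                # placeholder lossy quantization level
    quantize_method=2,                  # SUBTRACTIVE_DITHER_2 (discussed below)
)
hdu.writeto("coadd_tiled.fits", overwrite=True)
```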
It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoehorn this into FITS.
The way I see it, we can also do HDF5, but we really can't just do HDF5, according to our requirements. Maybe there's some scheme where we only store the HDF5 and convert to FITS when streaming bytes in various places, but I'm not at all convinced that removes enough constraints from the FITS version to make it easier to implement overall.
This is an interesting perspective. Given that FITS already provides a solution that generates this kind of data structure internally and has all of the support to read subsets, etc., it seems like it'd be shoehorning to put this into HDF5. One would have to build all of the logic to interact with the table oneself when something already exists and is a de facto community standard.
It seems like the only issue with simply using what FITS has is that one cannot use object stores / non-POSIX I/O to read subsets of the file. That seems like the kind of use case that is specific to DM, is solvable with custom code that can be written just for DM's use, and results in reusing essentially everything FITS already offers.
Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable in ds9/Firefly/Astropy seems like stretching the usefulness of our requirement to use FITS. If we have a service to convert the cell-based thing to a usable image, then that service returning FITS is good enough. cc/ @gpdf
Cells are one parallelization axis, sure. But they're not the only one; parallelizing over objects is another, and we'll have enough data that we can keep all the machines plenty hot parallelizing only over patches. I'd generally say that parallelizing over cells is best only if you have your own reason to do per-cell detection and hence have no choice but to deal with the problem of resolving duplicate detections later. Otherwise it's a lot cleaner to combine data at the pixel level, especially if you're starting from our main Object table's detections.
We don't, and "bulk" has implications that we don't want to get into. But services that transform or augment images as they are downloaded are something we've talked about for a long time.
My main concern with only providing FITS via a service is that I think there's still a lot of work defining the format of the FITS file we return. Stitching together the inner cells is not enough to make a coadd look just like a single-epoch image, at least not when you're thinking about the on-disk data model rather than a Python API. I don't want to put a lot of effort into an HDF5 format for cell-based coadds (especially not if I'm also putting a lot of effort into FITS+JSON serialization of other image data products), only to have the scope of what people want back from the FITS-transformer service creep up to near parity with what's in the HDF5.
@timj wrote:
I think that the binary table model is a reasonable one. It solves one of the biggest problems for interoperable, non-mission-specific interpretation of FITS files with enormous numbers of HDUs: there's nothing in the FITS standard, and precious little in the "registered conventions", for having a "table of contents" extension that tells you what's in the file.
By comparison, putting the subimages in a table solves this problem for free. You can put all sorts of metadata into additional columns in the table, and a reasonable application will give a user the ability to browse by this metadata.
It's true that today if you handed a big Green Bank style table to Firefly or Aladin you might not get acceptable UX, but that's because there's been little pressure on tools in this area to catch up with what's already in the standard.
It's important to recognize that Rubin isn't the only project with this need, and cell-based coadds aren't even the only application for it within Rubin: we also have an equally important requirement to be able to create multi-epoch bulk cutout files, both for cutouts around many objects, and for cutouts around one (possibly moving) object in many epochs.
So we are among friends, as it were, in looking for standards-conformant improvements in tool behavior.
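To make the table-of-contents point concrete, a sketch of one row per subimage with an array column plus browsable metadata columns (the column names and shapes are illustrative, not a proposed schema):

```python
import numpy as np
from astropy.io import fits

ncell, ny, nx = 4, 150, 150                          # hypothetical cell count/shape
images = np.zeros((ncell, ny, nx), dtype=np.float32)

# Sketch: a Green-Bank-style binary table, one row per cell, with the
# subimage in an array column and per-cell metadata in ordinary columns
# that any table browser can display.
cols = fits.ColDefs([
    fits.Column(name="CELL_ID", format="J", array=np.arange(ncell)),
    fits.Column(name="CRPIX1", format="D", array=np.zeros(ncell)),
    fits.Column(name="CRPIX2", format="D", array=np.zeros(ncell)),
    fits.Column(name="IMAGE", format=f"{ny * nx}E", dim=f"({nx},{ny})",
                array=images.reshape(ncell, ny * nx)),
])
hdu = fits.BinTableHDU.from_columns(cols, name="CELLS")
```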
We have a robust framework for retrieving "related data" in the RSP. When someone queries for coadded images, they'll get a list of images, most naturally, in the way they are chunked in the Butler (e.g., currently, one row per patch in the query result).
The way the image itself is then retrieved actually allows for N different products to be returned associated with that row in the query result. So you can return the full cell-based coadd FITS file verbatim from the backing store, or you can return a stitched-together simple FITS file for the whole area it covers, or you can return a PNG preview of it, or a table of its provenance, or pretty much anything else we can think of.
These additional related products can be things that already exist statically, or things that are created on the fly by services from the data on the backing store.
... and modern IVOA-flavored tools like Firefly and Aladin already know how to follow these links and present these options to users.
Agreed. This should work.
The FITS standard also explicitly says that the door is open to lossy compression algorithms for tables if a specific technical problem is solved. So we can try to pursue that further as we move along this road.
Given that `BZERO` and `BSCALE` don't provide a way to do subtractive dithering (and after hearing from @beckermr that this was important in practice), I now don't believe this approach would meet our needs. I've left the discussion of the possibility in, but have reframed it as an incomplete solution.
As mentioned above, it does appear that the current version of CFITSIO does support this.
On the Java side, which is important to Firefly, `nom.tam.fits` also claims to support it.
Fixed.
It appears I was wrong about this: as @gpdf discovered, both CFITSIO and at least one major Java FITS implementation (nom.tam.fits) do support table compression (CFITSIO just hasn't updated its docs to reflect this).
But having looked more closely at the table compression spec, I'm less sanguine about introducing our own quantization convention for lossy compression here, for two reasons:
1. Adding `BZERO` and `BSCALE` columns as per the Green Bank convention only works if there is only one image column in the binary table, i.e. we put each logical plane in its own binary table HDU (this seems generally true of the Green Bank convention: there's no provision for multiple image columns). This is fine, but I think it removes one of the advantages of the binary table form.
2. There'd be no way to record the "subtractive dithering" quantization (in which one adds a deterministic uniform random number before rounding to int, and subtracts it when decompressing). And while we could invent one (probably a `ZDITHER0` column, by analogy with the tiled image compression header key), users would have to use our code to decompress correctly in that case. While other file layouts would require users who don't use our code to do some work, none of them would require them to implement anything as low-level as decompression. @beckermr, do you know if you used subtractive dithering in your lossy-compression configuration? It'd be the `ZQUANTIZ` header key.
In fact, if we decide to do binary tables with one plane per HDU to work around the first problem, we could just make them the very binary tables used to implement tiled image compression, but add extra columns (which are explicitly permitted) for e.g. a Green Bank convention WCS or other metadata. FITS libraries often provide the option to read a compressed image HDU as an image or as a binary table, and with this format we'd provide all information in a fairly FITS-native way in at least one of those two interpretations. This does mean that the tiled image compression representation would correspond to the exploded image, of course.
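As an illustration of the "image or binary table" dual view, Astropy already exposes both for a tile-compressed HDU (a sketch with a hypothetical file name; any extra Green Bank columns would simply appear in the table view):

```python
from astropy.io import fits

# Sketch: the same tile-compressed HDU read both ways.
with fits.open("coadd_compressed.fits") as hdul:
    image_view = hdul[1].data   # decompressed pixels, via the CompImageHDU path

with fits.open("coadd_compressed.fits", disable_image_compression=True) as hdul:
    table_view = hdul[1].data   # the underlying binary table: COMPRESSED_DATA
                                # plus whatever extra columns are present
```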
We use SUBTRACTIVE_DITHER_2, which preserves zeros exactly. You need dithering to remove biases from the lossy compression, and having zeros go through unchanged has lots of nice advantages.
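For readers who haven't met it, the idea behind subtractive dithering can be shown in a few lines; this is a simplified illustration, not the exact FITS algorithm (ZSCALE/ZZERO handling, the standard's deterministic random sequence, and the zero-preserving special case of SUBTRACTIVE_DITHER_2 are all omitted):

```python
import numpy as np

rng = np.random.default_rng(12345)  # the FITS convention uses a deterministic,
                                    # seedable sequence so readers can reproduce it

def quantize(values, scale, dither):
    # Add a uniform offset in [0, 1) before rounding down to an integer ...
    return np.floor(values / scale + dither).astype(np.int32)

def dequantize(ints, scale, dither):
    # ... and subtract the same offset when decompressing, so the quantization
    # error is zero-mean noise rather than a systematic rounding bias.
    return (ints - dither + 0.5) * scale

values = rng.normal(scale=5.0, size=1000)
dither = rng.uniform(size=values.shape)
restored = dequantize(quantize(values, 0.01, dither), 0.01, dither)
assert np.abs(restored - values).max() <= 0.005 + 1e-9  # error bounded by scale / 2
```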
If, as stated elsewhere in the document, we can write with CFITSIO to POSIX files, then all we have to implement at first is reading via Astropy.
This has become a moot point because I was wrong about tiled table compression support in third-party libraries, and rewrote this section as a result. But at the same time I've realized that we do need some amount of low-level write support work in order to do our own quantization, since Astropy does not support that, and the stuff in `afw` that does it is a mess (dependencies on undocumented and not-obviously-public CFITSIO calls) and not easy to call outside `afw` C++ code. I've added some words on that in the latest revision.
It looks like we have two; having a Java version does help with this.
Fixed.
Yes, stitching across HDUs is supported by some FITS viewers that obey the not-very-well-standardized NOAO mosaic-focal-plane header conventions. We would want to include those headers. I'm not sure how those headers would handle the overlap regions the way we want, but that's an answerable question.
I believe that Firefly does not currently do this, but if we were to choose this representation, obviously we'd put that on the to-do list.
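For reference, that mosaic-style stitching is driven by per-HDU section keywords; a sketch of the cards involved for one cell (all values are placeholders, and exactly how DATASEC should describe the overlap regions is the open question above):

```python
from astropy.io import fits

# Sketch: NOAO mosaic-style keywords for one cell HDU (placeholder values).
header = fits.Header()
header["DETSIZE"] = "[1:30000,1:30000]"      # full tract extent
header["DATASEC"] = "[26:125,26:125]"        # inner (non-overlap) region of this cell
header["DETSEC"] = "[4226:4325,6326:6425]"   # where that inner region lands in the tract
```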
The MEDS format is basically exploded images, but with all of them unraveled into a single 1-D array of pixels. We then set the lossy compression tile boundaries to correspond to the exploded cells, one cell per tile.
To be clear, we store one exploded image per data type (e.g., one image HDU, one mask HDU, etc.), and they all share common metadata stored in a binary table HDU in the file.
I'm not familiar with this format, @beckermr . Do you have a reference?
https://github.com/esheldon/meds/wiki/MEDS-Format
I do not know if this is up to date with what has been done in practice though.
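For orientation, the access pattern the MEDS layout implies looks roughly like the following; the extension and column names follow the wiki page above, but the file name and indexing details are illustrative rather than a spec:

```python
import fitsio  # the CFITSIO-based reader commonly used with MEDS files

# Sketch: pull one cutout out of the single 1-D pixel array using the
# per-object metadata table (hypothetical file name).
with fitsio.FITS("coadd.meds.fits") as f:
    obj = f["object_data"].read()         # per-object metadata table
    box = int(obj["box_size"][0])         # cutout is box x box pixels
    start = int(obj["start_row"][0][0])   # first object, first cutout
    pixels = f["image_cutouts"][start : start + box * box]
    cutout = pixels.reshape(box, box)
```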
I would encourage storing the fixed-schema data as a table even if it is also in the JSON.
Have you seen performance issues with JSON in FITS?
We do this very rarely in DES, and it is a potentially lossy operation, especially with respect to floating-point numbers, etc.
You do have to be very careful with the floating-point numbers (NaNs and Infs especially), but the libraries we're using (mostly Pydantic) seem to be pretty good about that.
The JSON-in-FITS stuff is extremely new and we don't have any benchmarks of that specifically. But I don't think the combination raises any red flags, and we've been pretty happy with JSON serialization even in some very big data structures (not YAML, though; avoid YAML like the plague if you care about performance!).
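On the floating-point concern specifically: Python's JSON encoder emits shortest round-trip representations, so finite doubles survive exactly; it's NaN/Inf that need an explicit policy. A small sketch (the byte-array HDU at the end is illustrative, not the actual Rubin convention):

```python
import json
import numpy as np
from astropy.io import fits

meta = {"seeing": 0.1 + 0.2, "zeropoint": 31.4}      # plain Python floats
blob = json.dumps(meta)
assert json.loads(blob)["seeing"] == meta["seeing"]  # finite floats round-trip exactly

try:
    json.dumps({"bad": float("nan")}, allow_nan=False)
except ValueError:
    pass  # NaN/Inf are not legal JSON; they need an explicit mapping (e.g. to null)

# Sketch: carry the JSON blob in the file as a 1-D byte image HDU.
json_hdu = fits.ImageHDU(
    data=np.frombuffer(blob.encode("utf-8"), dtype=np.uint8).copy(),
    name="METADATA_JSON",
)
```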
My 2 cents: as an end user, I would prefer a binary table if the metadata has a "flat" (non-hierarchical) structure and FITS-compatible data types.
Using binary blobs makes data exploration/browsing more difficult for anyone who is not using DM-specific "data loaders", etc.