
Support HDF4? #216

Open
TomNicholas opened this issue Aug 7, 2024 · 29 comments
Labels
enhancement (New feature or request), references generation (Reading byte ranges from archival files)

Comments

@TomNicholas
Member

TomNicholas commented Aug 7, 2024

Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) @jgallagher59701 mentioned that DMR++ can (or soon will) support it.

I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how chunk elements can now contain nested elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this, and this 'documentation' is 'verifiable' since it's running code.

If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests (see #85), then presumably a reader for HDF4 directly to chunk manifests would also be possible?

cc @ayushnag @betolink

@TomNicholas TomNicholas added the enhancement (New feature or request) and references generation (Reading byte ranges from archival files) labels Aug 7, 2024
@jgallagher59701

jgallagher59701 commented Aug 8, 2024 via email

@TomNicholas
Member Author

TomNicholas commented Aug 20, 2024

@martindurant has an in-progress PR to kerchunk to add support for reading HDF4 directly. If that makes it in we can just call it from vz.open_virtual_dataset, which would fully close this issue.

@martindurant
Member

I should warn you that I am working to match only specific NASA data (provided by @maxrjones), not HDF4 in general, and I suspect the chunks may generally be tiny.

@jgallagher59701

Older data in HDF4/5 almost always has small chunks (spinning disks, low-latency, small block sizes). But that is not a big problem. Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel. We call these grouped chunks 'Super Chunks.' It is an optimization that Patrick Quinn first implemented and we stumbled on later. This is far more efficient than transferring the small chunks in parallel (in general, exceptions exist).
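The "Super Chunk" idea described above can be sketched in a few lines of Python. This is a hypothetical illustration, not DMR++ or kerchunk code; the chunk layout, sizes, and zlib compression are all assumptions made for the example:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def group_contiguous(chunks):
    """Merge (offset, length) byte ranges that are exactly adjacent in the file."""
    groups = []
    for off, length in sorted(chunks):
        if groups and groups[-1][0] + groups[-1][1] == off:
            start, total = groups[-1]
            groups[-1] = (start, total + length)   # extend the current group
        else:
            groups.append((off, length))           # start a new group
    return groups

# Fake "file": three compressed chunks written back to back.
payloads = [zlib.compress(bytes([i]) * 100) for i in range(3)]
file_bytes = b"".join(payloads)

# Per-chunk (offset, length), as a chunk manifest would record them.
chunks, off = [], 0
for p in payloads:
    chunks.append((off, len(p)))
    off += len(p)

# All three ranges are contiguous, so one I/O covers them (a "Super Chunk").
(start, total), = group_contiguous(chunks)
superchunk = file_bytes[start:start + total]       # single read

# Decompress the member chunks in parallel.
with ThreadPoolExecutor() as pool:
    arrays = list(pool.map(
        lambda c: zlib.decompress(superchunk[c[0] - start:c[0] - start + c[1]]),
        chunks))
```

The single large read replaces three small ones; only the decompression is fanned out to threads.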

@martindurant
Member

Yes, kerchunk also joins near-contiguous chunks; the problems I actually see are:

  • the large number of references means relatively big reference stores
  • relatively small gains for reading only select chunks, compared to grabbing the whole file every time.

@TomNicholas
Member Author

Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.

Yes, kerchunk also joins near-contiguous chunks

This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

@jgallagher59701

Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.
Yes, kerchunk also joins near-contiguous chunks

This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

I mean concatenating the byte ranges. Often in these files the chunks lie right next to each other (for a given array).

@jgallagher59701

Yes, kerchunk also joins near-contiguous chunks; the problem I actually see
...

  • relatively small gains for reading only select chunks compared to grabbing the whole file every time.

That's true for files with a small number of variables: get the whole file. If there are O(10^2) variables and only 2-3 are needed, it's faster to get just those 2-3. Again, there are exceptions.
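A back-of-envelope model of that trade-off, with all numbers below being illustrative assumptions rather than measurements:

```python
# Illustrative, assumed numbers: per-request latency, link bandwidth,
# and a file of 100 equal-sized variables of which only 3 are wanted.
latency = 0.05        # seconds per HTTP range request (assumed)
bandwidth = 50e6      # bytes/second (assumed)
var_size = 10e6       # bytes per variable (assumed)
n_vars, n_wanted = 100, 3

whole_file = latency + n_vars * var_size / bandwidth      # one big read
selected = n_wanted * (latency + var_size / bandwidth)    # 3 range reads

# Here the selected reads win (~0.75 s vs ~20 s); but if the "variables"
# become many tiny chunks, the per-request latency term grows and the
# balance can flip - the exception noted above.
```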

@martindurant
Member

In ReferenceFS, if you cat() with a number of references, those within a single file may be merged depending on the arguments

        max_gap=64_000,
        max_block=256_000_000,

For example, for references [remote://file, 10, 10], [remote://file, 30, 10], the actual request will be bytes 10->40, if the gap is smaller than max_gap. The result is then sliced into the two outputs.
Naturally, with max_gap=0 only truly contiguous parts are merged, and with <0 there is no merging at all. The requests would still be concurrent, however.
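A minimal sketch of that gap-based merging (hypothetical code, not ReferenceFS internals; only the `max_gap` behaviour described above is modelled):

```python
def merge_ranges(ranges, max_gap=64_000):
    """Merge sorted (offset, length) requests whose gap is <= max_gap.

    Returns the merged ranges plus, for each input, (merged_index,
    start, stop) to slice its bytes back out of the merged block.
    """
    merged, slices = [], []
    for off, length in sorted(ranges):
        end_prev = merged[-1][0] + merged[-1][1] if merged else None
        if merged and off - end_prev <= max_gap:
            start = merged[-1][0]
            merged[-1] = (start, off + length - start)   # extend request
        else:
            merged.append((off, length))                 # new request
        start = merged[-1][0]
        slices.append((len(merged) - 1, off - start, off - start + length))
    return merged, slices

# The example from the comment: two 10-byte references at offsets 10 and
# 30 become one request for bytes 10->40, sliced into two outputs.
data = bytes(range(100))                         # stand-in for the remote file
merged, slices = merge_ranges([(10, 10), (30, 10)])
blocks = [data[o:o + n] for o, n in merged]      # one actual request
outputs = [blocks[i][a:b] for i, a, b in slices]
```

With `max_gap=0` only exactly contiguous ranges merge, and any negative value disables merging entirely, matching the behaviour described above.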

@mdsumner
Contributor

mdsumner commented Aug 20, 2024

That said, we have a full interpreter in C++, which i realize is not exactly enticing for many

Why aren't we using DMR++? Is it not in good enough shape to bind to Python/R? Are there other challenges? There's plenty of C++ used seamlessly from Python, and calling out to the HDF5 libs is doing that anyway.

That sounds like the cross-language solution already?? I only have a few HDF4 stores of interest outside of NASA, and maybe only one.

@mdsumner
Contributor

There's something I'm missing given #113 🙏. I'll keep exploring; I keep finding new aspects 👌.

@martindurant
Member

I'm sorry if I have duplicated some work. I think it may be worthwhile to have a pure-python solution too, though, for the case that no DMR++ index files exist for some HDF4. Also, it has been (so far) nerdy fun, and definitely worth a blog post.

@maxrjones
Member

https://github.com/fhs/pyhdf/ also reads HDF4 and SatPy uses it to read MODIS. I'm wondering if it could be helpful for Kerchunk as well.

@jgallagher59701

I wonder if Ayush's work on VirtualiZarr has a DMR++ parser (pure python) you could use? The DMR++ builder is C++, but we actually have a DMR++ builder web service that we can expose for HDF5, and we could do the same thing for HDF4.

It would be interesting to see how close we could get to valid Kerchunk from DMR++ using a simple transform. Just a thought; I don't see myself having time for that any time soon...

@ayushnag
Contributor

My code mostly extracts the necessary zarr metadata and then builds it into a virtualizarr data structure at the end of each function. So by just modifying that last step, creating a kerchunk reader is definitely possible. Also, interestingly, you could go dmrpp --> virtualizarr --> kerchunk, since virtualizarr supports writing out to kerchunk.

However, I have only developed and tested for netcdf4 and hdf5, so there will certainly be some work needed to support hdf4.

@martindurant
Member

I have only developed and tested for netcdf4 and hdf5 so there will certainly be some work needed to support hdf4

Is there no hdf4 work? It is very different.

@ayushnag
Contributor

No, there isn't any hdf4 work yet. However, it seems like the goal is to make the hdf4 dmrpp spec very similar to the hdf5 one, which means it will require some sort of extension (as opposed to a rewrite), as James mentioned above:

Bottom line, you will probably have to extend the interpreter you have

@martindurant
Member

My HDF4 branch in kerchunk is very nearly complete. Everyone welcome to look!

As for pyhdf..., to use it you need a very deep understanding of the specifics of the conventions used in a given file (maybe possible for MODIS) and of how the C API works. If I can make my version work, I prefer pure-python.

@betolink

betolink commented Aug 22, 2024

Is this the code? https://github.com/martindurant/fsspec-reference-maker/blob/df61060869e367da9674d33962631d81ead76865/kerchunk/hdf.py#L697 Seeing terms like "SDD" gave me flashbacks of the first time I opened one of these files. Thanks for all the work! Can we just throw some examples at it?

@martindurant
Member

Yes, that code. Please do play with it, but of course there are no guarantees.

@TomNicholas
Member Author

Looks like @martindurant 's kerchunk HDF4 reader is in kerchunk main - it's in the docs here, though perhaps not yet in a released version of kerchunk?

This means that someone could easily use it to add a VirtualiZarr HDF4 reader.

@martindurant
Member

martindurant commented Nov 15, 2024

Correct, I will do a kerchunk release today.

-edit-

done

@TomNicholas
Member Author

Thanks @martindurant !

Does someone have a small example HDF4 file we could use in VirtualiZarr's tests? It doesn't look like either of the PRs ((1), (2)) adding the HDF4 reader to kerchunk contains any tests...

@mdsumner
Contributor

Maybe this one

https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/hdf4

But I'll look through our archives; there'll be something 🙏

@mdsumner
Contributor

@martindurant
Member

The specific files used for development were behind a NASA signup and accept-conditions, so I don't think we can include them for tests here.

In addition, we don't really have a baseline expectation of what the output ought to look like - loading HDF4 with xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.

@TomNicholas
Member Author

In addition, we don't really have a baseline expectation of what the output ought to look like - loading HDF4 with xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.

Okay - in that case it would be good to better understand the relationship between the HDF4 data model and the xarray data model when creating this reader, otherwise we're going to end up with confusions similar to the tiff case (see #291 (comment)).

Again it would be great if someone who actually uses HDF4 wanted to have a go at a PR for this.

@martindurant
Member

You may not find it satisfactory, but I think processing datasets that possibly don't fit neatly into xarray's model should be considered somewhat expert: the user needs to know some details of their data and how they expect it to turn out in zarr form. That would include needing, in some cases, to specify whether a thing is an array, an array with coordinates, a dataset, or a tree.

@jgallagher59701

Hi, you might be interested in some of the work we're doing at the behest of NASA WRT HDF4 and HDF-EOS2. The DMR++ encoding and our interpreter now support HDF4/EOS2, at least as far as NASA has taken it (like HDF5, there's quite a bit to the HDF4 data model, as I'm sure you know).
