kerchunk interop #2889
Comments
Sorry for my ignorance, but what is kerchunk? What does it need?
Regarding pyhdf, I recently looked into it again, and one problem is that it does not expose the chunking information of the underlying file (that was a choice made when the implementation was done, see here). Other than that, it seems to be working nicely for our needs.
See #2884 for recent discussion of when dask changes broke our pyhdf wrappers.
I should have started with this :)
I definitely got that far, associating chunk sets (offset/length/index) and dtype/compression with arrays! The struggle I have is working out the grouping hierarchy: which dimensions belong to which variable.
I'm not sure what you mean. Are you allowed to use a library like pyhdf to do the initial parsing? If so then
Certainly, I just didn't know about it when I got going. I also have an irrational itch to solve it all in one place in pure python, but ...
This was interesting. Would a pyhdf xarray backend be helpful for satpy even after the fix in #2886? I think this would be helpful for xarray workflows outside of satpy (e.g., https://discourse.pangeo.io/t/reading-modis-thermal-anomalies-into-xarray/4437) and may be able to find some time to work on it.
Could this get implemented in pyhdf instead of kerchunk using
Another thing to consider for parsing: if you're already using NetCDF libraries to parse NetCDF (.nc) files, you could also use them to parse HDF4 files, provided the NetCDF C library was compiled with HDF4 support. This is the case for the conda-forge build of netcdf4.
I for one would love this.
Yes, it absolutely could, but I got lost in the SWIG interface and didn't know where to start to add it to pyhdf...
I definitely share that itch, but I'm always worried about performance...
I'm sure it can read them, but we want the chunk information rather than the data. For HDF5, it makes data access MUCH faster (depending on access pattern, of course), e.g., https://nbviewer.org/github/cgentemann/cloud_science/blob/master/zarr_meta/cloud_mur_v41_benchmark.ipynb
Seeking around a remote file to get all those 4-byte definitions will perform poorly no matter how you do it, so it seems to me that extracting what you need once and then reading it in one shot at access time should be much faster. However, HDF4 does tend towards tiny chunks (since the conventions predate the cloud era), so it would mean storing lots of chunks, even though they are probably contiguous in the file.
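To illustrate the point about many tiny but contiguous chunks: a minimal sketch of merging adjacent (offset, length) byte ranges into fewer, larger reads. This is not kerchunk's or pyhdf's code; the function name `coalesce` and the `max_gap` parameter are hypothetical, chosen only for this example.

```python
def coalesce(ranges, max_gap=0):
    """Merge (offset, length) byte ranges that touch or overlap.

    Ranges separated by at most `max_gap` bytes are merged too, which
    trades a few wasted bytes for fewer requests.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            # Extend the previous range to cover this one.
            start, prev_len = merged[-1]
            end = max(start + prev_len, offset + length)
            merged[-1] = (start, end - start)
        else:
            merged.append((offset, length))
    return merged


# Three 4-byte chunks stored back to back, plus one far away.
tiny_chunks = [(0, 4), (4, 4), (8, 4), (100, 4)]
merged_chunks = coalesce(tiny_chunks)
```

Note that merging only reduces the number of reads; the reference set would still need one entry per chunk so that each chunk can be located inside the merged block.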
Sorry for going off-topic, but I suppose that does not apply to real-time processing, where you just need to read the data once, process it, save the result, and in the end delete the data file? Not that this is a show stopper in any way, just curious for my main use of data :)
No. In that case, you may as well grab the whole file locally (in memory) and you end up scanning the bits of metadata just the once, and probably processing every byte of array data.
FYI I see kerchunk / virtualizarr as a form of caching. You're caching the result of the step where given a so-far-unseen file you have to seek through it to read the metadata and find the positions and lengths of the byte ranges for the chunks inside the file. On local filesystems this step is quick and so there isn't much benefit to caching, but on object storage there is no With virtualizarr's design you can also think of this workflow as caching the result of
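As a rough picture of what gets cached: a minimal sketch that assembles a kerchunk-style (version 1) reference dict for a single uncompressed array from already-known chunk offsets and lengths. The helper `build_refs`, the URL, and the array name are all hypothetical; real HDF4 data would also need the correct compressor and filter entries, which are simply assumed absent here.

```python
import json


def build_refs(url, name, shape, chunks, dtype, chunk_records):
    """Build a kerchunk-style reference dict for one array.

    chunk_records: iterable of (chunk_index_tuple, byte_offset, byte_length).
    """
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        "compressor": None,  # assumption: chunks are stored raw
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{name}/.zarray": json.dumps(zarray),
    }
    for index, offset, length in chunk_records:
        # Zarr v2 chunk keys join the chunk indices with dots, e.g. "1.0".
        key = f"{name}/" + ".".join(str(i) for i in index)
        refs[key] = [url, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical 4x4 uint8 array split into two 2x4 chunks.
example_refs = build_refs(
    "s3://hypothetical-bucket/file.hdf",
    "my_array",
    shape=(4, 4),
    chunks=(2, 4),
    dtype="|u1",
    chunk_records=[((0, 0), 1024, 8), ((1, 0), 1032, 8)],
)
```

Once a dict like this is saved (for example as JSON), later readers can skip the metadata scan and go straight to the byte ranges.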
That's a decent way to describe it; but some (C) backend libraries won't read from remote or won't do concurrent/parallel reads at all. It wouldn't be too surprising if, for instance, pyhdf had global state that prevented threaded use; eccodes (for cfgrib) certainly does.
After poking around the trial dataset (MOD14.A2024226.2345.061.2024227034233.hdf - I don't actually know which variant this is within satpy) with pyhdf, I found that my code actually did find exactly the right set of tags and hierarchy. My question is, when opened with xarray/rasterio/gdal, I get coordinates
which are not arrays in the file, but clearly generated somehow. Is it just pixel centres??
The remote and/or parallel reading is a good point. I don't think hdf4 supports remote reading, and I highly doubt any new development on the library would be done to support it.
I don't really know what gdal does tbh, but the convention is pixel centres yes. That being said, I don't think this is something an hdf4 library should generate, that should be more on the user library side like rioxarray or satpy really.
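For reference, the pixel-centre convention can be sketched as below. This assumes the coordinates come from a regular grid described by an origin (the outer edge of the first pixel) and a pixel size; whether gdal derives them this way for this particular file is not confirmed here, and `pixel_centres` is a hypothetical helper name.

```python
def pixel_centres(origin, pixel_size, count):
    """Return the centre coordinate of each pixel along one axis.

    origin: coordinate of the outer edge of the first pixel (assumption).
    pixel_size: signed step per pixel (often negative for the y axis).
    """
    return [origin + (i + 0.5) * pixel_size for i in range(count)]


x_coords = pixel_centres(0.0, 1.0, 3)
y_coords = pixel_centres(10.0, -2.0, 2)
```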
I'm not sure I understand this point. Once you have the byte offsets and lengths, you don't need to use any specialized backend libraries to read the data, you just read those byte ranges (e.g. via HTTP range requests to object storage). You can do that with as much parallelism as you like, because object storage doesn't have file locks.
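A minimal sketch of that idea, using only the standard library. A local temporary file stands in for object storage and plain `seek`/`read` stands in for HTTP range requests; `read_range` and `read_ranges` are hypothetical names, and the decoding of the returned bytes (dtype, compression) is left out.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length):
    """Read one byte range. Each call opens its own handle, so
    concurrent workers never share a seek position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_ranges(path, ranges, max_workers=8):
    """Fetch several (offset, length) ranges concurrently, in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_range, path, off, n) for off, n in ranges]
        return [fut.result() for fut in futures]


# Demo: a 256-byte file whose byte values equal their offsets.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
    demo_path = tmp.name

demo_chunks = read_ranges(demo_path, [(0, 4), (16, 4), (252, 4)])
os.remove(demo_path)
```

With a remote store, `read_range` would issue a ranged GET instead of opening a file; the concurrency pattern stays the same.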
Exactly true, but if you use one third-party library (like netcdf4), it very probably acts on a file-like object with seek/read and serial, blocking access. Otherwise, you need another layer to know what to do with those byte blocks.
Ref:
Feature Request
I was recently sniped into considering HDF4 as a target for kerchunk, largely because of the amount of NASA archival data in the format. I blindly set about decoding the format and made a decent amount of progress (see linked PR) to the point of seeing the arrays and attributes in a file.
After this, it was mentioned that this repo has a lot of relevant code (along with the pyhdf helper), and indeed! You seem to have solved not only hdf4 but all the peculiarities of many more specific archive conventions.
So, how hard would it be to extract out kerchunk references given that you can already pipe dask chunks into xarray?
cc @maxrjones @TomNicholas