Coordinates lost with GPM-IMERG file #342

Open
rabernat opened this issue Dec 12, 2024 · 6 comments
Labels
bug · CF conventions · Kerchunk · readers

Comments

@rabernat (Collaborator)

I'm working with the GPM-IMERG files from NASA. Here's an example

import xarray as xr
import fsspec

url = "https://earthmover-sample-data.s3.us-east-1.amazonaws.com/hdf5/3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5"
ds = xr.open_dataset(fsspec.open(url).open(), engine="h5netcdf", group="Grid", decode_coords="all")
print(ds)
<xarray.Dataset> Size: 104MB
Dimensions:                         (time: 1, lon: 3600, lat: 1800, nv: 2,
                                     lonv: 2, latv: 2)
Coordinates:
  * time                            (time) object 8B 1998-01-01 00:00:00
  * lon                             (lon) float32 14kB -179.9 -179.9 ... 179.9
  * lat                             (lat) float32 7kB -89.95 -89.85 ... 89.95
    time_bnds                       (time, nv) object 16B ...
    lon_bnds                        (lon, lonv) float32 29kB ...
    lat_bnds                        (lat, latv) float32 14kB ...
Dimensions without coordinates: nv, lonv, latv
Data variables:
    precipitation                   (time, lon, lat) float32 26MB ...
    randomError                     (time, lon, lat) float32 26MB ...
    probabilityLiquidPrecipitation  (time, lon, lat) float32 26MB ...
    precipitationQualityIndex       (time, lon, lat) float32 26MB ...
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...

When opening this dataset virtually, many of the coordinates are lost.

from virtualizarr import open_virtual_dataset

dsv = open_virtual_dataset(url, indexes={}, group="Grid")
print(dsv)

This spits out the following warning

/srv/conda/envs/notebook/lib/python3.12/site-packages/kerchunk/hdf.py:547: UserWarning: The following excepion was caught and quashed while traversing HDF5
Can't get fill value (fill value is undefined)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.12/site-packages/kerchunk/hdf.py", line 363, in _translator
    fill = h5obj.fillvalue
           ^^^^^^^^^^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/srv/conda/envs/notebook/lib/python3.12/site-packages/h5py/_hl/dataset.py", line 622, in fillvalue
    self._dcpl.get_fill_value(arr)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
RuntimeError: Can't get fill value (fill value is undefined)

  warnings.warn(msg)
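
For reference, the warning comes from kerchunk reading h5obj.fillvalue on a dataset whose fill value is undefined. A minimal sketch of the guard that avoids it, assuming local access to the file (illustrative only, not kerchunk's actual fix):

import h5py

# Check whether a fill value is defined before touching dset.fillvalue,
# which raises RuntimeError when the fill value is undefined.
with h5py.File("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5", "r") as f:
    dset = f["Grid/lon"]
    dcpl = dset.id.get_create_plist()  # low-level dataset creation property list
    if dcpl.fill_value_defined() == h5py.h5d.FILL_VALUE_UNDEFINED:
        fill = None  # no fill value stored in the file
    else:
        fill = dset.fillvalue
    print(fill)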

It also returns a dataset with most of the coordinates missing:

<xarray.Dataset> Size: 91MB
Dimensions:                         (time: 1, lon: 3600, lat: 1800, nv: 2)
Coordinates:
    time                            (time) int32 4B ManifestArray<shape=(1,),...
Dimensions without coordinates: lon, lat, nv
Data variables:
    precipitation                   (time, lon, lat) float32 26MB ManifestArr...
    precipitationQualityIndex       (time, lon, lat) float32 26MB ManifestArr...
    probabilityLiquidPrecipitation  (time, lon, lat) int16 13MB ManifestArray...
    randomError                     (time, lon, lat) float32 26MB ManifestArr...
    time_bnds                       (time, nv) int32 8B ManifestArray<shape=(...
Attributes:
    GridHeader:   BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeR...
    coordinates:  time

I also tried with the new kerchunk-free backend and got an error

from virtualizarr.readers.hdf import HDFVirtualBackend

open_virtual_dataset(url, indexes={}, group="Grid", drop_variables=["Intermediate"], backend=HDFVirtualBackend)
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf/hdf.py:129, in HDFVirtualBackend._dataset_chunk_manifest(path, dataset)
    127 num_chunks = dsid.get_num_chunks()
    128 if num_chunks == 0:
--> 129     raise ValueError("The dataset is chunked but contains no chunks")
    130 shape = tuple(
    131     math.ceil(a / b) for a, b in zip(dataset.shape, dataset.chunks)
    132 )
    133 paths = np.empty(shape, dtype=np.dtypes.StringDType)  # type: ignore

ValueError: The dataset is chunked but contains no chunks
@TomNicholas (Member) commented Dec 12, 2024

RuntimeError: Can't get fill value (fill value is undefined)

That's weird, but it's an upstream issue in kerchunk. EDIT: possibly related to fsspec/kerchunk#177.

returns a dataset with most of the coordinates missing

This is almost certainly another symptom of the fact that VirtualiZarr does not yet understand how to decode according to CF conventions in the same way that xarray does. Some of those variables are present; they are just set as data variables instead of coordinates, which is the exact same bug as in issue #189. Other coordinates (lon_bnds and lat_bnds) are presumably created by xarray from metadata rather than actually existing in the file.

I have an in-progress PR (#224) that calls xarray's internal CF decoding logic from within VirtualiZarr.
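
To illustrate what that decoding does (a minimal, self-contained sketch, not the PR's actual code): xarray promotes any variable named in another variable's coordinates attribute from a data variable to a coordinate.

import numpy as np
import xarray as xr

# "lon" starts life as a data variable, but it is named in the "coordinates"
# attribute of "precipitation", so decode_cf promotes it to a coordinate.
raw = xr.Dataset({
    "precipitation": ("time", np.zeros(3), {"coordinates": "lon"}),
    "lon": ("time", np.arange(3.0)),
})
decoded = xr.decode_cf(raw)
print(list(decoded.coords))  # ['lon']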

I also tried with the new kerchunk-free backend and got an error

This rings a bell, but I defer to @sharkinsspatial.

TomNicholas added the bug, Kerchunk, CF conventions, and readers labels on Dec 12, 2024
@TomNicholas (Member)

Also, we should probably have some more tests that explicitly check that all the same variables and coordinates are present for a set of different real-world example files, whether they are opened with virtualizarr or with xarray.
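
A rough sketch of such a test, using the same open calls as above (the helper name is hypothetical):

import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

def assert_same_variables(url: str, group: str | None = None) -> None:
    # Hypothetical test helper: both open paths should expose the same
    # variable and coordinate names for the same file and group.
    expected = xr.open_dataset(fsspec.open(url).open(), engine="h5netcdf", group=group)
    actual = open_virtual_dataset(url, indexes={}, group=group)
    assert set(actual.variables) == set(expected.variables)
    assert set(actual.coords) == set(expected.coords)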

@rabernat (Collaborator, Author)

This is almost certainly another symptom of the fact that VirtualiZarr does not yet understand how to decode according to CF conventions in the same way that xarray does.... Some of those variables are present; they are just set as data variables instead of coordinates

I don't think that's the case here. The following variables are present in the underlying HDF5 dataset but don't appear anywhere in the virtualized dataset, either as coordinates or data variables: nv, lonv, latv, lon, lat, time_bnds, lon_bnds, lat_bnds. Here's a dump of the HDF5 variables, their chunk shapes, and the number of initialized chunks.

nv (2,) 0
lonv (2,) 0
latv (2,) 0
time (32,) 1
lon (3600,) 1
lat (1800,) 1
time_bnds (32, 2) 1
lon_bnds (3600, 2) 1
lat_bnds (1800, 2) 1
precipitation (1, 145, 1800) 25
randomError (1, 145, 1800) 25
probabilityLiquidPrecipitation (1, 291, 1800) 13
precipitationQualityIndex (1, 145, 1800) 25

The variables are just being skipped completely.
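
A sketch of how a dump like the one above can be produced, assuming a local copy of the file (get_num_chunks() reports the number of chunks actually allocated in the file):

import h5py

# Print each Grid variable's chunk shape and stored-chunk count;
# 0 stored chunks means the data exists only as the (implicit) fill value.
with h5py.File("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5", "r") as f:
    for name, dset in f["Grid"].items():
        if isinstance(dset, h5py.Dataset) and dset.chunks is not None:
            print(name, dset.chunks, dset.id.get_num_chunks())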

@ayushnag (Contributor)

This actually seems to be an error in kerchunk? I made a kerchunk JSON of the file and the missing variables (lon, lat, lon_bnds, lat_bnds) do not even have a corresponding .zarray in the JSON.

from kerchunk.hdf import SingleHdf5ToZarr
import ujson
import xarray as xr
m = SingleHdf5ToZarr('3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5')
m2 = m.translate()
with open("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5.json", "w") as f:
    f.write(ujson.dumps(m2))

print(xr.open_dataset("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5.json", engine="kerchunk", group="Grid"))
<xarray.Dataset> Size: 104MB
Dimensions:                         (Grid/time: 1, Grid/lon: 3600,
                                     Grid/lat: 1800, Grid/nv: 2)
Coordinates:
    time                            (Grid/time) object 8B ...
Dimensions without coordinates: Grid/time, Grid/lon, Grid/lat, Grid/nv
Data variables:
    precipitation                   (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    precipitationQualityIndex       (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    probabilityLiquidPrecipitation  (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    randomError                     (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    time_bnds                       (Grid/time, Grid/nv) object 16B ...
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...
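
One way to confirm that from the references themselves, reusing m2 from the snippet above (a sketch; the key layout follows the kerchunk version 1 format):

# Collect every variable that has a .zarray entry in the references,
# then check the variables missing from the virtual dataset.
refs = m2["refs"]
with_zarray = {key.rsplit("/", 1)[0] for key in refs if key.endswith("/.zarray")}
for var in ["Grid/lon", "Grid/lat", "Grid/lon_bnds", "Grid/lat_bnds", "Grid/time"]:
    print(var, var in with_zarray)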

@TomNicholas (Member)

I made a kerchunk JSON of the file and the missing variables (lon, lat, lon_bnds, lat_bnds) do not even have a corresponding .zarray in the JSON.

That definitely sounds wrong, and would throw virtualizarr's kerchunk translator off completely.

But this Grid/time should not happen either; it should just be time. The group name should never appear in the dimension name like that. So even fsspec isn't reading these kerchunk references back correctly.
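
For debugging, the dimension names xarray sees come from each variable's _ARRAY_DIMENSIONS attribute in the references (the xarray-on-Zarr convention). A quick sketch, again reusing m2, to inspect what kerchunk wrote; if it contains "Grid/time" rather than "time", that would explain the output above:

import ujson

# Inspect the dimension names recorded for the time variable.
zattrs = ujson.loads(m2["refs"]["Grid/time/.zattrs"])
print(zattrs.get("_ARRAY_DIMENSIONS"))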

@rabernat (Collaborator, Author)

FWIW we got this file working fairly well with @sharkinsspatial's HDF-native reader: https://github.com/earth-mover/icechunk-nasa/blob/main/notebooks/write_virtual.ipynb
