
Slicing not supported with Indexer when trying to open many MUR files. #360

Open

betolink opened this issue Dec 20, 2024 · 5 comments

Labels
usage example Real world use case examples

betolink commented Dec 20, 2024

xref: nsidc/earthaccess#903

I'm trying to open 10 years of high-resolution MUR data using earthaccess. The upcoming example for one month (32 granules) works fine, but when I try to open ~1000 granules I get the following error:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1031, in apply_indexer(indexable, indexer)
   1029     return indexable.vindex[indexer]
   1030 elif isinstance(indexer, OuterIndexer):
-> 1031     return indexable.oindex[indexer]
   1032 else:
   1033     return indexable[indexer]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:369, in IndexCallable.__getitem__(self, key)
    368 def __getitem__(self, key: Any) -> Any:
--> 369     return self.getter(key)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1508, in NumpyIndexingAdapter._oindex_get(self, indexer)
   1506 def _oindex_get(self, indexer: OuterIndexer):
   1507     key = _outer_to_numpy_indexer(indexer, self.array.shape)
-> 1508     return self.array[key]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:214, in ManifestArray.__getitem__(self, key)
    212     return self
    213 else:
--> 214     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

NotImplementedError: Doesn't support slicing with (array([....]))

I'm not sure whether something is wrong with the data (a misaligned grid in one granule, or something like that). @TomNicholas mentioned that this may be related to #51.

The full example (earthaccess has to be installed from source):

import earthaccess

# NASA JPL Multiscale Ultrahigh Resolution (MUR) Sea Surface Temperature (SST)
# dataset - 0.01 degree resolution
# (assumes prior authentication, e.g. earthaccess.login())
results = []
for year in range(2010, 2020):
    # search one December-February season at a time
    seasonal_granules = earthaccess.search_data(
        temporal=(f"{year}-12", f"{year+1}-02"),
        short_name="MUR-JPL-L4-GLOB-v4.1",
        cloud_hosted=True,
    )
    results.extend(seasonal_granules)

len(results)
913

# and then 

%%time
mur = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    load=True, # it also failed with load=False
    parallel=True,
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)
mur
betolink changed the title from "Indexer not supported error when trying to use MUR" to "Slicing not supported with Indexer when trying to open many MUR files." on Dec 20, 2024
TomNicholas added the "usage example (Real world use case examples)" label on Dec 20, 2024
@TomNicholas (Member)

I'm not sure why this operation requires indexing at all...

Can you open the files individually and then concatenate them to try to narrow down where / for which dataset the error is occurring?

ayushnag (Contributor) commented Dec 20, 2024

I took that approach and narrowed it down a bit. The code snippet below uses the same results variable from @betolink's example above and reproduces the same error with just two xr.Datasets. The main difference I found between the two datasets is that one has an extra variable (dt_1km_data), which is also shown below. I guess at some point the MUR netCDF product added a variable to the granules, which presumably makes them difficult to concatenate with older references.

Also, FYI, I checked the source netCDF files and verified that one really does have an extra variable, to rule out a dmrpp reader error. (A quick check of which granules carry the extra variable is sketched after the output below.)

import dask
import earthaccess
import virtualizarr as vz
import xarray as xr

fs = earthaccess.get_s3_filesystem(results=results)
fs.storage_options["anon"] = False  # type: ignore

open_ = dask.delayed(vz.open_virtual_dataset)  # type: ignore

# Build the list of virtual datasets (as dask delayed objects)
vdatasets = []
for g in results:
    vdatasets.append(
        open_(
            filepath=g.data_links(access="direct")[0] + ".dmrpp",
            filetype="dmrpp",  # type: ignore
            indexes={},
            reader_options={"storage_options": fs.storage_options},  # type: ignore
        )
    )
vdatasets = dask.compute(vdatasets)[0]

# Combining just these two granules reproduces the NotImplementedError
xr.combine_nested(
    vdatasets[495:497],
    concat_dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)

print(vdatasets[495])
print(vdatasets[496])
<xarray.Dataset> Size: 5GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    dt_1km_data       (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
<xarray.Dataset> Size: 4GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
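As a quick sanity check, here is a minimal sketch (assuming the vdatasets list built above) of one way to see where the variable set changes across granules:

from collections import Counter

# Group the virtual datasets by their set of data variable names; for MUR we
# expect two groups, one with dt_1km_data and one without.
var_sets = [frozenset(ds.data_vars) for ds in vdatasets]
print(Counter(var_sets))

# Index of the first granule that carries the extra variable (hypothetical
# helper, not part of the original comment).
first_with_extra = next(
    (i for i, vs in enumerate(var_sets) if "dt_1km_data" in vs), None
)
print(first_with_extra)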

@dcherian

Ha, I knew this was an issue with model output (variables can appear and disappear), but it's news to me that it can happen with nominally standard archival data. FWIW, I think xarray will try to reindex dt_1km_data onto the full time vector by indexing with [-1, -1, -1, -1 ..... N-1, N], which is why you get the error.
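To illustrate that reindexing step, here is a minimal sketch with made-up in-memory datasets (not the real MUR granules): one dataset carries an extra variable, mirroring dt_1km_data appearing only in newer granules.

import xarray as xr

ds_a = xr.Dataset(
    {"analysed_sst": ("time", [280.0]), "dt_1km_data": ("time", [1.0])},
    coords={"time": [0]},
)
ds_b = xr.Dataset(
    {"analysed_sst": ("time", [281.0])},
    coords={"time": [1]},
)

combined = xr.combine_nested([ds_a, ds_b], concat_dim="time")

# dt_1km_data is reindexed onto the full time axis and the missing slot is
# filled with NaN. For numpy-backed data this works; for ManifestArray-backed
# data the same reindex step builds a fancy indexer containing -1 entries,
# which ManifestArray.__getitem__ rejects with the NotImplementedError above.
print(combined["dt_1km_data"].values)  # [ 1. nan]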

@ayushnag (Contributor)

@dcherian Yup, the error output if I change the range of results to [545:555] is:

NotImplementedError: Doesn't support slicing with (array([-1, -1, -1,  0,  1,  2,  3,  4,  5,  6]), slice(None, None, None), slice(None, None, None))

@TomNicholas (Member)

Huh. I'm not sure I fully understand yet - is this an example of trying to auto-generate indexes? Or an xarray "virtual variable" thing (not a virtual variable in the virtualizarr sense)?
