
Add tutorial notebook for open_virtual_dataset #903

Open
ayushnag opened this issue Dec 17, 2024 · 24 comments

Comments

@ayushnag
Collaborator

The newly added open_virtual_dataset and open_virtual_mfdataset functions need a tutorial notebook to show example usage. The current sections I have planned are:

  1. Opening a dataset
  2. Opening a virtual dataset, saving the reference file, and loading the dataset with xarray
  3. Opening a virtual dataset with groups
  4. Opening a virtual dataset with preprocessing

cc @betolink @TomNicholas @danielfromearth

@betolink
Member

Thanks for opening this @ayushnag!

@danielfromearth
Collaborator

danielfromearth commented Dec 17, 2024

I just tried running this new functionality in an Openscapes Jupyterhub instance. I used !pip install . --upgrade to install earthaccess directly from the main branch (after running git pull).

A ModuleNotFoundError: No module named 'virtualizarr' error was raised when executing the earthaccess.open_virtual_mfdataset() function. Does virtualizarr need to be added to the pyproject.toml?

@danielfromearth
Collaborator

Oh! I see now that it is declared in the optional-dependencies section of the pyproject.toml. So, I should be able to get past that error now. Do we want to move it into the main dependencies though?

@ayushnag
Collaborator Author

I think doing pip install "earthaccess[virtualizarr]" will work once this code is in the release version. For now I believe the method is pip install 'git+https://github.com/nsidc/earthaccess.git@main#egg=earthaccess[virtualizarr]'

@chuckwondo
Collaborator

You should be able to do pip install .[virtualizarr]

@danielfromearth
Collaborator

danielfromearth commented Dec 17, 2024

After installing via pip install .[virtualizarr], I get a couple of messages. Thoughts on the following?

1. A numpy version warning upon import of xarray (though perhaps this is simply a result of pip installing earthaccess into the base environment for Openscapes' Jupyterhub?):

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

2. A ValueError during an HDF import statement in virtualizarr:

granules = earthaccess.search_data(
    short_name="TEMPO_NO2_L2",
    version="V03",
    count=3
)

result = earthaccess.open_virtual_mfdataset(
    granules=granules, 
    access="indirect", 
    concat_dim="time",  
    parallel=False, 
    preprocess=None
)

...raises the following (note: I truncated the beginning of the traceback, which starts with File ~/earthaccess/earthaccess/dmrpp_zarr.py:91, in open_virtual_mfdataset(granules, group, access, load, preprocess, parallel, **xr_combine_nested_kwargs))...

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/readers/hdf/hdf.py:20
     14 from virtualizarr.manifests.manifest import validate_and_normalize_path_to_uri
     15 from virtualizarr.readers.common import (
     16     VirtualBackend,
     17     construct_virtual_dataset,
     18     open_loadable_vars_and_indexes,
     19 )
---> 20 from virtualizarr.readers.hdf.filters import cfcodec_from_dataset, codecs_from_dataset
     21 from virtualizarr.types import ChunkKey
     22 from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions, soft_import

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/readers/hdf/filters.py:16
     13     import h5py  # type: ignore
     14     from h5py import Dataset, Group  # type: ignore
---> 16 h5py = soft_import("h5py", "For reading hdf files", strict=False)
     17 if h5py:
     18     Dataset = h5py.Dataset

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/utils.py:94, in soft_import(name, reason, strict)
     92 def soft_import(name: str, reason: str, strict: Optional[bool] = True):
     93     try:
---> 94         return importlib.import_module(name)
     95     except (ImportError, ModuleNotFoundError):
     96         if strict:

File /srv/conda/envs/notebook/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/h5py/__init__.py:25
     19 # --- Library setup -----------------------------------------------------------
     20 
     21 # When importing from the root of the unpacked tarball or git checkout,
     22 # Python sees the "h5py" source directory and tries to load it, which fails.
     23 # We tried working around this by using "package_dir" but that breaks Cython.
     24 try:
---> 25     from . import _errors
     26 except ImportError:
     27     import os.path as _op

File h5py/_errors.pyx:1, in init h5py._errors()

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

@chuckwondo
Collaborator

> though perhaps this is simply a result of pip installing earthaccess into the base environment for Openscapes' Jupyterhub?

Definitely could be problematic, so try creating a new env instead, and see if you get the same problems.

@TomNicholas

@danielfromearth (1) and (2) are related: VirtualiZarr has a hard requirement on numpy 2.0.0 or later because it makes heavy use of the new variable-length string dtype internally. I think something in your environment is not compiled against numpy 2.

@battistowx
Collaborator

battistowx commented Dec 19, 2024

@danielfromearth I'm currently using hatch and the pyproject.toml to experiment on my local desktop. I would avoid using the 2i2c Jupyterhub for now, since it has dependency locks.

@danielfromearth
Collaborator

danielfromearth commented Dec 20, 2024

@battistowx, I've also tried it locally without issue, but part of my goal with testing was to try it in-region in AWS us-west-2. I'm still not sure how to modify the environment in the 2i2c Jupyterhub, and if there are dependency locks, is there a way around them for testing purposes?

@betolink
Member

@danielfromearth great timing!! I was testing the latest image for the hub. Try restarting your instance in the admin console and use openscapes/python:07980b9. We still need to reinstall earthaccess from source because the virtualizarr work has not been released. We should release it today or Monday!

@danielfromearth
Collaborator

danielfromearth commented Dec 20, 2024

Okay, just tried this in an instance of openscapes/python:07980b9. When running this same code as in my previous comment:

result = earthaccess.open_virtual_mfdataset(
    granules=results, 
    access="indirect", 
    concat_dim="time",  
    parallel=False, 
    preprocess=None
)

The following new error is being raised by xarray:

TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>

This is out of my depth. Is it missing some optional dependency, something that provides a "Chunk Manager" that will know how to handle ManifestArrays?

@ayushnag
Collaborator Author

@danielfromearth You also need to pass the arguments coords='minimal', compat='override' to open_virtual_mfdataset, to avoid xarray trying to load indexes on this dataset. Note that combining references will become much easier once virtualizarr natively supports open_virtual_mfdataset: the current earthaccess.open_virtual_mfdataset implements logic that will eventually move upstream into virtualizarr, and earthaccess can then simply call that function.
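Putting the suggestion together with the call from the earlier comment, the combine settings can be collected once and reused (the same compat='override' also comes up again later in the thread for xr.merge). The call itself is shown commented out, since it needs Earthdata credentials and network access; the surrounding arguments just mirror the snippet above:

```python
# The combine settings recommended above, collected in one dict so the
# same values can be reused wherever xarray combines these datasets.
combine_kwargs = dict(coords="minimal", compat="override")

# Sketch of the corrected call (not executed here: requires Earthdata
# Login and network access). Other arguments mirror the call above.
# result = earthaccess.open_virtual_mfdataset(
#     granules=granules,
#     access="indirect",
#     concat_dim="time",
#     parallel=False,
#     preprocess=None,
#     **combine_kwargs,
# )
```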

@danielfromearth

This comment was marked as outdated.

@danielfromearth
Collaborator

Please disregard my previous (now-hidden) comment! I discovered a typo that prevented the code from working (I was opening the same group twice and then trying to merge it with itself).

@TomNicholas

TomNicholas commented Dec 20, 2024

Sorry @danielfromearth - VirtualiZarr's xarray-at-the-top design means that obscure errors are sometimes thrown from deep inside xarray, which VirtualiZarr cannot intercept and re-raise with clearer messages. I have issues open to track ways to make them clearer, but changing xarray is a more involved process than changing VirtualiZarr. (xref zarr-developers/VirtualiZarr#114 and pydata/xarray#8778)

For xr.merge you may also need to pass the same compat='override' again.

I've tried to document this here but if you think any of this could be clearer in the VirtualiZarr docs please raise an issue there :)

@betolink
Member

Before opening an issue in VirtualiZarr: I was trying to open 1000 MUR granules and ran into this:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:369, in IndexCallable.__getitem__(self, key)
    368 def __getitem__(self, key: Any) -> Any:
--> 369     return self.getter(key)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1508, in NumpyIndexingAdapter._oindex_get(self, indexer)
   1506 def _oindex_get(self, indexer: OuterIndexer):
   1507     key = _outer_to_numpy_indexer(indexer, self.array.shape)
-> 1508     return self.array[key]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:214, in ManifestArray.__getitem__(self, key)
    212     return self
    213 else:
--> 214     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

Does it ring a bell? It worked for a few granules 🤔

@ayushnag
Collaborator Author

@betolink Could you share the code you used to get the error?

@TomNicholas

NotImplementedError(f"Doesn't support slicing with {indexer}")

Could be zarr-developers/VirtualiZarr#51, but I would need more context.

Also please raise new issues on VirtualiZarr!! That way other people will see them and have the opportunity to jump in and help.

@betolink
Member

Opened zarr-developers/VirtualiZarr#360

@battistowx
Collaborator

I've run into a few interesting errors as well when certain dependencies are not installed. I'll document those in VirtualiZarr as well! Also, I'd love to be part of this tutorial project and help where I can!

@ayushnag
Collaborator Author

@battistowx What specific errors have you run into? Perhaps those stem from missing earthaccess dependencies, not virtualizarr, and we need to add them to the earthaccess[virtualizarr] optional dependency group. One I know of for sure is zarr: since we have the load=True param, we need zarr for that code path.

@battistowx
Collaborator

@ayushnag Yes, adding zarr was an easy fix; that error appeared when importing earthaccess. I was also getting TypeError: Union[arg, ...]: each arg must be a type. in _extract_attrs when h5py was not installed, and a KeyError when using the access='direct' argument outside of us-west-2. These only require minor fixes, such as clearer exception messages and additions to the earthaccess dependency group.
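One way to get those clearer exception messages is to guard the optional-dependency code paths up front instead of letting a bare ModuleNotFoundError (or the Union TypeError above) surface. The require_optional helper below is hypothetical, not existing earthaccess API; it is a sketch using only the standard library:

```python
# Sketch of an optional-dependency guard giving an actionable error
# message. require_optional is a hypothetical helper, not earthaccess API.
import importlib.util


def require_optional(module: str, reason: str) -> None:
    """Raise a descriptive ImportError when an optional dependency is missing."""
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module!r} is required {reason}. Install the optional extras "
            f'with: pip install "earthaccess[virtualizarr]"'
        )


# Example guards for the code paths discussed in this thread:
# require_optional("zarr", "when load=True")
# require_optional("h5py", "for reading HDF files")
```

Checking importlib.util.find_spec avoids importing the module just to see whether it exists.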

@ayushnag
Collaborator Author

@battistowx We can fix most of the dependency problems with an added integration test that checks that load=True works. The access='direct' one is challenging, since the only way to catch it is to know whether the user is in us-west-2. However, this is an ongoing issue, since we don't have a clear way of determining whether a user is "in-region".
