Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More indexes fixes #358

Merged
merged 8 commits into from
Dec 18, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/releases.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ Breaking changes
- Indexes are now created by default for any loadable one-dimensional coordinate variables.
Also a warning is no longer thrown when ``indexes=None`` is passed to ``open_virtual_dataset``, and the recommendations in the docs updated to match.
This also means that ``xarray.combine_by_coords`` will now work when the necessary dimension coordinates are specified in ``loadable_variables``.
(:issue:`18`, :pull:`357`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
(:issue:`18`, :pull:`357`, :pull:`358`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
Expand Down
8 changes: 3 additions & 5 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -205,7 +205,7 @@ But before we combine our data, we might want to consider loading some variables
Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'])
```
```python
<xarray.Dataset> Size: 31MB
Expand All @@ -228,8 +228,8 @@ You can see that the dataset contains a mixture of virtual variables backed by `
Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
2. You want in-memory indexes to use with ``xr.combine_by_coords``,
3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
3. Storing a variable on-disk as a set of references would be inefficient, e.g. because each chunk is very small (saving the values like this is similar to kerchunk's concept of "inlining" data),
4. The variable has complicated encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.

### CF-encoded time variables

Expand All @@ -240,7 +240,6 @@ vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
indexes={},
)
```
```python
Expand Down Expand Up @@ -341,7 +340,6 @@ In future we would like for it to be possible to just use `xr.open_mfdataset` to
concat_dim=['time'],
coords='minimal',
compat='override',
indexes={},
)

but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
Expand Down
83 changes: 42 additions & 41 deletions virtualizarr/readers/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@
from virtualizarr.utils import _FsspecFSFromFilepath


def open_loadable_vars_and_indexes(
def maybe_open_loadable_vars_and_indexes(
filepath: str,
loadable_variables,
reader_options,
Expand All @@ -36,50 +36,51 @@ def open_loadable_vars_and_indexes(
Relies on xr.open_dataset and its auto-detection of filetypes to find the correct installed backend.
"""

# TODO get rid of this if?
if indexes is None or len(loadable_variables) > 0:
# TODO we are reading a bunch of stuff we know we won't need here, e.g. all of the data variables...
# TODO it would also be nice if we could somehow consolidate this with the reading of the kerchunk references
# TODO really we probably want a dedicated xarray backend that iterates over all variables only once
fpath = _FsspecFSFromFilepath(
filepath=filepath, reader_options=reader_options
).open_file()

# fpath can be `Any` thanks to fsspec.filesystem(...).open() returning Any.
ds = open_dataset(
fpath, # type: ignore[arg-type]
drop_variables=drop_variables,
group=group,
decode_times=decode_times,
)

if indexes is None:
# add default indexes by reading data from file
# but avoid creating an in-memory index for virtual variables by default
indexes = {
name: index
for name, index in ds.xindexes.items()
if name in loadable_variables
}
elif indexes != {}:
# TODO allow manual specification of index objects
raise NotImplementedError()
else:
indexes = dict(**indexes) # for type hinting: to allow mutation
if loadable_variables == [] and indexes == {}:
# no need to even attempt to open the file using xarray
return {}, {}

# TODO We are reading a bunch of stuff we know we won't need here, e.g. all of the data variables...
# TODO It would also be nice if we could somehow consolidate this with the reading of the kerchunk references
# TODO Really we probably want a dedicated backend that iterates over all variables only once
# TODO See issue #124 for a suggestion of how to avoid calling xarray here.

fpath = _FsspecFSFromFilepath(
filepath=filepath, reader_options=reader_options
).open_file()

# fpath can be `Any` thanks to fsspec.filesystem(...).open() returning Any.
ds = open_dataset(
fpath, # type: ignore[arg-type]
drop_variables=drop_variables,
group=group,
decode_times=decode_times,
)

# TODO we should drop these earlier by using drop_variables
loadable_vars = {
str(name): var
for name, var in ds.variables.items()
if indexes is None:
# add default indexes by reading data from file
# but avoid creating an in-memory index for virtual variables by default
indexes = {
name: index
for name, index in ds.xindexes.items()
if name in loadable_variables
}

# if we only read the indexes we can just close the file right away as nothing is lazy
if loadable_vars == {}:
ds.close()
elif indexes != {}:
# TODO allow manual specification of index objects
raise NotImplementedError()
else:
loadable_vars = {}
indexes = {}
indexes = dict(**indexes) # for type hinting: to allow mutation

# TODO we should drop these earlier by using drop_variables
loadable_vars = {
str(name): var
for name, var in ds.variables.items()
if name in loadable_variables
}

# if we only read the indexes we can just close the file right away as nothing is lazy
if loadable_vars == {}:
ds.close()

return loadable_vars, indexes

Expand Down
23 changes: 10 additions & 13 deletions virtualizarr/readers/fits.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,11 @@
from pathlib import Path
from typing import Iterable, Mapping, Optional

from xarray import Dataset, Index
from xarray import Dataset, Index, Variable

from virtualizarr.readers.common import (
VirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
)
from virtualizarr.translators.kerchunk import (
extract_group,
Expand Down Expand Up @@ -41,24 +40,22 @@ def open_virtual_dataset(
if group:
refs = extract_group(refs, group)

# TODO This wouldn't work until either you had an xarray backend for FITS installed, or issue #124 is implemented to load data from ManifestArrays directly
# TODO Once we have one of those we can use ``maybe_open_loadable_vars_and_indexes`` here
if loadable_variables != [] or indexes != {} or decode_times:
raise NotImplementedError(
"Cannot load variables or indexes from FITS files as there is no xarray backend engine for FITS"
)
loadable_vars: dict[str, Variable] = {}
indexes = {}

virtual_vars, attrs, coord_names = virtual_vars_and_metadata_from_kerchunk_refs(
refs,
loadable_variables,
drop_variables,
fs_root=Path.cwd().as_uri(),
)

# TODO this wouldn't work until you had an xarray backend for FITS installed
loadable_vars, indexes = open_loadable_vars_and_indexes(
filepath,
loadable_variables=loadable_variables,
reader_options=reader_options,
drop_variables=drop_variables,
indexes=indexes,
group=group,
decode_times=decode_times,
)

return construct_virtual_dataset(
virtual_vars=virtual_vars,
loadable_vars=loadable_vars,
Expand Down
4 changes: 2 additions & 2 deletions virtualizarr/readers/hdf/__init__.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
from .hdf import (
HDFVirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
)

__all__ = [
"HDFVirtualBackend",
"construct_virtual_dataset",
"open_loadable_vars_and_indexes",
"maybe_open_loadable_vars_and_indexes",
]
4 changes: 2 additions & 2 deletions virtualizarr/readers/hdf/hdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
from virtualizarr.readers.common import (
VirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
)
from virtualizarr.readers.hdf.filters import cfcodec_from_dataset, codecs_from_dataset
from virtualizarr.types import ChunkKey
Expand Down Expand Up @@ -68,7 +68,7 @@ def open_virtual_dataset(
reader_options=reader_options,
)

loadable_vars, indexes = open_loadable_vars_and_indexes(
loadable_vars, indexes = maybe_open_loadable_vars_and_indexes(
filepath,
loadable_variables=loadable_variables,
reader_options=reader_options,
Expand Down
4 changes: 2 additions & 2 deletions virtualizarr/readers/hdf5.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
from virtualizarr.readers.common import (
VirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
)
from virtualizarr.translators.kerchunk import (
extract_group,
Expand Down Expand Up @@ -54,7 +54,7 @@ def open_virtual_dataset(
fs_root=Path.cwd().as_uri(),
)

loadable_vars, indexes = open_loadable_vars_and_indexes(
loadable_vars, indexes = maybe_open_loadable_vars_and_indexes(
filepath,
loadable_variables=loadable_variables,
reader_options=reader_options,
Expand Down
4 changes: 2 additions & 2 deletions virtualizarr/readers/netcdf3.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
from virtualizarr.readers.common import (
VirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
)
from virtualizarr.translators.kerchunk import (
virtual_vars_and_metadata_from_kerchunk_refs,
Expand Down Expand Up @@ -53,7 +53,7 @@ def open_virtual_dataset(
fs_root=Path.cwd().as_uri(),
)

loadable_vars, indexes = open_loadable_vars_and_indexes(
loadable_vars, indexes = maybe_open_loadable_vars_and_indexes(
filepath,
loadable_variables=loadable_variables,
reader_options=reader_options,
Expand Down
4 changes: 2 additions & 2 deletions virtualizarr/readers/tiff.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
from virtualizarr.readers.common import (
VirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
)
from virtualizarr.translators.kerchunk import (
extract_group,
Expand Down Expand Up @@ -63,7 +63,7 @@ def open_virtual_dataset(
fs_root=Path.cwd().as_uri(),
)

loadable_vars, indexes = open_loadable_vars_and_indexes(
loadable_vars, indexes = maybe_open_loadable_vars_and_indexes(
filepath,
loadable_variables=loadable_variables,
reader_options=reader_options,
Expand Down
6 changes: 3 additions & 3 deletions virtualizarr/tests/test_readers/test_hdf/test_hdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -168,13 +168,13 @@ def test_non_group_error(self, group_hdf5_file):
@requires_imagecodecs
class TestOpenVirtualDataset:
@patch("virtualizarr.readers.hdf.hdf.construct_virtual_dataset")
@patch("virtualizarr.readers.hdf.hdf.open_loadable_vars_and_indexes")
@patch("virtualizarr.readers.hdf.hdf.maybe_open_loadable_vars_and_indexes")
def test_coord_names(
self,
open_loadable_vars_and_indexes,
maybe_open_loadable_vars_and_indexes,
construct_virtual_dataset,
root_coordinates_hdf5_file,
):
open_loadable_vars_and_indexes.return_value = (0, 0)
maybe_open_loadable_vars_and_indexes.return_value = (0, 0)
HDFVirtualBackend.open_virtual_dataset(root_coordinates_hdf5_file)
assert construct_virtual_dataset.call_args[1]["coord_names"] == ["lat", "lon"]
Loading