Append to icechunk stores #272

Merged 101 commits on Dec 5, 2024

Commits
d3a4048
Initial attempt at appending
abarciauskas-bgse Oct 25, 2024
5d5f9e2
Working on tests for generate chunk key function
abarciauskas-bgse Oct 25, 2024
360ea14
Linting
abarciauskas-bgse Oct 26, 2024
d3c2851
Refactor gen virtual dataset method
abarciauskas-bgse Oct 26, 2024
a7a1e50
Fix spelling
abarciauskas-bgse Oct 27, 2024
0365a45
Linting
abarciauskas-bgse Oct 28, 2024
5846d7e
Linting
abarciauskas-bgse Oct 28, 2024
66bbd6e
Linting
abarciauskas-bgse Oct 30, 2024
000c68f
Passing compression test
abarciauskas-bgse Nov 1, 2024
3131167
Merge branch 'main' into icechunk-append
TomNicholas Nov 5, 2024
5906687
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 5, 2024
80e4dcb
merge main and linting
abarciauskas-bgse Nov 10, 2024
672e5e1
linting
abarciauskas-bgse Nov 10, 2024
0fce71f
Fix test failing due to incorrect dtype
abarciauskas-bgse Nov 10, 2024
41f80f8
Fix conflict in conftest
abarciauskas-bgse Nov 10, 2024
e60437f
linting
abarciauskas-bgse Nov 10, 2024
de2f135
Linting
abarciauskas-bgse Nov 10, 2024
c8a46a6
Remove obsolete test file for appending
abarciauskas-bgse Nov 10, 2024
7663ad7
Create netcdf4 files factor in conftest
abarciauskas-bgse Nov 11, 2024
1d704ff
Linting
abarciauskas-bgse Nov 11, 2024
af5f57d
Refactor to use combineable zarr arrays
abarciauskas-bgse Nov 12, 2024
6f4cfd9
linting
abarciauskas-bgse Nov 12, 2024
98c7052
Implement no append dim test
abarciauskas-bgse Nov 12, 2024
f36adf2
Add test for when append dim is not in dims
abarciauskas-bgse Nov 12, 2024
e922ccd
Fix mypy errors
abarciauskas-bgse Nov 15, 2024
ca80cb2
type ignore import untyped zarr
abarciauskas-bgse Nov 15, 2024
f186d4f
Use Union type for check_combineable_zarr_arrays arg
abarciauskas-bgse Nov 15, 2024
2a90d9c
Fix import
abarciauskas-bgse Nov 15, 2024
0253a8a
Fix imports for get_codecs
abarciauskas-bgse Nov 15, 2024
7369fcf
use new factory in test
abarciauskas-bgse Nov 15, 2024
2949493
Remove need for dask in fixture
abarciauskas-bgse Nov 15, 2024
c305dad
Fix for when zarr is not installed
abarciauskas-bgse Nov 15, 2024
aaede73
Address test failures
abarciauskas-bgse Nov 16, 2024
5d685c6
Add get_codecs file
abarciauskas-bgse Nov 16, 2024
d704de2
Add dask to upstream
abarciauskas-bgse Nov 16, 2024
29ca87c
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 16, 2024
d84c58c
Remove dependency on dask and h5netcdf engine
abarciauskas-bgse Nov 17, 2024
f19e3d1
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Nov 17, 2024
2505d9e
Remove obsolete comment
abarciauskas-bgse Nov 18, 2024
aaa7f01
Remove duplicate zarr array type check
abarciauskas-bgse Nov 18, 2024
5071ed7
Move codecs module and type output
abarciauskas-bgse Nov 19, 2024
7067b43
Actually add codecs file
abarciauskas-bgse Nov 19, 2024
f867e14
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 19, 2024
0ec5084
Fix merge mistake
abarciauskas-bgse Nov 19, 2024
5630b34
Ignore import untyped
abarciauskas-bgse Nov 19, 2024
2fd7fea
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 20, 2024
d93f2ce
Add tests for codecs
abarciauskas-bgse Nov 20, 2024
b21daca
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Nov 20, 2024
a6c2ccb
Resolve mypy errors
abarciauskas-bgse Nov 20, 2024
98a676d
Fix test
abarciauskas-bgse Nov 20, 2024
db3313f
Import zarr in function
abarciauskas-bgse Nov 20, 2024
145ed0e
Use existing importorskip function
abarciauskas-bgse Nov 20, 2024
80cd358
Modify comments
abarciauskas-bgse Nov 20, 2024
3035f05
Comment updates and spelling of combinable
abarciauskas-bgse Nov 20, 2024
28e05db
Revert change to check compatible encoding
abarciauskas-bgse Nov 20, 2024
2c07cdf
Ignore zarr untyped import errors
abarciauskas-bgse Nov 20, 2024
19837b2
Implement a manifest.utils module
abarciauskas-bgse Nov 21, 2024
532ff38
pass the array into resize_array
abarciauskas-bgse Nov 21, 2024
24f7274
Refactor resize_array
abarciauskas-bgse Nov 21, 2024
113cd2c
Remove unnecessary zarr imports
abarciauskas-bgse Nov 21, 2024
61ce01a
Add pinned version of icechunk as an optional dependency
abarciauskas-bgse Nov 21, 2024
defe7d9
Add append_dim in docstring
abarciauskas-bgse Nov 22, 2024
cb82d40
Kludgy solution to v2 v3 codecs difference
abarciauskas-bgse Nov 22, 2024
2f6cbc2
Add normalize to v3 parameter
abarciauskas-bgse Nov 22, 2024
a442fa4
Add more info to docstring
abarciauskas-bgse Nov 22, 2024
7d1bb36
Fix typing issues
abarciauskas-bgse Nov 22, 2024
39677e8
Add decorator for zarr python v3 test
abarciauskas-bgse Nov 22, 2024
5fa7177
Fix mypy and ruff errors
abarciauskas-bgse Nov 22, 2024
e109c0d
Only append if append_dim in dims
abarciauskas-bgse Nov 23, 2024
eb0e8f2
Add example notebook
abarciauskas-bgse Nov 25, 2024
fd2df4e
Add a runtime
abarciauskas-bgse Nov 25, 2024
1659d21
Add failing test
abarciauskas-bgse Nov 25, 2024
f5976d1
Fix multiple appends
abarciauskas-bgse Nov 25, 2024
f903291
Fix test error message
abarciauskas-bgse Nov 25, 2024
c109626
Add new cell to notebook to display original time chunk
abarciauskas-bgse Nov 26, 2024
dd9c381
Upgrade icechunk to 1.0.0a5
abarciauskas-bgse Nov 26, 2024
7c1fcfa
Upgrade icechunk in upstream.yml
abarciauskas-bgse Nov 26, 2024
1bb2ad0
Updated notebook with kechunk comment an upgraded icechunk version
abarciauskas-bgse Nov 26, 2024
64f2478
Modify test so it fails without updated icechunk
abarciauskas-bgse Dec 4, 2024
13de8d3
Update icechunk dependency
abarciauskas-bgse Dec 4, 2024
4a65e5a
Fix mypy errors
abarciauskas-bgse Dec 4, 2024
64e5277
update icechunk version in pyproject
abarciauskas-bgse Dec 4, 2024
d6ef97f
Merge branch 'main' into icechunk-append
abarciauskas-bgse Dec 4, 2024
9d2f7f8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2024
53045c3
Remove obsolete comment
abarciauskas-bgse Dec 4, 2024
b21a0e5
Use icechunk 0.1.0a7
abarciauskas-bgse Dec 5, 2024
208e83e
Updated notebook
abarciauskas-bgse Dec 5, 2024
79e0c1b
Updated notebook
abarciauskas-bgse Dec 5, 2024
e38823c
print store
abarciauskas-bgse Dec 5, 2024
ad17b83
Update notebook (#327)
mpiannucci Dec 5, 2024
8b9a830
Add append to examples
abarciauskas-bgse Dec 5, 2024
3f9f58c
Add to releases.rst
abarciauskas-bgse Dec 5, 2024
8496359
Revert change to .gitignore
abarciauskas-bgse Dec 5, 2024
7dc9186
Merge branch 'main' into icechunk-append
TomNicholas Dec 5, 2024
491b701
Update ci/upstream.yml
abarciauskas-bgse Dec 5, 2024
94ef469
Update pyproject.toml
abarciauskas-bgse Dec 5, 2024
fad188b
Update virtualizarr/tests/test_writers/test_icechunk.py
abarciauskas-bgse Dec 5, 2024
8df67e9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2024
258c92f
Update virtualizarr/accessor.py
abarciauskas-bgse Dec 5, 2024
84a4d01
Separate out multiple arrays test
abarciauskas-bgse Dec 5, 2024
299f580
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Dec 5, 2024
24 changes: 24 additions & 0 deletions conftest.py
@@ -35,6 +35,30 @@ def netcdf4_file(tmpdir):
    return filepath


@pytest.fixture
def compressed_netcdf4_files(tmpdir):
    # without chunks={} we get a compression error: zlib.error: Error -3 while decompressing data: incorrect header check
    ds = xr.tutorial.open_dataset("air_temperature", chunks={})
    ds1 = ds.isel(time=slice(None, 1460))
    ds2 = ds.isel(time=slice(1460, None))
    # Define compression options for NetCDF
    encoding = {
        # without encoding the chunksizes, irregular ones are chosen
        var: dict(zlib=True, complevel=4, chunksizes=(1460, 25, 53))
        for var in ds.data_vars
    }

    # Save it to disk as netCDF (in temporary directory)
    filepath1 = f"{tmpdir}/air1_compressed.nc"
    filepath2 = f"{tmpdir}/air2_compressed.nc"
    ds1.to_netcdf(filepath1, encoding=encoding, engine="h5netcdf")
    ds2.to_netcdf(filepath2, encoding=encoding, engine="h5netcdf")
    ds1.close()
    ds2.close()

    return filepath1, filepath2


@pytest.fixture
def netcdf4_file_with_2d_coords(tmpdir):
    ds = xr.tutorial.open_dataset("ROMS_example")
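
Usage note (not part of the PR diff): the new compressed_netcdf4_files fixture above splits the "air_temperature" tutorial dataset at time index 1460, so the two files concatenate along "time" back into the original dataset. A hypothetical consumer test, sketched here only to illustrate what the fixture provides:

import xarray as xr


def test_compressed_files_cover_full_time_range(compressed_netcdf4_files):
    # Hypothetical test name; unpack the fixture's two file paths.
    filepath1, filepath2 = compressed_netcdf4_files
    ds1 = xr.open_dataset(filepath1)
    ds2 = xr.open_dataset(filepath2)
    # Concatenating the two halves along "time" should reproduce the tutorial dataset's values.
    combined = xr.concat([ds1, ds2], dim="time")
    expected = xr.tutorial.open_dataset("air_temperature")
    xr.testing.assert_allclose(combined, expected)
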
13 changes: 5 additions & 8 deletions virtualizarr/accessor.py
@@ -1,10 +1,5 @@
from pathlib import Path
-from typing import (
-    TYPE_CHECKING,
-    Callable,
-    Literal,
-    overload,
-)
+from typing import TYPE_CHECKING, Callable, Literal, Optional, overload

from xarray import Dataset, register_dataset_accessor

@@ -43,7 +38,9 @@ def to_zarr(self, storepath: str) -> None:
        """
        dataset_to_zarr(self.ds, storepath)

-    def to_icechunk(self, store: "IcechunkStore") -> None:
+    def to_icechunk(
+        self, store: "IcechunkStore", append_dim: Optional[str] = None
+    ) -> None:
        """
        Write an xarray dataset to an Icechunk store.

@@ -55,7 +52,7 @@ def to_icechunk(self, store: "IcechunkStore") -> None:
        """
        from virtualizarr.writers.icechunk import dataset_to_icechunk

-        dataset_to_icechunk(self.ds, store)
+        dataset_to_icechunk(self.ds, store, append_dim=append_dim)

    @overload
    def to_kerchunk(
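
To make the new append_dim argument on .virtualize.to_icechunk concrete, here is a minimal usage sketch (not taken from the PR). It assumes the icechunk 0.1.0-alpha API that this PR pins (StorageConfig.memory, IcechunkStore.create, store.commit), which may differ in later icechunk releases, and the file paths are placeholders.

from icechunk import IcechunkStore, StorageConfig  # assumed 0.1.0-alpha API

from virtualizarr import open_virtual_dataset

# Open two files as virtual datasets of byte-range references (placeholder paths)
vds1 = open_virtual_dataset("air1.nc", indexes={})
vds2 = open_virtual_dataset("air2.nc", indexes={})

# Write the first set of virtual references to a new in-memory Icechunk store
store = IcechunkStore.create(storage=StorageConfig.memory("air_temperature"))
vds1.virtualize.to_icechunk(store)
store.commit("write first half")

# Append the second set of references along the existing "time" dimension
vds2.virtualize.to_icechunk(store, append_dim="time")
store.commit("append second half")
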
46 changes: 44 additions & 2 deletions virtualizarr/tests/test_writers/test_icechunk.py
@@ -12,7 +12,7 @@
from zarr import Array, Group, group # type: ignore[import-untyped]

from virtualizarr.manifests import ChunkManifest, ManifestArray
-from virtualizarr.writers.icechunk import dataset_to_icechunk
+from virtualizarr.writers.icechunk import dataset_to_icechunk, generate_chunk_key
from virtualizarr.zarr import ZArray

if TYPE_CHECKING:
@@ -127,7 +127,7 @@ def test_set_single_virtual_ref_with_encoding(
    # vds = open_virtual_dataset(netcdf4_file, indexes={})

    expected_ds = open_dataset(netcdf4_file).drop_vars(["lon", "lat", "time"])
-    # these atyttirbutes encode floats different and I am not sure why, but its not important enough to block everything
+    # these attributes encode floats different and I am not sure why, but its not important enough to block everything
    expected_ds.air.attrs.pop("actual_range")

    # instead for now just write out byte ranges explicitly
@@ -288,3 +288,45 @@ def test_write_loadable_variable(
# TODO test writing to a group that isn't the root group

# TODO test with S3 / minio


def test_generate_chunk_key_no_offset():
    # Test case without any offset (append_axis and existing_num_chunks are None)
    index = (1, 2, 3)
    result = generate_chunk_key(index)
    assert result == "1/2/3", "The chunk key should match the index without any offset."


def test_generate_chunk_key_with_offset():
    # Test case with offset on append_axis 1
    index = (1, 2, 3)
    append_axis = 1
    existing_num_chunks = 5
    result = generate_chunk_key(
        index, append_axis=append_axis, existing_num_chunks=existing_num_chunks
    )
    assert result == "1/7/3", "The chunk key should offset the second index by 5."


def test_generate_chunk_key_zero_offset():
    # Test case where existing_num_chunks is 0 (no offset should be applied)
    index = (4, 5, 6)
    append_axis = 1
    existing_num_chunks = 0
    result = generate_chunk_key(
        index, append_axis=append_axis, existing_num_chunks=existing_num_chunks
    )
    assert (
        result == "4/5/6"
    ), "No offset should be applied when existing_num_chunks is 0."


def test_generate_chunk_key_append_axis_out_of_bounds():
    # Edge case where append_axis is out of bounds
    index = (3, 4)
    append_axis = 2  # This is out of bounds for a 2D index
    with pytest.raises(IndexError):
        generate_chunk_key(index, append_axis=append_axis, existing_num_chunks=1)


# Run tests using pytest
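
The tests above pin down the behaviour of generate_chunk_key, whose actual implementation lives in virtualizarr/writers/icechunk.py and is not shown in this excerpt. A hypothetical sketch of the logic the tests imply (offset the chunk index along the append axis by the number of chunks already present, then join with "/"):

from typing import Optional


def generate_chunk_key_sketch(
    index: tuple[int, ...],
    append_axis: Optional[int] = None,
    existing_num_chunks: Optional[int] = None,
) -> str:
    # Hypothetical re-implementation, consistent only with the tests above.
    if append_axis is not None and append_axis >= len(index):
        raise IndexError(
            f"append_axis {append_axis} is out of bounds for index {index}"
        )
    return "/".join(
        str(ix + existing_num_chunks) if axis == append_axis else str(ix)
        for axis, ix in enumerate(index)
    )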