Append to icechunk stores #272

Merged
merged 101 commits into from
Dec 5, 2024
Changes from 1 commit
Commits
d3a4048
Initial attempt at appending
abarciauskas-bgse Oct 25, 2024
5d5f9e2
Working on tests for generate chunk key function
abarciauskas-bgse Oct 25, 2024
360ea14
Linting
abarciauskas-bgse Oct 26, 2024
d3c2851
Refactor gen virtual dataset method
abarciauskas-bgse Oct 26, 2024
a7a1e50
Fix spelling
abarciauskas-bgse Oct 27, 2024
0365a45
Linting
abarciauskas-bgse Oct 28, 2024
5846d7e
Linting
abarciauskas-bgse Oct 28, 2024
66bbd6e
Linting
abarciauskas-bgse Oct 30, 2024
000c68f
Passing compression test
abarciauskas-bgse Nov 1, 2024
3131167
Merge branch 'main' into icechunk-append
TomNicholas Nov 5, 2024
5906687
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 5, 2024
80e4dcb
merge main and linting
abarciauskas-bgse Nov 10, 2024
672e5e1
linting
abarciauskas-bgse Nov 10, 2024
0fce71f
Fix test failing due to incorrect dtype
abarciauskas-bgse Nov 10, 2024
41f80f8
Fix conflict in conftest
abarciauskas-bgse Nov 10, 2024
e60437f
linting
abarciauskas-bgse Nov 10, 2024
de2f135
Linting
abarciauskas-bgse Nov 10, 2024
c8a46a6
Remove obsolete test file for appending
abarciauskas-bgse Nov 10, 2024
7663ad7
Create netcdf4 files factor in conftest
abarciauskas-bgse Nov 11, 2024
1d704ff
Linting
abarciauskas-bgse Nov 11, 2024
af5f57d
Refactor to use combineable zarr arrays
abarciauskas-bgse Nov 12, 2024
6f4cfd9
linting
abarciauskas-bgse Nov 12, 2024
98c7052
Implement no append dim test
abarciauskas-bgse Nov 12, 2024
f36adf2
Add test for when append dim is not in dims
abarciauskas-bgse Nov 12, 2024
e922ccd
Fix mypy errors
abarciauskas-bgse Nov 15, 2024
ca80cb2
type ignore import untyped zarr
abarciauskas-bgse Nov 15, 2024
f186d4f
Use Union type for check_combineable_zarr_arrays arg
abarciauskas-bgse Nov 15, 2024
2a90d9c
Fix import
abarciauskas-bgse Nov 15, 2024
0253a8a
Fix imports for get_codecs
abarciauskas-bgse Nov 15, 2024
7369fcf
use new factory in test
abarciauskas-bgse Nov 15, 2024
2949493
Remove need for dask in fixture
abarciauskas-bgse Nov 15, 2024
c305dad
Fix for when zarr is not installed
abarciauskas-bgse Nov 15, 2024
aaede73
Address test failures
abarciauskas-bgse Nov 16, 2024
5d685c6
Add get_codecs file
abarciauskas-bgse Nov 16, 2024
d704de2
Add dask to upstream
abarciauskas-bgse Nov 16, 2024
29ca87c
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 16, 2024
d84c58c
Remove dependency on dask and h5netcdf engine
abarciauskas-bgse Nov 17, 2024
f19e3d1
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Nov 17, 2024
2505d9e
Remove obsolete comment
abarciauskas-bgse Nov 18, 2024
aaa7f01
Remove duplicate zarr array type check
abarciauskas-bgse Nov 18, 2024
5071ed7
Move codecs module and type output
abarciauskas-bgse Nov 19, 2024
7067b43
Actually add codecs file
abarciauskas-bgse Nov 19, 2024
f867e14
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 19, 2024
0ec5084
Fix merge mistake
abarciauskas-bgse Nov 19, 2024
5630b34
Ignore import untyped
abarciauskas-bgse Nov 19, 2024
2fd7fea
Merge branch 'main' into icechunk-append
abarciauskas-bgse Nov 20, 2024
d93f2ce
Add tests for codecs
abarciauskas-bgse Nov 20, 2024
b21daca
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Nov 20, 2024
a6c2ccb
Resolve mypy errors
abarciauskas-bgse Nov 20, 2024
98a676d
Fix test
abarciauskas-bgse Nov 20, 2024
db3313f
Import zarr in function
abarciauskas-bgse Nov 20, 2024
145ed0e
Use existing importorskip function
abarciauskas-bgse Nov 20, 2024
80cd358
Modify comments
abarciauskas-bgse Nov 20, 2024
3035f05
Comment updates and spelling of combinable
abarciauskas-bgse Nov 20, 2024
28e05db
Revert change to check compatible encoding
abarciauskas-bgse Nov 20, 2024
2c07cdf
Ignore zarr untyped import errors
abarciauskas-bgse Nov 20, 2024
19837b2
Implement a manifest.utils module
abarciauskas-bgse Nov 21, 2024
532ff38
pass the array into resize_array
abarciauskas-bgse Nov 21, 2024
24f7274
Refactor resize_array
abarciauskas-bgse Nov 21, 2024
113cd2c
Remove unnecessary zarr imports
abarciauskas-bgse Nov 21, 2024
61ce01a
Add pinned version of icechunk as an optional dependency
abarciauskas-bgse Nov 21, 2024
defe7d9
Add append_dim in docstring
abarciauskas-bgse Nov 22, 2024
cb82d40
Kludgy solution to v2 v3 codecs difference
abarciauskas-bgse Nov 22, 2024
2f6cbc2
Add normalize to v3 parameter
abarciauskas-bgse Nov 22, 2024
a442fa4
Add more info to docstring
abarciauskas-bgse Nov 22, 2024
7d1bb36
Fix typing issues
abarciauskas-bgse Nov 22, 2024
39677e8
Add decorator for zarr python v3 test
abarciauskas-bgse Nov 22, 2024
5fa7177
Fix mypy and ruff errors
abarciauskas-bgse Nov 22, 2024
e109c0d
Only append if append_dim in dims
abarciauskas-bgse Nov 23, 2024
eb0e8f2
Add example notebook
abarciauskas-bgse Nov 25, 2024
fd2df4e
Add a runtime
abarciauskas-bgse Nov 25, 2024
1659d21
Add failing test
abarciauskas-bgse Nov 25, 2024
f5976d1
Fix multiple appends
abarciauskas-bgse Nov 25, 2024
f903291
Fix test error message
abarciauskas-bgse Nov 25, 2024
c109626
Add new cell to notebook to display original time chunk
abarciauskas-bgse Nov 26, 2024
dd9c381
Upgrade icechunk to 1.0.0a5
abarciauskas-bgse Nov 26, 2024
7c1fcfa
Upgrade icechunk in upstream.yml
abarciauskas-bgse Nov 26, 2024
1bb2ad0
Updated notebook with kechunk comment an upgraded icechunk version
abarciauskas-bgse Nov 26, 2024
64f2478
Modify test so it fails without updated icechunk
abarciauskas-bgse Dec 4, 2024
13de8d3
Update icechunk dependency
abarciauskas-bgse Dec 4, 2024
4a65e5a
Fix mypy errors
abarciauskas-bgse Dec 4, 2024
64e5277
update icechunk version in pyproject
abarciauskas-bgse Dec 4, 2024
d6ef97f
Merge branch 'main' into icechunk-append
abarciauskas-bgse Dec 4, 2024
9d2f7f8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2024
53045c3
Remove obsolete comment
abarciauskas-bgse Dec 4, 2024
b21a0e5
Use icechunk 0.1.0a7
abarciauskas-bgse Dec 5, 2024
208e83e
Updated notebook
abarciauskas-bgse Dec 5, 2024
79e0c1b
Updated notebook
abarciauskas-bgse Dec 5, 2024
e38823c
print store
abarciauskas-bgse Dec 5, 2024
ad17b83
Update notebook (#327)
mpiannucci Dec 5, 2024
8b9a830
Add append to examples
abarciauskas-bgse Dec 5, 2024
3f9f58c
Add to releases.rst
abarciauskas-bgse Dec 5, 2024
8496359
Revert change to .gitignore
abarciauskas-bgse Dec 5, 2024
7dc9186
Merge branch 'main' into icechunk-append
TomNicholas Dec 5, 2024
491b701
Update ci/upstream.yml
abarciauskas-bgse Dec 5, 2024
94ef469
Update pyproject.toml
abarciauskas-bgse Dec 5, 2024
fad188b
Update virtualizarr/tests/test_writers/test_icechunk.py
abarciauskas-bgse Dec 5, 2024
8df67e9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 5, 2024
258c92f
Update virtualizarr/accessor.py
abarciauskas-bgse Dec 5, 2024
84a4d01
Separate out multiple arrays test
abarciauskas-bgse Dec 5, 2024
299f580
Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…
abarciauskas-bgse Dec 5, 2024
51 changes: 47 additions & 4 deletions virtualizarr/writers/icechunk.py
@@ -1,4 +1,4 @@
-from typing import TYPE_CHECKING, cast
+from typing import TYPE_CHECKING, Optional, cast

 import numpy as np
 from xarray import Dataset
@@ -24,7 +24,9 @@
 }


-def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:
+def dataset_to_icechunk(
+    ds: Dataset, store: "IcechunkStore", append_dim: Optional[str] = None
+) -> None:
     """
     Write an xarray dataset whose variables wrap ManifestArrays to an Icechunk store.

@@ -51,7 +53,10 @@ def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:

     # TODO only supports writing to the root group currently
     # TODO pass zarr_format kwarg?
-    root_group = Group.from_store(store=store)
+    if store.mode.str == "a":
+        root_group = Group.open(store=store, zarr_format=3)
Contributor

Just noting that mode has disappeared from the latest zarr beta and icechunk main branch.

It is replaced with simply read_only as a property on the store. The mode will still exist on the Zarr hierarchy though I believe.

Collaborator Author

@abarciauskas-bgse abarciauskas-bgse Nov 21, 2024

🤔 I think we should leave this as-is because it doesn't look like the read_only property is working as I would expect it to in the latest released version of icechunk (0.1.0a4):

from icechunk import IcechunkStore, StorageConfig
config = StorageConfig.filesystem("local")

store = IcechunkStore.create(storage=config, read_only=False)
print(store.mode)
# AccessMode(str='w', readonly=False, overwrite=True, create=True, update=False)

store = IcechunkStore.open_existing(storage=config)
print(store.mode) 
# AccessMode(str='r', readonly=True, overwrite=False, create=False, update=False)

store = IcechunkStore.open_existing(storage=config, read_only=False) 
print(store.mode)
# AccessMode(str='r', readonly=True, overwrite=False, create=False, update=False)

store = IcechunkStore.open_existing(storage=config, mode="a")
print(store.mode)
# AccessMode(str='a', readonly=False, overwrite=False, create=True, update=True)

store = IcechunkStore.open_existing(storage=config, mode="w")
print(store.mode)
# AccessMode(str='w', readonly=False, overwrite=True, create=True, update=False)

store = IcechunkStore.open_existing(storage=config, update=True)
print(store.mode)
# AccessMode(str='r', readonly=True, overwrite=False, create=False, update=False)

I note from the above that the only way to open the icechunk store in append mode is still to use mode="a".

Admittedly, I'm having trouble understanding how all the AccessMode properties are being set in icechunk.

Collaborator Author

@TomNicholas what do you think about pinning the icechunk version and changing the implementation when there is a new release?

Member

> it doesn't look like the read_only property is working as I would expect

Even if that's not wrong it's certainly highly counter-intuitive - is it reported upstream?

> pinning the icechunk version and changing the implementation when there is a new release?

That sounds reasonable @abarciauskas-bgse - though hopefully by the time you add documentation to this PR this is fixed upstream 🤞

Collaborator Author

@mpiannucci do you think it is worth reporting that read_only=False on open_existing is not working as I would expect in the current version of icechunk (i.e. it still opens in "r" mode)?

I'm also curious: if there is only the boolean read_only option, how will the other access mode properties be handled, specifically update and overwrite (will these both just be True whenever read_only is False)?

Member

> do you think it is worth reporting

I can't see any existing issue that mentions read_only, so I just raised one: earth-mover/icechunk#404

In general I feel raising duplicate issues is better than not flagging potentially-undiscovered bugs.

Contributor

@mpiannucci mpiannucci Nov 21, 2024

I think pinning icechunk is the right thing to do. The read_only keyword isn't in the latest released version, so it's not a concern until then. Sorry for the confusion.

> specifically update and overwrite (will these both just be True whenever read_only is False

Correct, both of those would just be false

+    else:
+        root_group = Group.from_store(store=store)
norlandrhagen marked this conversation as resolved.

     # TODO this is Frozen, the API for setting attributes must be something else
     # root_group.attrs = ds.attrs
@@ -63,6 +68,7 @@ def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:
         ds.attrs,
         store=store,
         group=root_group,
+        append_dim=append_dim,
     )


@@ -71,6 +77,7 @@ def write_variables_to_icechunk_group(
     attrs,
     store,
     group,
+    append_dim: Optional[str] = None,
 ):
     virtual_variables = {
         name: var
@@ -96,6 +103,7 @@ def write_variables_to_icechunk_group(
             group=group,
             name=name,
             var=var,
+            append_dim=append_dim,
         )
TomNicholas marked this conversation as resolved.


@@ -104,6 +112,7 @@ def write_variable_to_icechunk(
     group: "Group",
     name: str,
     var: Variable,
+    append_dim: Optional[str] = None,
 ) -> None:
     """Write a single (possibly virtual) variable into an icechunk store"""
     if isinstance(var.data, ManifestArray):
@@ -112,6 +121,7 @@ def write_variable_to_icechunk(
             group=group,
             name=name,
             var=var,
+            append_dim=append_dim,
         )
     else:
         raise ValueError(
@@ -124,15 +134,37 @@ def write_virtual_variable_to_icechunk(
     group: "Group",
     name: str,
     var: Variable,
+    append_dim: Optional[str] = None,
 ) -> None:
TomNicholas marked this conversation as resolved.
     """Write a single virtual variable into an icechunk store"""
     ma = cast(ManifestArray, var.data)
     zarray = ma.zarray
+    shape = zarray.shape
+    mode = store.mode.str
TomNicholas marked this conversation as resolved.
+
+    # Aimee: resize the array if it already exists
+    # TODO: assert chunking and encoding is the same
+    existing_keys = tuple(group.array_keys())
Member

Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
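
A minimal sketch of the kind of pre-append check this suggests, written as plain Python so it stands alone. The function name `check_compatible_for_append` and its arguments are hypothetical, not part of VirtualiZarr or zarr-python, and dtypes and chunk shapes are assumed to be comparable with `==`:

```python
def check_compatible_for_append(existing_dtype, existing_chunks, new_dtype, new_chunks):
    """Raise ValueError if new data cannot be appended to the existing array.

    Hypothetical helper: compares only dtype and chunk shape.
    """
    if existing_dtype != new_dtype:
        raise ValueError(
            f"Cannot append: dtype {new_dtype!r} does not match "
            f"existing dtype {existing_dtype!r}"
        )
    if tuple(existing_chunks) != tuple(new_chunks):
        raise ValueError(
            f"Cannot append: chunk shape {tuple(new_chunks)} does not match "
            f"existing chunk shape {tuple(existing_chunks)}"
        )
```

A test for the writer could then assert that a mismatched append raises `ValueError` with a message naming the offending property.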

+    append_axis, existing_num_chunks = None, None
+    if name in existing_keys and mode == "a":
+        # resize
+        dims = var.dims
+        if append_dim in dims:
+            append_axis = dims.index(append_dim)
+            existing_array = group[name]
+            existing_size = existing_array.shape[append_axis]
+            existing_num_chunks = int(
+                existing_size / existing_array.chunks[append_axis]
+            )
+            new_shape = list(existing_array.shape)
Member

There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
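
One way to throw in that case, as a sketch only: derive the existing chunk count with `divmod` and refuse to append when the existing length along the append axis is not a whole number of chunks. The helper name `existing_chunk_count` is hypothetical and does not appear in this PR:

```python
def existing_chunk_count(existing_size: int, chunk_length: int) -> int:
    """Return the number of chunks along the append axis.

    Raises ValueError if the last existing chunk is only partially filled,
    because appended chunk references could not be aligned to the chunk grid.
    """
    if chunk_length <= 0:
        raise ValueError(f"chunk_length must be positive, got {chunk_length}")
    num_chunks, remainder = divmod(existing_size, chunk_length)
    if remainder != 0:
        raise ValueError(
            f"Cannot append: existing length {existing_size} along the append "
            f"axis is not a multiple of the chunk length {chunk_length}"
        )
    return num_chunks
```

Compared with `int(existing_size / chunk_length)`, this uses integer arithmetic throughout and makes the partial-chunk case an explicit error rather than a silent truncation.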

+            new_shape[append_axis] += var.shape[append_axis]
+            shape = tuple(new_shape)
+            existing_array.resize(new_shape)
TomNicholas marked this conversation as resolved.

     # creates array if it doesn't already exist
     arr = group.require_array(
         name=name,
-        shape=zarray.shape,
+        shape=shape,
         chunk_shape=zarray.chunks,
         dtype=encode_dtype(zarray.dtype),
         codecs=zarray._v3_codec_pipeline(),
@@ -142,6 +174,7 @@ def write_virtual_variable_to_icechunk(
     )

     # TODO it would be nice if we could assign directly to the .attrs property
+    # Aimee: assert that new attributes are the same as existing attributes
     for k, v in var.attrs.items():
         arr.attrs[k] = encode_zarr_attr_value(v)
     arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)
@@ -156,6 +189,8 @@ def write_virtual_variable_to_icechunk(
         group=group,
         arr_name=name,
         manifest=ma.manifest,
+        append_axis=append_axis,
+        existing_num_chunks=existing_num_chunks,
     )


@@ -164,6 +199,8 @@ def write_manifest_virtual_refs(
     group: "Group",
     arr_name: str,
     manifest: ChunkManifest,
+    append_axis: Optional[int] = None,
+    existing_num_chunks: Optional[int] = None,
 ) -> None:
TomNicholas marked this conversation as resolved.
     """Write all the virtual references for one array manifest at once."""

@@ -181,8 +218,14 @@ def write_manifest_virtual_refs(
         ],
         op_flags=[["readonly"]] * 3,  # type: ignore
     )
+
     for path, offset, length in it:
         index = it.multi_index
+        if append_axis is not None:
+            list_index = list(index)
+            # Offset by the number of existing chunks on the append axis
+            list_index[append_axis] += existing_num_chunks
+            index = tuple(list_index)
         chunk_key = "/".join(str(i) for i in index)

         # set each reference individually
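
To illustrate the index offset in `write_manifest_virtual_refs` in isolation, here is a small standalone sketch. `offset_chunk_key` is a hypothetical helper (the PR does this inline inside the loop), and the `/` separator mirrors the `chunk_key` line in the diff:

```python
from typing import Optional, Tuple


def offset_chunk_key(
    index: Tuple[int, ...],
    append_axis: Optional[int] = None,
    existing_num_chunks: Optional[int] = None,
) -> str:
    """Build a chunk key, shifting the index along the append axis if given."""
    if append_axis is not None:
        if existing_num_chunks is None:
            raise ValueError("existing_num_chunks is required when append_axis is set")
        shifted = list(index)
        # New chunks are placed after the chunks already in the store
        shifted[append_axis] += existing_num_chunks
        index = tuple(shifted)
    return "/".join(str(i) for i in index)
```

For example, with 3 chunks already stored along axis 0, the first appended chunk at index `(0, 1, 2)` maps to the key `3/1/2`, while a call without `append_axis` leaves the index unchanged.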