Append to icechunk stores #272
@@ -1,4 +1,4 @@
-from typing import TYPE_CHECKING, cast
+from typing import TYPE_CHECKING, Optional, cast

 import numpy as np
 from xarray import Dataset
@@ -24,7 +24,9 @@
 }


-def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:
+def dataset_to_icechunk(
+    ds: Dataset, store: "IcechunkStore", append_dim: Optional[str] = None
+) -> None:
     """
     Write an xarray dataset whose variables wrap ManifestArrays to an Icechunk store.

@@ -51,7 +53,10 @@ def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:

     # TODO only supports writing to the root group currently
     # TODO pass zarr_format kwarg?
-    root_group = Group.from_store(store=store)
+    if store.mode.str == "a":
+        root_group = Group.open(store=store, zarr_format=3)
+    else:
+        root_group = Group.from_store(store=store)

     # TODO this is Frozen, the API for setting attributes must be something else
     # root_group.attrs = ds.attrs
@@ -63,6 +68,7 @@ def dataset_to_icechunk(ds: Dataset, store: "IcechunkStore") -> None:
         ds.attrs,
         store=store,
         group=root_group,
+        append_dim=append_dim,
     )

@@ -71,6 +77,7 @@ def write_variables_to_icechunk_group(
     attrs,
     store,
     group,
+    append_dim: Optional[str] = None,
 ):
     virtual_variables = {
         name: var
@@ -96,6 +103,7 @@ def write_variables_to_icechunk_group(
             group=group,
             name=name,
             var=var,
+            append_dim=append_dim,
         )

@@ -104,6 +112,7 @@ def write_variable_to_icechunk(
     group: "Group",
     name: str,
     var: Variable,
+    append_dim: Optional[str] = None,
 ) -> None:
     """Write a single (possibly virtual) variable into an icechunk store"""
     if isinstance(var.data, ManifestArray):
@@ -112,6 +121,7 @@ def write_variable_to_icechunk(
            group=group,
            name=name,
            var=var,
+           append_dim=append_dim,
        )
    else:
        raise ValueError(
@@ -124,15 +134,37 @@ def write_virtual_variable_to_icechunk(
     group: "Group",
     name: str,
     var: Variable,
+    append_dim: Optional[str] = None,
 ) -> None:
     """Write a single virtual variable into an icechunk store"""
     ma = cast(ManifestArray, var.data)
     zarray = ma.zarray
+    shape = zarray.shape
+    mode = store.mode.str

+    # Aimee: resize the array if it already exists
+    # TODO: assert chunking and encoding is the same
+    existing_keys = tuple(group.array_keys())
Review comment: Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
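For illustration, a sketch of the kind of guard being suggested here, reusing the existing_array and zarray names from the diff; the helper itself is hypothetical and not part of this PR:

def check_compatible_arrays(existing_array, zarray) -> None:
    # Hypothetical guard: refuse to append when dtype or chunk shape differ,
    # rather than relying on zarr-python to raise later.
    if existing_array.dtype != zarray.dtype:
        raise ValueError(
            f"Cannot append: existing dtype {existing_array.dtype} "
            f"does not match new dtype {zarray.dtype}"
        )
    if tuple(existing_array.chunks) != tuple(zarray.chunks):
        raise ValueError(
            f"Cannot append: existing chunk shape {existing_array.chunks} "
            f"does not match new chunk shape {zarray.chunks}"
        )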
+    append_axis, existing_num_chunks = None, None
+    if name in existing_keys and mode == "a":
+        # resize
+        dims = var.dims
+        if append_dim in dims:
+            append_axis = dims.index(append_dim)
+            existing_array = group[name]
+            existing_size = existing_array.shape[append_axis]
+            existing_num_chunks = int(
+                existing_size / existing_array.chunks[append_axis]
+            )
+            new_shape = list(existing_array.shape)
Review comment: There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
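A minimal sketch of the check being asked for, reusing existing_array, existing_size and append_axis from the diff; where exactly it would live (and its error message) is left open:

# Hypothetical guard: a ragged (smaller) final chunk would make the offset
# chunk keys land in the wrong place, so refuse to append in that case.
chunk_len = existing_array.chunks[append_axis]
if existing_size % chunk_len != 0:
    raise ValueError(
        f"Cannot append along axis {append_axis}: existing size {existing_size} "
        f"is not an exact multiple of chunk length {chunk_len}"
    )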
+            new_shape[append_axis] += var.shape[append_axis]
+            shape = tuple(new_shape)
+            existing_array.resize(new_shape)

     # creates array if it doesn't already exist
     arr = group.require_array(
         name=name,
-        shape=zarray.shape,
+        shape=shape,
         chunk_shape=zarray.chunks,
         dtype=encode_dtype(zarray.dtype),
         codecs=zarray._v3_codec_pipeline(),
@@ -142,6 +174,7 @@ def write_virtual_variable_to_icechunk(
     )

     # TODO it would be nice if we could assign directly to the .attrs property
+    # Aimee: assert that new attributes are the same as existing attributes
     for k, v in var.attrs.items():
         arr.attrs[k] = encode_zarr_attr_value(v)
     arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)
@@ -156,6 +189,8 @@ def write_virtual_variable_to_icechunk(
         group=group,
         arr_name=name,
         manifest=ma.manifest,
+        append_axis=append_axis,
+        existing_num_chunks=existing_num_chunks,
     )

@@ -164,6 +199,8 @@ def write_manifest_virtual_refs(
     group: "Group",
     arr_name: str,
     manifest: ChunkManifest,
+    append_axis: Optional[int] = None,
+    existing_num_chunks: Optional[int] = None,
 ) -> None:
     """Write all the virtual references for one array manifest at once."""

@@ -181,8 +218,14 @@ def write_manifest_virtual_refs(
         ],
         op_flags=[["readonly"]] * 3,  # type: ignore
     )

     for path, offset, length in it:
         index = it.multi_index
+        if append_axis is not None:
+            list_index = list(index)
+            # Offset by the number of existing chunks on the append axis
+            list_index[append_axis] += existing_num_chunks
+            index = tuple(list_index)
         chunk_key = "/".join(str(i) for i in index)

         # set each reference individually
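For orientation, a rough usage sketch of the append workflow this diff enables. The icechunk calls (StorageConfig.filesystem, IcechunkStore.open_existing with mode="a", commit) follow the 0.1.0a4-era names mentioned in the review discussion below and are assumptions, as are the file names and the virtualizarr.writers.icechunk module path:

import icechunk
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk  # assumed module path

# Assumed 0.1.0a4-era setup: an existing store created by an earlier write.
storage = icechunk.StorageConfig.filesystem("/tmp/append-demo")
store = icechunk.IcechunkStore.open_existing(storage=storage, mode="a")

# Virtual dataset for the new data to append (e.g. the next file in a time series).
vds = open_virtual_dataset("file2.nc", indexes={})

# Resizes the existing arrays along "time" and writes the new chunk references,
# offset past the existing chunks (see write_manifest_virtual_refs above).
dataset_to_icechunk(vds, store, append_dim="time")
store.commit("append file2 along time")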
Review comment: Just noting that mode has disappeared from the latest zarr beta and icechunk main branch. It is replaced with simply read_only as a property on the store. The mode will still exist on the Zarr hierarchy though, I believe.

Review comment: 🤔 I think we should leave this as-is, because it doesn't look like the read_only property is working as I would expect it to in the latest released version of icechunk (0.1.0a4): I note from the above that the only way to open the icechunk store in append mode is still using mode="a". Admittedly, I'm having trouble understanding how all the AccessMode properties are being set in icechunk.
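A hedged illustration of the behaviour described above, using only names that appear in this thread and in the diff (open_existing, mode, read_only, store.mode.str); the signatures are assumptions, not verified against icechunk:

# storage is assumed to be an existing icechunk StorageConfig.
# Opening with mode="a" is what the append path relies on today:
store = icechunk.IcechunkStore.open_existing(storage=storage, mode="a")
assert store.mode.str == "a"  # the check used in write_virtual_variable_to_icechunk

# What reportedly does not yet behave as expected in 0.1.0a4:
# store = icechunk.IcechunkStore.open_existing(storage=storage, read_only=False)
# (according to this thread, the store still opens in "r" mode)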
Review comment: @TomNicholas what do you think about pinning the icechunk version and changing the implementation when there is a new release?

Review comment: Even if that's not wrong, it's certainly highly counter-intuitive - is it reported upstream? That sounds reasonable @abarciauskas-bgse - though hopefully by the time you add documentation to this PR this is fixed upstream 🤞

Review comment: @mpiannucci do you think it is worth reporting how read_only=False on open_existing is not working as I would expect it to in the current version of icechunk (e.g. it still opens in "r" mode)? I'm also curious to know: if there is only the boolean read_only option, how will the other access mode properties be handled - specifically update and overwrite (will these both just be True whenever read_only is False)?

Review comment: I can't see any issue with the word read_only in it, so I just raised earth-mover/icechunk#404. In general I feel raising duplicate issues is better than not flagging potentially-undiscovered bugs.

Review comment: I think pinning icechunk is the right thing to do. The read_only keyword isn't in the latest released version, so it's not a concern until then. Sorry for the confusion. Correct, both of those would just be false.