This repository has been archived by the owner on Nov 12, 2024. It is now read-only.
Write virtual references to Icechunk #1
Looks like the right approach to me. It's work stealing, not parallel, so remember that to keep performance expectations in check.
This works, but it will set them in serial. You can do this in parallel by creating the tasks with `asyncio.ensure_future` instead of calling `await`. Then you gather them, as you do above, into a single future to return out and await later, or you can await the gathered future inside the function, depending on the level of concurrency you desire.
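A minimal sketch of that pattern, assuming a store object with a `set_virtual_ref` coroutine and an iterable of `(key, location, offset, length)` tuples; both names are placeholders for whatever this PR's code actually passes around:

```python
import asyncio


async def write_virtual_refs(store, virtual_refs):
    # Create all tasks up front instead of awaiting each call in turn,
    # so the requests run concurrently on the event loop.
    tasks = [
        asyncio.ensure_future(
            store.set_virtual_ref(key, location, offset=offset, length=length)
        )
        for key, location, offset, length in virtual_refs
    ]
    # Gather into a single awaitable; await it here, or return it to the
    # caller to await later, depending on the desired level of concurrency.
    return await asyncio.gather(*tasks)
```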
Interesting, but wouldn't it be better for this iteration to just happen on the icechunk end instead? That would both simplify this code and presumably be more efficient overall.
E.g. could icechunk expose a method for setting the virtual refs of an entire array at once (something like the hypothetical sketch below), then do the loop over the chunks in rust? Or do you think that's departing from the zarr-like abstraction that icechunk presents?
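To make that concrete, here is a hypothetical sketch of what such a bulk call might look like from Python. Every name and signature in it is an assumption made for illustration (icechunk exposes no such method in this PR), and the loop body only spells out the semantics that the rust side would implement natively:

```python
import numpy as np


async def set_virtual_refs(
    store,                      # assumed to expose a per-chunk set_virtual_ref coroutine
    array_path: str,            # e.g. "air_temperature"
    chunk_indices: np.ndarray,  # (n_chunks, ndim) integer chunk-grid coordinates
    paths: np.ndarray,          # (n_chunks,) file URLs
    offsets: np.ndarray,        # (n_chunks,) byte offsets
    lengths: np.ndarray,        # (n_chunks,) byte lengths
) -> None:
    """Register every virtual chunk of one array in a single call."""
    for idx, path, offset, length in zip(chunk_indices, paths, offsets, lengths):
        # Assumes zarr v3's default chunk key encoding: "c/" + "/"-joined indices.
        key = f"{array_path}/c/" + "/".join(str(int(i)) for i in idx)
        await store.set_virtual_ref(
            key, str(path), offset=int(offset), length=int(length)
        )
```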
On a technical level, supporting this on the rust side would be fast, but I am a little worried about departing from being a Zarr store first and foremost and leaking the abstraction. This would most likely require a new dependency on the numpy rust bindings to be efficient enough to make a difference.
@rabernat also mentioned the possibility of adding another package specifically to make icechunk/virtualizarr more efficient, which is another option if we don't want to put it in the main python bindings.
I think in the short term, let's focus on getting it to work as is and leave this as a future optimization after we have some idea of the real-world performance? Adding bulk support will not be as hard as rounding out all the initial support IMO.
This is reasonable, but the alternative requires me to iterate in Python over many millions of elements of numpy arrays and send every single one off to icechunk as a separate async request. That seems like unnecessary gymnastics when we already have all the elements arranged very neatly in memory.
Seems kinda over-complicated, but it could definitely solve the problem.
Agree that getting it to work correctly with a nice user API is the priority for now, and we can worry about this again after measuring performance.
Another idea to think about would be if virtualizarr's implementation of the chunk manifest actually used icechunk's rust implementation: zarr-developers/VirtualiZarr#23
This would be a stronger argument for a separate package IMO: a rust crate implementing the `Manifest` class that both `icechunk` and `virtualizarr` depended on, and which could be used to exchange references efficiently between the two libraries.
cc @rabernat