Chunk initialization information from IceChunk manifests #448
Comments
@DahnJ a few questions:
We can express them in chunk grid coordinates; we already have translation layers for that in place. A common query is a spatial bounding box that is not contiguous in the other dimensions, typically time. So it would be a rectangle
In terms of the number of initialized chunks, I expect a typical query to cover between 1e3 and 1e5 chunks, with the largest queries reaching 1e6-1e7. In terms of the total number of chunks, the numbers would be larger by a factor of about 10-100, though this is probably less relevant for the manifests.
Currently, with this information stored in Zarr, I can easily retrieve the initialization status of 100s of millions of chunks in seconds locally, regardless of their initialization status. However, the vast majority of accesses come from automated pipelines, where such a query is typically a single operation in an otherwise longer-running pipeline. Furthermore, we expect IceChunk to provide performance improvements elsewhere that would outweigh a potentially slower initial query. I think it's therefore absolutely acceptable for the query performance to drop.
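To make the query shape above concrete, here is a minimal sketch of a query expressed in chunk grid coordinates: a contiguous spatial box combined with a non-contiguous set of time indices. The `(time, y, x)` axis order and the helper names `query_chunk_coords` and `chunk_key` are assumptions for illustration, not part of IceChunk or our pipeline; the `c/…` key layout follows the default Zarr v3 chunk key encoding.

```python
from itertools import product
from typing import Iterable, Iterator, Tuple

# (time, y, x) chunk-grid indices; this axis order is an assumption.
ChunkCoord = Tuple[int, int, int]


def query_chunk_coords(
    time_indices: Iterable[int], y_range: range, x_range: range
) -> Iterator[ChunkCoord]:
    """Yield every chunk coordinate covered by the query.

    The spatial part is a contiguous box (two ranges); the time part is an
    arbitrary, possibly non-contiguous set of chunk indices.
    """
    return product(sorted(set(time_indices)), y_range, x_range)


def chunk_key(coord: ChunkCoord, prefix: str = "c") -> str:
    """Format a coordinate using the default Zarr v3 key layout, e.g. c/0/10/20."""
    return "/".join([prefix, *map(str, coord)])


# Hypothetical query: three scattered time steps over a 2 x 3 spatial box.
coords = list(query_chunk_coords([0, 5, 9], range(10, 12), range(20, 23)))
keys = [chunk_key(c) for c in coords]
```

The number of coordinates is the product of the three extents, which is how a modest spatial box over many time steps reaches the 1e3 to 1e5 range mentioned above.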
I don't see any problems with Icechunk having this feature shortly after 1.0, @DahnJ. We'll make it happen.
Context
In zarr-developers/zarr-specs#300 I roughly outlined @SylveraIO's infrastructure to support incrementally populated zarr arrays and a potential road to aligning it with open-source initiatives like IceChunk and VirtualiZarr.
The full details can be read in the issue. To sum up, there are two necessary pieces:
IceChunk manifests are a potential solution for 1. This issue aims to validate that and, if necessary, steer the manifests such that they can efficiently serve this purpose.
Query pattern
The crucial query pattern here is
This is equivalent to a `list_prefix` operation that lists only the keys in a specific range.

A possible first iteration of this idea is to fetch the results of `list_prefix` and then filter them.

However, that may be impractical. Our arrays have 100s of millions and up to billions of total chunks, with 10s of millions of chunks initialized.
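As a rough sketch of the "list, then filter" iteration, the snippet below takes an iterable of already-listed keys and keeps only those that fall inside a query. The listing itself is stubbed with a plain Python list rather than a real store call, and the helper names, the `(time, y, x)` axis order, and the `c/…` key layout (the default Zarr v3 chunk key encoding) are assumptions for illustration only.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_chunk_key(key: str) -> Optional[Tuple[int, ...]]:
    """Parse a key such as 'c/5/11/22' into (5, 11, 22); return None for non-chunk keys."""
    parts = key.split("/")
    if len(parts) < 2 or parts[0] != "c":
        return None
    try:
        return tuple(int(p) for p in parts[1:])
    except ValueError:
        return None


def initialized_in_query(
    listed_keys: Iterable[str],
    time_indices: Iterable[int],
    y_range: range,
    x_range: range,
) -> Iterator[Tuple[int, int, int]]:
    """Yield the coordinates of listed (initialized) chunks that fall inside the query."""
    wanted_times = set(time_indices)
    for key in listed_keys:
        coord = parse_chunk_key(key)
        if coord is None or len(coord) != 3:
            continue  # skip metadata documents and keys of other shapes
        t, y, x = coord
        if t in wanted_times and y in y_range and x in x_range:
            yield (t, y, x)


# Stand-in for the output of a prefix listing; a real store would stream these keys.
listed = ["c/0/10/20", "c/1/10/20", "c/5/11/22", "c/5/12/22", "zarr.json"]
hits = list(initialized_in_query(listed, [0, 5, 9], range(10, 12), range(20, 23)))
```

Note that this scans every listed key, so its cost grows with the number of initialized chunks in the whole array rather than with the size of the query.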
More factors to keep in mind
Benchmark
This is an initial, modestly sized benchmark to have something to share; I will follow up with more benchmarking.