Chunk initialization information from IceChunk manifests #448

Open
DahnJ opened this issue Dec 5, 2024 · 3 comments

Comments

DahnJ commented Dec 5, 2024

Context

In zarr-developers/zarr-specs#300 I roughly outlined @SylveraIO's infrastructure to support incrementally populated zarr arrays and a potential road to aligning it with open-source initiatives like IceChunk and VirtualiZarr.

The full details can be read in the issue. To sum up, there are two necessary pieces:

  1. On-disk storage format for chunk initialization information
  2. API to query this information

IceChunk manifests are a potential solution for 1. This issue aims to validate that and, if necessary, steer the manifests such that they can efficiently serve this purpose.

Query pattern

The crucial query pattern here is

"Which chunks have been initialized in this region of the array?"

This is equivalent to a list_prefix operation that lists only the keys in a specific range.

A possible first iteration of this idea is fetching the results of list_prefix and then filtering them.

However, that may be impractical. Our arrays have hundreds of millions and up to billions of total chunks, with tens of millions of chunks initialized.
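A minimal sketch of that fetch-then-filter step, for illustration only. It assumes the keys follow the default Zarr v3 chunk key encoding (`<array path>/c/<i>/<j>/...`) and that the query is a half-open bounding box in chunk grid coordinates. The helper names `parse_chunk_key` and `chunks_in_box` are hypothetical and not part of IceChunk or zarr.

```python
from typing import Iterable, Iterator, Optional, Sequence, Tuple


def parse_chunk_key(key: str, array_path: str) -> Optional[Tuple[int, ...]]:
    """Return chunk grid coordinates for a key such as 'my_array/c/3/7'.

    Returns None for anything that is not a chunk key of this array
    (for example a metadata document or a key of another array).
    Assumption: default Zarr v3 chunk key encoding with '/' as separator.
    """
    prefix = f"{array_path}/c/"
    if not key.startswith(prefix):
        return None
    try:
        return tuple(int(part) for part in key[len(prefix):].split("/"))
    except ValueError:
        return None


def chunks_in_box(
    keys: Iterable[str],
    array_path: str,
    lower: Sequence[int],
    upper: Sequence[int],
) -> Iterator[Tuple[int, ...]]:
    """Yield coordinates of listed chunks inside the box [lower, upper)."""
    for key in keys:
        coords = parse_chunk_key(key, array_path)
        if coords is None or len(coords) != len(lower):
            continue
        if all(lo <= c < hi for lo, c, hi in zip(lower, coords, upper)):
            yield coords


# Hypothetical keys, standing in for the output of a full listing.
example_keys = [
    "my_array/zarr.json",
    "my_array/c/0/0",
    "my_array/c/5/12",
    "my_array/c/100/3",
]
in_box = list(chunks_in_box(example_keys, "my_array", (0, 0), (10, 20)))
```

Note that this still touches every key in the listing, so its cost grows with the total number of initialized chunks rather than with the size of the queried region.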

More factors to keep in mind

  • Manifests will be transferred over a network (in our case from S3)
  • The chunks can be committed over a large number of individual commits (say 1e3 to 1e5 chunks per commit)

Benchmark

This is an initial, modestly sized benchmark so that there is something to share. I will follow up with more benchmarking.

import icechunk
import zarr
from tqdm import tqdm

storage_config = icechunk.StorageConfig.filesystem("./icechunk-large")
store = icechunk.IcechunkStore.create(storage_config)

# 1e3*1e3=1e6 chunks in total
nchunks = int(1e3)
pixels_per_chunk = 10
shape = (nchunks * pixels_per_chunk, nchunks * pixels_per_chunk)
chunks = (pixels_per_chunk, pixels_per_chunk)

group = zarr.group(store)
array = group.create("my_array", shape=shape, chunks=chunks, dtype=int)
store.commit('initialize')

# initialize 1e3*1e2=1e5 chunks
for i in tqdm(range(1000)):
    for j in range(100):
        array[i*10, j*10] = 1

store.commit('write data')

# du -sh icechunk-large/manifests/*
# 3.6M	icechunk-large/manifests/B3DBQNQ0J08HN10JGW5G

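# count all keys in the store; this takes ~7s on my machine
# (the top-level async comprehension needs an async-aware REPL such as IPython or Jupyter)
len([a async for a in store.list_prefix('')])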
Contributor

paraseba commented Dec 5, 2024

@DahnJ a few questions:

  • "Which chunks have been initialized in this region of the array?"

    Is the query going to be expressed in the chunk grid coordinates? Is the query a bounding box? For example, I want to know which of the chunks in the rectangle [0,0,0], [100,100,100] are initialized, and those 100 mean "the hundredth chunk in that dimension"

  • How big are the queries: What's the order of magnitude of the number of chunk coordinates that you expect as a response?

  • What performance would be good enough? Say for a 10M chunks array, could you give us an idea of the latency you would want for queries of different sizes?

Author

DahnJ commented Dec 6, 2024

Is the query going to be expressed in the chunk grid coordinates? Is the query a bounding box? For example, I want to know which of the chunks in the rectangle [0,0,0], [100,100,100] are initialized, and those 100 mean "the hundredth chunk in that dimension"

We can express them in chunk grid coordinates – we already have translation layers for that in place.
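For readers unfamiliar with such a translation, here is a rough sketch of how an array-coordinate region maps to a chunk grid bounding box. It assumes a regular chunk grid and half-open regions. The function `region_to_chunk_box` is a hypothetical illustration, not our actual translation layer.

```python
from typing import Sequence, Tuple


def region_to_chunk_box(
    start: Sequence[int],
    stop: Sequence[int],
    chunk_shape: Sequence[int],
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Map the array region [start, stop) to a chunk grid box [lower, upper).

    Assumption: every chunk has the same shape (regular chunk grid).
    """
    # First chunk touched in each dimension: floor division.
    lower = tuple(s // c for s, c in zip(start, chunk_shape))
    # One past the last chunk touched: ceiling division of the exclusive stop.
    upper = tuple(-(-e // c) for e, c in zip(stop, chunk_shape))
    return lower, upper


# With 10x10 chunks, pixels [5, 25) x [20, 30) touch chunk rows 0..2 and chunk column 2.
lower, upper = region_to_chunk_box((5, 20), (25, 30), (10, 10))
```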

A common query is a bounding box in the spatial dimensions combined with a non-contiguous selection in other dimensions, typically time. So it would be a rectangle [0, 0], [100, 100] in the spatial dimensions but a non-contiguous list in the others (e.g. time coordinates [0, 12, 24, 36, ..] to get the first month of each year). If this were a significant limitation, though, we could still consider limiting queries to contiguous regions. That could be acceptable, especially since the in-between (time) coordinates are often empty or almost empty.
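To make the shape of such a mixed query concrete, here is a small sketch in which each dimension gets its own selector: a `range` for contiguous dimensions and a `frozenset` for an explicit list. The dimension order (time, y, x), the half-open spatial bounds, and the names `matches` and `select_chunks` are assumptions made for illustration only.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple, Union

# One selector per dimension: a range (contiguous) or an explicit set of indices.
Selector = Union[range, FrozenSet[int]]


def matches(coords: Sequence[int], selectors: Sequence[Selector]) -> bool:
    """True if every chunk coordinate is accepted by its dimension's selector."""
    return len(coords) == len(selectors) and all(
        c in sel for c, sel in zip(coords, selectors)
    )


def select_chunks(
    initialized: Iterable[Tuple[int, ...]],
    selectors: Sequence[Selector],
) -> List[Tuple[int, ...]]:
    """Keep the initialized chunk coordinates that fall inside the query."""
    return [coords for coords in initialized if matches(coords, selectors)]


# Hypothetical initialized chunks, in (time, y, x) chunk grid coordinates.
initialized = [(0, 5, 5), (1, 5, 5), (12, 99, 0), (24, 100, 3), (36, 0, 0)]

# First month of each year in time, bounding box [0, 100) x [0, 100) in space.
query = (frozenset({0, 12, 24, 36}), range(0, 100), range(0, 100))
selected = select_chunks(initialized, query)
```

Membership tests on both `range` and `frozenset` are constant time for integers, so the cost is dominated by iterating over the initialized chunks.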

  • How big are the queries: What's the order of magnitude of the number of chunk coordinates that you expect as a response?

In terms of number of initialized chunks, I expect a typical query to be between 1e3 and 1e5 chunks, with the largest queries going into 1e6-1e7.

In terms of the total number of chunks, the numbers would be larger by a factor of about 10-100, though this is probably less relevant for the manifests.

  • What performance would be good enough? Say for a 10M chunks array, could you give us an idea of the latency you would want for queries of different sizes?

Currently, with this information stored in Zarr, I can easily retrieve initialization status of 100s of millions of chunks in seconds locally, regardless of their initialization status.

However, the vast majority of accesses come from automated pipelines, where such a query is typically a single operation in an otherwise longer-running pipeline. Furthermore, we expect IceChunk to provide performance improvements elsewhere that would outweigh a potentially slower initial query. I think it's therefore absolutely acceptable for the query performance to drop.

I think this would be acceptable:

  • typical query (order of 1e3-1e5 initialized chunks) taking seconds, definitely <1 minute
  • largest queries (order of 1e6-1e7 initialized chunks) could go into minutes

Contributor

paraseba commented Dec 9, 2024

I don't see any problems with Icechunk having this feature soon after 1.0, @DahnJ. We'll make it happen.
