Chunk initialization information from IceChunk manifests #448

DahnJ · 2024-12-05T15:18:24Z

Context

In zarr-developers/zarr-specs#300 I roughly outlined @SylveraIO's infrastructure to support incrementally populated zarr arrays and a potential road to aligning it with open-source initiatives like IceChunk and VirtualiZarr.

The full details can be read in the issue. To sum up, there are two necessary pieces:

On-disk storage format for chunk initialization information
API to query this information

IceChunk manifests are a potential solution for 1. This issue aims to validate that and, if necessary, steer the manifests such that they can efficiently serve this purpose.

Query pattern

The crucial query pattern here is

"Which chunks have been initialized in this region of the array?"

This is equivalent to a list_prefix operation that lists only the keys in a specific range.

A possible first iteration of this idea is fetching the results of list_prefix and then filter.

However, that may be impractical. Our arrays have 100s of millions and up to billions of total chunks, with 10s of millions chunks initialized.

More factors to keep in mind

Manifests will be transferred over a network (in our case from S3)
The chunks can be committed over a large number of individual commits (say 1e3 to 1e5 chunks per commit)

Benchmark

This is an initial modestly-sized benchmark to have something to share, will follow up with more benchmarking.

import icechunk
import zarr
from tqdm import tqdm

storage_config = icechunk.StorageConfig.filesystem("./icechunk-large")
store = icechunk.IcechunkStore.create(storage_config)

# 1e3*1e3=1e6 chunks in total
nchunks = int(1e3)
pixels_per_chunk = 10
shape = (nchunks * pixels_per_chunk, nchunks * pixels_per_chunk)
chunks = (pixels_per_chunk, pixels_per_chunk)

group = zarr.group(store)
array = group.create("my_array", shape=shape, chunks=chunks, dtype=int)
store.commit('initialize')

# initialize 1e3*1e2=1e5 chunks
for i in tqdm(range(1000)):
    for j in range(100):
        array[i*10, j*10] = 1

store.commit('write data')

# du -sh icechunk-large/manifests/*
# 3.6M	icechunk-large/manifests/B3DBQNQ0J08HN10JGW5G

# this takes about ~7s on my machine
len([a async for a in store.list_prefix('')])

The text was updated successfully, but these errors were encountered:

paraseba · 2024-12-05T16:57:08Z

@DahnJ a few questions:

"Which chunks have been initialized in this region of the array?"

Is the query going to be expressed in the chunk grid coordinates? Is the query a bounding box? For example, I want to know which of the chunks in the rectangle [0,0,0], [100,100,100] are initialized, and those 100 mean "the hundredth chunk in that dimension"
How big are the queries: What's the order of magnitude of the number of chunk coordinates that you expect as a response?
What performance would be good enough? Say for a 10M chunks array, could you give us an idea of the latency you would want for queries of different sizes?

DahnJ · 2024-12-06T01:07:20Z

Is the query going to be expressed in the chunk grid coordinates? Is the query a bounding box? For example, I want to know which of the chunks in the rectangle [0,0,0], [100,100,100] are initialized, and those 100 mean "the hundredth chunk in that dimension"

We can express them in chunk grid coordinates – we already have translation layers for that in place.

A common query is a bounding box spatially, but not a contiguous query in other dimensions, typically in time. So it would be a rectangle [0, 0], [100, 100] in the spatial dimensions but a non-contiguous list in other (e.g. time coordinates [0, 12, 24, 36, ..] to get the first month of each year). If this was a significant limitation though, we could still consider limiting queries to contiguous regions. This could especially be acceptable since the in-between (time) coordinates are often empty or almost empty.

How big are the queries: What's the order of magnitude of the number of chunk coordinates that you expect as a response?

In terms of number of initialized chunks, I expect a typical query to be between 1e3 and 1e5 chunks, with the largest queries going into 1e6-1e7.

In terms of the total number of chunks, the numbers would be larger by a factor of about 10-100. though this is probably less relevant for the manifests.

What performance would be good enough? Say for a 10M chunks array, could you give us an idea of the latency you would want for queries of different sizes?

Currently, with this information stored in Zarr, I can easily retrieve initialization status of 100s of millions of chunks in seconds locally, regardless of their initialization status.

However, the absolute majority of the accesses are from automatized pipelines where such query is typically a single operation in an otherwise longer-running pipeline. Furthermore, we expect IceChunk to provide performance improvements elsewhere that would outweigh a potential slower initial query. I think it's therefore absolutely acceptable for the query performance to drop.

I think this would be acceptable

typical query (order of 1e3-1e5 initialized chunks) taking seconds, definitely <1 minute
largest queries (order of 1e6-1e7 initialized chunks) could go into minutes

paraseba · 2024-12-09T16:49:05Z

I don't see any problems with Icechunk having this feature, quickly after 1.0 @DahnJ . We'll make it happen.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chunk initialization information from IceChunk manifests #448

Chunk initialization information from IceChunk manifests #448

DahnJ commented Dec 5, 2024

paraseba commented Dec 5, 2024

DahnJ commented Dec 6, 2024 •

edited

Loading

paraseba commented Dec 9, 2024 •

edited

Loading

Chunk initialization information from IceChunk manifests #448

Chunk initialization information from IceChunk manifests #448

Comments

DahnJ commented Dec 5, 2024

Context

Query pattern

Benchmark

paraseba commented Dec 5, 2024

DahnJ commented Dec 6, 2024 • edited Loading

paraseba commented Dec 9, 2024 • edited Loading

DahnJ commented Dec 6, 2024 •

edited

Loading

paraseba commented Dec 9, 2024 •

edited

Loading