
Support for large files in ChunkManifest #176

Closed
moradology opened this issue Jul 9, 2024 · 2 comments · Fixed by #177

Comments

@moradology (Contributor)

Two of the three crucial fields for representing chunks in ChunkManifest currently use numpy's int32. This is inadequate for files whose chunks may start more than 2GB into the file. We could double that range with uint32 or (as seems necessary given some of the monstrous netCDFs I've come across) just bite the bullet and implement with int64/uint64. https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/manifests/manifest.py#L74-L75

I think this is strictly necessary only for offsets, but it is probably also advisable for lengths. I guess the question is how concerned we are with handling chunks larger than 2GB (4GB with uint32).

In practice, this is what we should expect from files that require offsets beyond the limits of int32: `OverflowError: Python integer 2661073188 out of bounds for int32`
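A minimal sketch of the failure mode (on recent numpy versions, constructing an array from an out-of-range Python int raises exactly this error):

```python
import numpy as np

offset = 2661073188  # > 2**31 - 1 == 2147483647, so it cannot fit in int32

try:
    np.array([offset], dtype=np.int32)
except OverflowError as e:
    print(e)  # Python integer 2661073188 out of bounds for int32

# The same value fits comfortably in uint64.
offsets = np.array([offset], dtype=np.uint64)
print(offsets)  # [2661073188]
```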

@TomNicholas (Member) commented Jul 9, 2024

> I think this is strictly necessary only for offsets, but it is probably also advisable for lengths.

People making unchunked netCDF files with one variable that exceeds 2/4GB seems very plausible to me.

All of this is to handle cases with very inadvisably sized chunks, but I think we want to future-proof as much as we can. uint64 would mean no issues unless someone comes along with a netCDF file bigger than ~18.4 exabytes (2^64 bytes)!
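For reference, the bounds in question are easy to confirm with numpy:

```python
import numpy as np

# Upper bound of each candidate dtype for offsets/lengths:
for dt in (np.int32, np.uint32, np.uint64):
    print(np.dtype(dt).name, np.iinfo(dt).max)
# int32  2147483647            (~2 GiB)
# uint32 4294967295            (~4 GiB)
# uint64 18446744073709551615  (~18.4 EB)
```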

> just bite the bullet and implement with int64/uint64

For the particular example in #104 (comment), using uint64 for both length and offset would actually only increase the in-memory size of the ManifestArray by a factor of 2, not 4, but that's just because the paths field is long and dominates the overall size.
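The size arithmetic is easy to check with numpy structured dtypes. A sketch, with a made-up path width (the field names and fixed width are assumptions for illustration, not VirtualiZarr's actual manifest layout):

```python
import numpy as np

path_width = 100  # characters; numpy "U" strings store 4 bytes per character

old = np.dtype([("path", f"U{path_width}"), ("offset", np.int32), ("length", np.int32)])
new = np.dtype([("path", f"U{path_width}"), ("offset", np.uint64), ("length", np.uint64)])

print(old.itemsize, new.itemsize)  # 408 vs 416 bytes per entry
# The numeric fields double (4 -> 8 bytes each), but once the path
# string dominates the entry, the overall growth is much smaller.
```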

> In practice, this is what we should expect from files that require offsets beyond the limits of int32: `OverflowError: Python integer 2661073188 out of bounds for int32`

Alternatively we could catch and re-raise this error to explain the context.
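A sketch of the re-raise idea (a hypothetical helper, not the actual VirtualiZarr code), translating the bare numpy OverflowError into a message that points at the manifest dtype:

```python
import numpy as np

def to_offset_array(offsets, dtype=np.int32):
    """Build an offset array, re-raising overflow with manifest context."""
    try:
        return np.array(offsets, dtype=dtype)
    except OverflowError as err:
        raise ValueError(
            f"chunk offset out of range for {np.dtype(dtype).name}; "
            "this file likely needs 64-bit offsets in its ChunkManifest"
        ) from err
```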

@cisaacstern (Collaborator)

> uint64 would mean no issues unless someone comes along with a netCDF file bigger than ~18.4 exabytes

This is a really fun thread to read.
