
Support for large files in ChunkManifest #176

Closed
moradology opened this issue Jul 9, 2024 · 2 comments · Fixed by #177

Comments

@moradology (Contributor)

Two of the three crucial fields for representing chunks in ChunkManifest currently use numpy's int32. This is inadequate for files whose chunks may start more than 2GB into the file. We could double that range with uint32 or (as seems necessary given some of the monstrous netCDFs I've come across) just bite the bullet and implement with int64/uint64. https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/manifests/manifest.py#L74-L75

I think this is strictly necessary only for offsets, but it is probably also advisable for lengths. I guess the question is how concerned we are with handling chunks larger than 2GB (4GB with uint32).

In practice, this is what we should expect from files that require offsets beyond the limits of int32: `OverflowError: Python integer 2661073188 out of bounds for int32`
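A minimal sketch of the failure mode (on recent numpy versions, constructing an array from an out-of-range Python int raises exactly this error):

```python
import numpy as np

offset = 2661073188  # > 2**31 - 1 == 2147483647, so it cannot fit in int32

try:
    np.array([offset], dtype=np.int32)
except OverflowError as e:
    print(e)  # Python integer 2661073188 out of bounds for int32

# The same value fits comfortably in uint64.
offsets = np.array([offset], dtype=np.uint64)
print(offsets)  # [2661073188]
```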

@TomNicholas (Member) commented Jul 9, 2024

> I think this is strictly necessary only for offsets, but it is probably also advisable for lengths.

People making unchunked netCDF files with one variable that exceeds 2/4GB seems very plausible to me.

All of this is to handle cases with very inadvisably sized chunks, but I think we want to future-proof as much as we can. uint64 would mean no issues unless someone comes along with a netCDF file bigger than ~18.4 exabytes (2^64 bytes)!
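For reference, the bounds in question are easy to confirm with numpy:

```python
import numpy as np

# Upper bound of each candidate dtype for offsets/lengths:
for dt in (np.int32, np.uint32, np.uint64):
    print(np.dtype(dt).name, np.iinfo(dt).max)
# int32  2147483647            (~2 GiB)
# uint32 4294967295            (~4 GiB)
# uint64 18446744073709551615  (~18.4 EB)
```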

> just bite the bullet and implement with int64/uint64

For the particular example in #104 (comment), using uint64 for both length and offset would actually only increase the in-memory size of the ManifestArray by a factor of 2, not 4, but that's just because the paths field is long and dominates the overall size.
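The size arithmetic is easy to check with numpy structured dtypes. A sketch, with a made-up path width (the field names and fixed width are assumptions for illustration, not VirtualiZarr's actual manifest layout):

```python
import numpy as np

path_width = 100  # characters; numpy "U" strings store 4 bytes per character

old = np.dtype([("path", f"U{path_width}"), ("offset", np.int32), ("length", np.int32)])
new = np.dtype([("path", f"U{path_width}"), ("offset", np.uint64), ("length", np.uint64)])

print(old.itemsize, new.itemsize)  # 408 vs 416 bytes per entry
# The numeric fields double (4 -> 8 bytes each), but once the path
# string dominates the entry, the overall growth is much smaller.
```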

> In practice, this is what we should expect from files that require offsets beyond the limits of int32: `OverflowError: Python integer 2661073188 out of bounds for int32`

Alternatively we could catch and re-raise this error to explain the context.
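A sketch of the re-raise idea (a hypothetical helper, not the actual VirtualiZarr code), translating the bare numpy OverflowError into a message that points at the manifest dtype:

```python
import numpy as np

def to_offset_array(offsets, dtype=np.int32):
    """Build an offset array, re-raising overflow with manifest context."""
    try:
        return np.array(offsets, dtype=dtype)
    except OverflowError as err:
        raise ValueError(
            f"chunk offset out of range for {np.dtype(dtype).name}; "
            "this file likely needs 64-bit offsets in its ChunkManifest"
        ) from err
```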

@cisaacstern (Collaborator)

> uint64 would mean no issues unless someone comes along with a netCDF file bigger than ~18.4 exabytes

This is a really fun thread to read.
