Two of the three crucial fields for representing chunks in `ChunkManifest` currently use numpy's `int32`. This is inadequate for files whose chunks start beyond 2GB into the file. We could double that limit with `uint32`, or (as seems necessary given some of the monstrous netCDFs I've come across) just bite the bullet and implement with `int64`/`uint64`. https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/manifests/manifest.py#L74-L75
I think this is probably only strictly necessary for `offsets`, but it is probably also advisable for `lengths`. I guess the question is how concerned we are about handling chunks larger than 2GB (4GB with `uint32`).
In practice, this is what we should expect from files that require offsets beyond the limits of `int32`: `OverflowError: Python integer 2661073188 out of bounds for int32`
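For concreteness, a minimal sketch reproducing the failure with plain numpy (assuming NumPy 2.x overflow semantics; the array names are illustrative, not VirtualiZarr internals):

```python
import numpy as np

# Assigning a Python int beyond 2**31 - 1 into an int32 array overflows.
offsets = np.zeros(1, dtype=np.int32)
try:
    offsets[0] = 2661073188
except OverflowError as err:
    print(err)  # Python integer 2661073188 out of bounds for int32

# Widening the dtype makes the same offset representable.
offsets = np.zeros(1, dtype=np.uint64)
offsets[0] = 2661073188  # fine
```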
> I think this is probably only strictly necessary for `offsets` but is probably also advisable for `lengths`.
People making unchunked netCDF files with one variable that exceeds 2/4GB seems very plausible to me.
All of this is to handle cases with very inadvisably-sized chunks, but I think we want to future-proof as much as we can. `uint64` would mean no issues unless someone comes along with a netCDF file bigger than ~18 exabytes!
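A quick check of the upper bounds for each candidate dtype (a sketch using `np.iinfo`; the exabyte figure above is just `np.iinfo(np.uint64).max`):

```python
import numpy as np

# Maximum offset/length representable by each candidate dtype:
for dtype in (np.int32, np.uint32, np.int64, np.uint64):
    print(f"{np.dtype(dtype).name}: {np.iinfo(dtype).max:,} bytes")

# int32:  2,147,483,647 bytes              (~2.1 GB)
# uint32: 4,294,967,295 bytes              (~4.3 GB)
# int64:  9,223,372,036,854,775,807 bytes  (~9.2 EB)
# uint64: 18,446,744,073,709,551,615 bytes (~18.4 EB)
```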
> just bite the bullet and implement with `int64`/`uint64`
For the particular example in #104 (comment), using `uint64` for both `lengths` and `offsets` would actually only increase the in-memory size of the `ManifestArray` by a factor of 2, not 4, but that's just because the `paths` field is long and dominates the overall size.
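A back-of-envelope sketch of that effect (the field layout and path length here are hypothetical, not the actual `ManifestArray` internals; with shorter paths the ratio would be larger):

```python
import numpy as np

n_chunks = 1_000_000
# Hypothetical manifest fields: fixed-width unicode paths plus offsets/lengths.
paths = np.full(n_chunks, "s3://bucket/some/fairly/long/path/file.nc", dtype="<U64")

size_int32 = paths.nbytes + 2 * np.zeros(n_chunks, dtype=np.int32).nbytes
size_uint64 = paths.nbytes + 2 * np.zeros(n_chunks, dtype=np.uint64).nbytes
print(size_uint64 / size_int32)  # ~1.03: long paths dominate, so widening is cheap
```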
> In practice, this is what we should expect from files that require offsets beyond the limits of `int32`: `OverflowError: Python integer 2661073188 out of bounds for int32`
Alternatively we could catch and re-raise this error to explain the context.
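Something like this minimal sketch of the catch-and-re-raise idea (the function name and message are illustrative, not an actual VirtualiZarr API):

```python
import numpy as np

def check_within_manifest_dtype(offset: int, length: int) -> None:
    """Illustrative validation: re-raise int32 overflow with manifest context."""
    try:
        np.int32(offset)
        np.int32(length)
    except OverflowError as err:
        raise OverflowError(
            f"Chunk entry (offset={offset}, length={length}) does not fit in the "
            "int32 dtype used by ChunkManifest; the file likely exceeds 2GB. "
            "Widening the manifest dtype (e.g. to uint64) would avoid this."
        ) from err
```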