
In-memory representation of chunks: array instead of a dict? #33

Closed · TomNicholas opened this issue Mar 14, 2024 · 13 comments · Fixed by #107

@TomNicholas

Currently chunks are stored as a mapping of chunk keys to chunk entries, i.e. a manifest dict. The ManifestArray is just a thin wrapper over this; internally it's not really an array at all. Its main purpose is to do lazy concatenation, which is implemented via manipulating manifest dicts.
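For concreteness, such a manifest dict looks roughly like this (the paths and byte ranges here are invented):

# a manifest dict: chunk keys mapping to byte ranges within files
manifest = {
    "0.0": {"path": "s3://bucket/air.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/air.nc", "offset": 200, "length": 100},
    "1.0": {"path": "s3://bucket/air.nc", "offset": 300, "length": 100},
    "1.1": {"path": "s3://bucket/air.nc", "offset": 400, "length": 100},
}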

However, as was pointed out by @dcherian in fsspec/kerchunk#377 (comment), a lazily concatenated array is essentially just a chunked array. We could imagine an alternative design where ManifestArray works more like a dask array: it would hold a grid of chunks, each of which is itself an array (e.g. a numpy array).

It might make sense to re-implement ManifestArray to organise chunks in the same way dask does, perhaps storing some sort of ChunkReferenceArray object in place of numpy arrays. This would then handle concatenation at the chunk level, allow for indexing (as long as the indexers align with chunk boundaries), and possibly be implemented by vendoring code from inside dask.

In this design the ChunkManifest would be something that could be built from the ManifestArray when needed, not the fundamental object. This change could be done without changing the existing API of ManifestArray.
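A minimal sketch of that alternative design (the ChunkReferenceArray below is a hypothetical placeholder, and real dask-style concatenation involves more bookkeeping than this):

import numpy as np

class ChunkReferenceArray:
    """Hypothetical stand-in for one chunk: a reference to bytes, not the bytes themselves."""
    def __init__(self, path: str, offset: int, length: int):
        self.path, self.offset, self.length = path, offset, length

# the chunk grid: an object array of references, analogous to dask's block grid
grid = np.empty((2, 2), dtype=object)
grid[0, 0] = ChunkReferenceArray("s3://bucket/air.nc", 100, 100)
# ... fill the remaining cells ...

# lazy concatenation then happens at the chunk-grid level
other = np.empty((2, 2), dtype=object)
combined = np.concatenate([grid, other], axis=0)  # chunk grid of shape (4, 2)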

@TomNicholas

This is important for #16 and null value handling in general (#32), as it would allow for a ManifestArray of arbitrary shape to be backed by a manifest with all chunks, some chunks, or no chunks at all (so returning fill_value everywhere).

@rabernat commented Mar 14, 2024

There are basically three pieces of data you need for a chunk reference. This is roughly what our data structure looks like in Arraylake:

from dataclasses import dataclass

@dataclass
class ReferenceData:
    uri: str
    offset: int
    length: int

Storing these in separate Zarr arrays would offer major advantages in terms of compression. For the int data, we could use the delta codec plus a lossless compressor like Zstd to massively crush down the data. For the URIs, the VLenUTF8 codec (plus lossless compression) would work great.

It's likely that we could store millions of references this way using < 1MB of storage.
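A rough sketch of that storage layout using zarr v2 and numcodecs (all data invented; a best case where offsets are perfectly regular):

import numpy as np
import zarr
from numcodecs import Delta, VLenUTF8, Zstd

n = 1_000_000
offsets = np.arange(0, n * 4096, 4096, dtype="i8")  # regular offsets delta-encode down to almost nothing
lengths = np.full(n, 4096, dtype="i8")
uris = np.array(["s3://bucket/file.nc"] * n, dtype=object)

z_offsets = zarr.array(offsets, filters=[Delta(dtype="i8")], compressor=Zstd(level=5))
z_lengths = zarr.array(lengths, filters=[Delta(dtype="i8")], compressor=Zstd(level=5))
z_uris = zarr.array(uris, dtype=object, object_codec=VLenUTF8(), compressor=Zstd(level=5))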

@TomNicholas commented Mar 14, 2024

Your ReferenceData class is the same as my ChunkEntry class, i.e. one entry in a ChunkManifest.

https://github.com/TomNicholas/VirtualiZarr/blob/1e9273864b0e74a52249cf34b6d2afb5049e9e76/virtualizarr/manifests/manifest.py#L17

> Storing these in separate Zarr arrays would offer major advantages in terms of compression. For the int data, we could use the delta codec plus a lossless compressor like Zstd to massively crush down the data. For the URIs, the VLenUTF8 codec (plus lossless compression) would work great.
>
> It's likely that we could store millions of references this way using < 1MB of storage.

This seems like a cool idea, but possibly orthogonal to the design issue I'm talking about above? Sounds like you're suggesting a particular on-disk storage format for the chunk references. In this issue I'm talking only about the in-memory representation of the chunked n-dimensional array + manifest. We can write that to disk in a number of ways (as kerchunk JSON, as kerchunk parquet, as your triple-zarr-array suggestion here, etc.).

EDIT: That's a great point about compression though - the URL and length fields are likely to exhibit very little variation over the whole array, and so compress down to almost nothing.

@rabernat

Yeah I see what you mean. In my defense, the title of this issue does begin with the word "Store"! 😆

For the issue you are talking about (in memory representation of references), I think having an array indexed by chunk positions makes a lot of sense. The main downside would be in the case of very sparsely populated manifests (relative to the full array), in which case this would involve a lot of unused memory compared to the dict. I suppose you could opt to use an in-memory sparse array for that case.

@TomNicholas TomNicholas changed the title Store an array of chunks instead of a dict? In-memory representation of chunks: array instead of a dict? Mar 15, 2024
@rabernat

Getting super meta here...

What if we used an Xarray Dataset for the in-memory references, with three different variables (path, offset, length)? Then we could use Xarray to do the concatenation, broadcasting, etc. For example, this would really simplify concat_manifests.

There is something very satisfying about the idea of using Xarray itself to manage the manifests.
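A quick sketch of how that might look (the dimension names, data, and manifest_as_dataset helper are invented):

import numpy as np
import xarray as xr

def manifest_as_dataset(paths, offsets, lengths, dims):
    """Represent a chunk manifest as a Dataset over the chunk grid (hypothetical helper)."""
    return xr.Dataset(
        {"path": (dims, paths), "offset": (dims, offsets), "length": (dims, lengths)}
    )

m1 = manifest_as_dataset(
    np.full((2, 2), "s3://bucket/a.nc", dtype=object),
    np.zeros((2, 2), dtype="i8"),
    np.full((2, 2), 4096, dtype="i8"),
    dims=("t", "x"),
)
m2 = manifest_as_dataset(
    np.full((2, 2), "s3://bucket/b.nc", dtype=object),
    np.zeros((2, 2), dtype="i8"),
    np.full((2, 2), 4096, dtype="i8"),
    dims=("t", "x"),
)

# concatenation of manifests becomes plain xarray concatenation
combined = xr.concat([m1, m2], dim="t")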

@TomNicholas commented Mar 15, 2024

Interesting... I'm not seeing how that would really work though. We need to use different xarray Variables at the top-level for different netCDF variables / zarr arrays, so we would need 3 variables (path, offset, length) per variable (lat, lon, temp, etc.).

Or are you suggesting using xarray twice, at two levels of abstraction? Once to hold the chunk grid in each ManifestArray, and once to hold all the ManifestArrays. That would be kind of wild.

Another idea that's along the same lines would be to store the manifest entries in a structured numpy array, i.e. using a structured dtype. But that would require xarray to be able to wrap structured numpy arrays, which I bet it can't right now.

EDIT: Actually xarray wouldn't have to directly wrap the structured array, it would wrap a ManifestArray that wraps a structured array... That could actually work...

Also, concat_manifests isn't really that complicated in my opinion. The concatenation and broadcasting of the chunk manifests is pretty easy to describe just as manipulation of the chunk keys; that's one of the things I like about this design. It works well as an abstraction, and it's only a problem if we think it's really inefficient / can't represent some case we need.
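For example (a hand-rolled sketch, not the actual concat_manifests implementation), concatenating two manifest dicts along axis 0 is just a matter of shifting the first component of each chunk key:

def concat_manifest_dicts_axis0(m1: dict, m2: dict, n_chunks_axis0: int) -> dict:
    """Concatenate two manifest dicts by offsetting m2's chunk keys along axis 0."""
    shifted = {}
    for key, entry in m2.items():
        indices = key.split(".")
        indices[0] = str(int(indices[0]) + n_chunks_axis0)
        shifted[".".join(indices)] = entry
    return {**m1, **shifted}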

@TomNicholas commented Mar 15, 2024

This structured array idea could work nicely... You have a ManifestArray that has the shape and dtype of the netcdf variable / zarr array it refers to, but it wraps a single structured array with shape corresponding to the shape of the chunk grid.

The structured array has 3 fields: path, offset, and length. Offset and length are ints, but for the path we could use the variable-length string dtype that was apparently merged into numpy only about a month ago! EDIT: Ahh crap, it won't come out until numpy 2.0.

All concatenation / broadcasting of the ManifestArray would just defer to numpy handling concatenation / broadcasting of the wrapped structured array (which should then be super efficient). The ManifestArray just has a similar job to now - it carts around the correct .zarray info and calculates the new overall shape.

Or you could go even more meta and have a ManifestArray wrap a dask array which wraps numpy structured arrays? Then you can do concat via tree-reduce on the references...
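A quick sketch of that structured-array layout (an object dtype stands in for the path field here, pending the new variable-length string dtype; data invented):

import numpy as np

# one record per chunk: (path, offset, length)
entry_dtype = np.dtype([("path", "O"), ("offset", "i8"), ("length", "i8")])

# structured array shaped like the chunk grid
manifest = np.empty((2, 2), dtype=entry_dtype)
manifest[0, 0] = ("s3://bucket/air.nc", 100, 100)
# ... fill the remaining cells ...

# concatenation/broadcasting defer straight to numpy
taller = np.concatenate([manifest, manifest], axis=0)  # chunk grid of shape (4, 2)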

@TomNicholas commented Mar 15, 2024

One downside of all these in-memory array ideas compared to the dictionary chunk manifest we currently have is that I don't know how the array would support missing chunks. The structured array wouldn't have anywhere you could put a NaN either.

EDIT: I guess a path containing only an empty string could be understood to represent a missing chunk?

@rabernat

The current dictionary encoding is basically a poor man's sparse array. 😆

@rabernat

> I guess a path containing only an empty string could be understood to represent a missing chunk?

Or we could have a separate bitmask array.

The nice thing about manifest as dataset is that you can attach all kinds of chunk-level statistics. For example, chunk min, max, sum, count, etc.

@TomNicholas commented Mar 15, 2024

> The current dictionary encoding is basically a poor man's sparse array. 😆

Haha yeah it kinda is 😅

> Or we could have a separate bitmask array.

Okay so there are potentially multiple solutions to the missing chunk issue. @jhamman pointed out today that the main use case of this library is arrays where you do have every chunk (because netCDF files don't just omit chunks), so I don't think we are really talking about very sparse arrays anyway. The main reason to be able to represent NaNs is for padding with them (#22).

> The nice thing about manifest as dataset is that you can attach all kinds of chunk-level statistics. For example, chunk min, max, sum, count, etc.

You could do that in a structured array too, just by having extra fields. Not sure what the use case of those chunk-level statistics is in the context of this library though.

I'm tempted to make a PR to try out the structured array idea, as I feel like that's the most memory-optimized data structure we can use to represent the manifest (without writing something ourselves in Rust, #23).

@TomNicholas commented May 9, 2024

See #104 (comment) for a simple experiment showing that, using 3 (dense) numpy arrays, we should be able to represent a manifest that points to 1 million chunks using only ~24MB in memory.

Also note that apparently it isn't possible to put numpy 2.0's variable-length string dtype into a numpy structured array (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)), which means I need to change how I had started implementing #39 to use 3 separate arrays (for path, offset, and length) instead.
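The ~24MB figure is consistent with 24 bytes per chunk split across the three arrays; a rough way to see it (assuming an object dtype for paths and counting only the 8-byte pointers, not the string objects themselves):

import numpy as np

n = 1_000_000
offsets = np.empty(n, dtype="u8")  # 8 bytes per chunk
lengths = np.empty(n, dtype="u8")  # 8 bytes per chunk
paths = np.empty(n, dtype="O")     # 8 bytes per pointer on a 64-bit machine

print((offsets.nbytes + lengths.nbytes + paths.nbytes) / 1e6)  # 24.0 MB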

@dcherian commented May 9, 2024

FWIW I think the 3-array approach will be more performant: https://numpy.org/doc/stable/user/basics.rec.html#introduction

For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison.
