Entrypoint for storing metadata that scales with number of chunks #305
Comments
Thanks for raising this issue @TomNicholas! This is very related to the discussion here: #72. To elaborate on this point:
After thinking a bit more about it, I actually see two ways to make this happen in zarr v3. The first way uses the
I see pros and cons for either of these approaches, but I would love to see someone take either one and implement it.
My latest thinking on both this issue and chunk manifests involves going back to a concept that I have previously been critical of: delegating more responsibility and functionality to the Store. Currently Stores can only tell us basically three pieces of information about chunks: "does the chunk exist?", "how big is it?", and "what are its bytes?". These are the types of information that filesystems can readily provide. But there's no reason that a more specialized store couldn't hold more information about each chunk.

Of course, this just defers the problem of how to store this metadata to the Store. But that might be okay: that storage spec can evolve separately from the Zarr spec itself. If we go this route, the main thing we would have to define at the Zarr level is the interface for getting and setting chunk-level metadata, rather than the exact format.

At Earthmover, we are working on a new, open-source and open-spec Zarr store that combines transactional updates, chunk manifests, and chunk-level metadata. I believe that our design addresses all of the scalability concerns identified above. We aim to have something to share soon. Sorry for being vague; our intention is to release this in a fully baked form.
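As a concrete illustration of that interface idea, here is a minimal sketch, assuming a hypothetical pair of methods (`get_chunk_meta` / `set_chunk_meta`) and a dict-based schema; none of this is existing zarr-python API:

```python
from typing import Any


class ChunkMetadataStore:
    """Hypothetical store that can answer more about a chunk than
    "does it exist?", "how big is it?" and "what are its bytes?"."""

    def __init__(self) -> None:
        # In-memory for the sketch; a real store might back this with
        # sqlite, parquet, or a service. That choice is the store's
        # business, not the Zarr spec's.
        self._meta: dict[str, dict[str, Any]] = {}

    def set_chunk_meta(self, chunk_key: str, meta: dict[str, Any]) -> None:
        # e.g. {"path": "s3://bucket/file.nc", "offset": 1024, "length": 4096}
        # or   {"min": 0.0, "max": 3.7, "mean": 1.2}
        self._meta[chunk_key] = meta

    def get_chunk_meta(self, chunk_key: str) -> dict[str, Any]:
        return self._meta[chunk_key]
```

Under this model, the Zarr spec would only pin down the get/set interface; how the metadata is persisted would be a store implementation detail.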
I'll add a couple of asides:
In general, storing other per-chunk information in a sidecar binary file (sqlite, parquet, whatever) is fine, and is what kerchunk already does, after all. If you want such information to make it to the codecs, then you indeed also need a way to pass those things around - and that is where contexts come in.
We don't want de/encoding to be done by the store, however. In such a model, a store and its unique internal implementations become the whole of zarr, for each store type.
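For what it's worth, here is one way the sidecar-plus-context idea could look in practice: a sketch assuming a hypothetical `chunk_meta.parquet` sidecar with kerchunk-style columns (pandas parquet I/O requires pyarrow):

```python
import pandas as pd

# One row per chunk, kerchunk-style: where each chunk's bytes live.
df = pd.DataFrame(
    {
        "chunk_key": ["0.0", "0.1", "1.0", "1.1"],
        "path": ["a.nc", "a.nc", "a.nc", "a.nc"],
        "offset": [0, 4096, 8192, 12288],
        "length": [4096, 4096, 4096, 4096],
    }
)
df.to_parquet("chunk_meta.parquet")

# At read time, load the sidecar once and build a "context" mapping each
# chunk key to its metadata, to be passed to whatever needs it (e.g. codecs).
context = (
    pd.read_parquet("chunk_meta.parquet")
    .set_index("chunk_key")
    .to_dict(orient="index")
)
print(context["0.1"])  # {'path': 'a.nc', 'offset': 4096, 'length': 4096}
```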
This seems reasonable to me.
Where is the v3 storage spec? It just says "under construction".
Looking forward to discussing this once we hear the details!
It can scale more slowly, but it still scales with the data, which IMO means there is still a potential for scaling issues.
"May require", but only if none of the dimensions are much longer than the others. In the pathological case of a 1D zarr store it scales just as badly as the chunk manifest does! Whilst that's the worst case, it doesn't seem unlikely that people would make quasi-1D stores to hold e.g. time series or genomics datasets. A store like that with millions of chunks may still have hundreds of thousands of chunk length values.
I think we need a single general solution for how to store metadata which scales with the number of chunks in an array.
Context
Zarr aims to be arbitrarily scalable, via the assumption that in the model of `zarr.json` metadata + chunks of bytes, it doesn't matter how many chunks there are in a given zarr array: the metadata for that array will be of a constant, small, manageable size.

This assumption is broken in multiple proposed zarr enhancements that I am aware of:

1. the chunk manifest / storage manifest transformer proposal,
2. the per-chunk statistics ("accumulation") proposal,
3. variable-length chunking,
4. the per-chunk "context" idea.
In all of these cases there is some type of metadata that we want to include in the store, the size of which grows with the number of chunks. In (1) the proposal is to store a path, offset, and byte range for each chunk. In (2) it's to store a set of scalars per chunk (e.g. mean, median, mode, min, max). In (3) it's to store a 1D series of the lengths of each chunk along each dimension, which therefore scales with the number of chunks along one dimension rather than with the total number of chunks. The "context" idea in (4) was, I believe, to have certain zarr-array-level metadata, particularly related to encoding, be definable on a per-chunk basis. My understanding is that that idea failed to gain traction, but it's still another example of wanting to save per-chunk information in the store. I imagine there might be more ideas with this property in the future.
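To make the difference in scaling behaviour concrete, here is a quick back-of-envelope for a hypothetical 3D array with a 100 × 100 × 100 chunk grid:

```python
import math

grid = (100, 100, 100)  # chunks along each dimension

# Cases (1) and (2): one record per chunk, so the metadata scales with
# the *product* of the grid's extents.
per_chunk_records = math.prod(grid)  # 1_000_000

# Case (3): one length per chunk per dimension, so it scales with the
# *sum* of the grid's extents instead.
per_dim_records = sum(grid)  # 300

print(per_chunk_records, per_dim_records)
```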
Problem of scale
We need the ability for each of these types of metadata to also be arbitrarily large. For example, I personally want to use the chunk manifest idea to create a "virtual" zarr store whose arrays each have ~500k chunks. Stored as JSON, at roughly 100 bytes per manifest entry that implies ~50MB per array for the chunk manifest alone (and I have ~20 arrays like that, so ~1GB of metadata already).
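As a sanity check on those numbers, a rough estimate (the ~100 bytes per entry is an assumption; a JSON line like `"1.2.3": ["s3://bucket/file_0001.nc", 20134, 12813],` is in that ballpark):

```python
n_chunks = 500_000        # chunks per array
bytes_per_entry = 100     # assumed average size of one JSON manifest entry
n_arrays = 20

per_array = n_chunks * bytes_per_entry  # ~50 MB per array
total = per_array * n_arrays            # ~1 GB across the store

print(f"{per_array / 1e6:.0f} MB per array, {total / 1e9:.1f} GB total")
```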
All the use cases above have different approaches to this same problem. In the storage manifest transformer proposal (1) there is basically a section in the `zarr.json` which tells the reader that, to go and get a certain piece of metadata, it has to look in a separate file (the `manifest.json`). We then discussed whether that should actually be a parquet file, or even another zarr array (with shape = chunk grid shape; see the sketch below). (2) similarly proposes solving this using additional zarr arrays in a special `_accumulation_group` in the store. (3) doesn't solve it either, as the chunk sizes are just an array in the JSON metadata file (though it does mention parquet, as used by kerchunk, to solve this problem).

This problem has been identified as separable, and some specific solutions have been proposed, e.g. a zarr v3 extension for an external attributes file #229 (comment) (@rabernat) and support for non-JSON metadata and attributes #37, but those comments don't really identify the common thread of metadata which scales with the number of chunks.
General entrypoint for metadata which scales with # of chunks
The variable-length chunking case seems particularly important: in the chunk manifest case you only need to know what's in the manifest when you actually want to read bytes of data from the store, but with variable-length chunks you might need to know those lengths even when you merely list the contents of the store.
So whilst the suggested implementation for the chunk manifest within the v3 spec is to use a storage transformer, I wonder whether that approach would actually work for the other use cases above, and whether instead we should have some dedicated mechanism to use every time we want any metadata field which scales with the number of chunks. It would have a common syntax within the `zarr.json` file, but then we could either make a choice about the format in which to store the chunk-scaling metadata (e.g. parquet or zarr) or try to have that be flexible too (allowing for pointing to a database or whatever). A hypothetical sketch of such a syntax follows below.
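To be clear about what I mean by a common syntax, here is a purely hypothetical sketch (written as a Python dict for readability); none of these keys (`chunk_scaled_metadata`, `format`, `location`) exist in the v3 spec:

```python
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 10000],
    # ... the usual zarr.json fields elided ...
    "chunk_scaled_metadata": [
        {
            "name": "chunk_manifest",
            "format": "parquet",                   # or "zarr", "json", ...
            "location": "manifests/data.parquet",  # relative to store root
        },
        {
            "name": "chunk_lengths_dim0",
            "format": "zarr",
            "location": "meta_arrays/lengths_dim0",
        },
    ],
}
```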
(In)compatibility with the v3 Spec
I spoke to @d-v-b about this at length, and he seemed to think that there is no easy way to do this kind of arbitrary re-direction of metadata within the confines of the current v3 spec. My understanding of his argument is that in v3 right now all the top-level metadata that might be needed at store-listing time must be in a single self-contained `zarr.json` file per array. If there is some way to get around this within v3, I would be happy to be proven wrong here though!

Note this never came up in v2, as none of the above features were present.
Looking forward
We in the geosciences community really, really want chunk manifests in zarr, because the vast majority of our data is HDF5/netCDF4, and once we start treating zarr as an over-arching "Super-Format" we will also have a pretty strong reason to want variable-length chunks. If we cannot do this at scale within v3 then, in the worst case, we may end up going outside of v3 to get these features, which I think is an argument for seeing if we can squeeze something into the spec at the 11th hour that would support this.
Thoughts?
cc @jhamman @manzt @martindurant @jbms @joshmoore @AimeeB