Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing to add / Adding a sharding spec #127

Closed
jstriebel opened this issue Feb 7, 2022 · 15 comments
Closed

Allowing to add / Adding a sharding spec #127

jstriebel opened this issue Feb 7, 2022 · 15 comments

Comments

@jstriebel
Copy link
Member

As discussed in zarr-developers/zarr-python#877, it would be great to have sharding support in zarr. The current prototype is based around translating chunk-access into partial-read/writes of shards, where shards are one unit in a store. This needs some translation logic between an array chunk and a storage unit (plus making sure that this can happen efficiently, e.g. by bundling multiple chunks into one access-request). Atm there are prototypes for this logic to be either

I think it would be great to extend the specification to allow sharding, by either

  • adding a specific sharding spec and terminology, or
  • adding a general "translation" interface, with the specific extension for sharding.

The second approach would hopefully lead to a standard way how to include such extensions. They still need broad compatibility and support across implementations, but having a common terminology and specs for similar extensions might be worthwhile. Related issues seem to be #62, #115, #82, #49, #76.

I'm not quite familiar with the process around the specs. Besides opinions about the approaches and terminology, I'd be happy about practical next steps to update the specs for sharding support.

@rabernat
Copy link
Contributor

rabernat commented Feb 7, 2022

Thanks so much @jstriebel for your excellent and sustained work on this.

I'm not quite familiar with the process around the specs.

That is probably because we have not managed to determine a clearly defined process for updating the spec yet - see zarr-developers/governance#14 for discussion. 🤦 The central challenge is that the existing spec was basically just written by one person (@alimanfoo), whereas now we have many different stakeholders and implementors to try to align.

At our last Zarr SC meeting, I volunteered myself to be the one to make a proposal for a concrete way forward here.

My initial question for you would be whether you are okay with the sharding stuff being part of the Zarr V3 spec, and not touching V2?. My feeling is that such a major change is simply not backwards compatible with existing V2 implementations and therefore requires a major version change. Ideally V3 will formally allow protocol extensions (see #49), of which sharding may be considered one example.

If so, I will move forward with a specific process proposal for expanding the V3 spec to support sharding via an extension.

@jakirkham
Copy link
Member

cc @joshmoore

@jbms
Copy link
Contributor

jbms commented Feb 7, 2022

@rabernat I would absolutely agree that sharding needs to be part of a new version of zarr, since it is definitely not backwards compatible .

However, I think it would be very valuable to have this sharding standardized in the near term so that it can be used without risk of creating interoperability problems.

My current uninformed perspective is that zarr v3 is very much in flux, and by making sharding a part of zarr v3 rather than say zarr 2.5 (i.e. zarr v2 plus sharding support), that would mean that any actual use would be postponed indefinitely.

On the other hand if there is a core portion of zarr v3 that is ready to be finalized as soon as sharding is ready then it would be reasonable to combine them.

@jakirkham
Copy link
Member

Already there are a couple implementations of the spec. Namely zarrita & xtensor-zarr.

While yes there are still changes happening in v3, that is something this particular proposal is doing as well.

Personally would push for getting this in v3. Trying to split efforts amongst different spec versions is likely to move slower than simply focusing on one spec.

Additionally this can be viewed as a "carrot" to motivate people to move to the new spec.

@alimanfoo
Copy link
Member

Hi all, here's my 2c, please take with a pinch of salt as I'm not fully up to speed with all the latest developments.

I think it would be worth doing a round of further work on the zarr core protocol v3 spec to accommodate sharding. Before this work came along I had considered the v3 spec basically done, but in hindsight I'm glad we didn't quite manage to "finish it" until now. As John says this will be a good additional incentive to migrate to v3.

The core protocol v3 spec would need to be modified to introduce new concepts and terminology that allow for a decoupling between encoded chunks and storage objects, i.e., that allow for multiple encoded chunks to be stored in the same storage object. This includes some term that captures the idea of a protocol for organising encoded chunks into storage objects, something like "sharding scheme" or "storage scheme". Let's use "sharding scheme" for the time being.

The core protocol spec would then need to define the simple sharding scheme which involves one encoded chunk per storage object, and provide a hook in the array metadata for declaring which sharding scheme is in use. Other sharding schemes, such as the one prototyped at alimanfoo/zarrita#40, could then be defined via protocol extensions.

An analogy is how the core protocol spec defines the general concept of "chunk grids", then defines the simplest case of "regular chunk grids", and leaves definition of other grid types such as rectilinear grids to protocol extensions. The core protocol spec also defines a hook in the array metadata to declare which grid type is in use and describes how to refer to grid types defined in a protocol extension.

Alongside this work on the core protocol spec, I'd suggest developing a protocol extension spec which defines the sharding scheme that is prototyped at alimanfoo/zarrita#40.

A question for how to modify the core protocol spec is whether we can keep the concept of stores and the abstract store interface as-is, and just introduce an additional abstraction layer that sits in front of the store and deals with how encoded chunks are organised into storage objects; or whether store interface would need some changes. I'd guess at least some additional method that allows retrieving a byte range of an object would be needed.

In any case, the storage protocol section would definitely need some reworking to accommodate sharding.

I hope that's useful 😄

@jbms
Copy link
Contributor

jbms commented Feb 7, 2022

From a quick look at the zarr v3 spec I don't see any implementation issues. I do have a few comments --- however, per the comment by @rabernat it is unclear if there is a process for considering such comments.

Once the format is used for non-experimental purposes, then it is effectively frozen by one means or another. I see that zarr v3 has been under development for several years but is it plausible that this freezing would happen soon?

@alimanfoo
Copy link
Member

Once the format is used for non-experimental purposes, then it is effectively frozen by one means or another. I see that zarr v3 has been under development for several years but is it plausible that this freezing would happen soon?

Yes I think we all want to freeze it ASAP, or at least publish it in a beta-like state, meaning we only fix errors and make clarifications.

We do still need to work out the process for developing and publishing specs. In the interim I'd suggest (1) any comments on the current v3 core protocol spec draft are captured as issues on this (zarr-specs) repo, and (2) we ask if anyone would be willing to volunteer to do a round of editing on the v3 core protocol spec to accommodate sharding into the protocol, which could be submitted as a PR on this repo.

@jstriebel
Copy link
Member Author

Thanks everyone!

My initial question for you would be whether you are okay with the sharding stuff being part of the Zarr V3 spec, and not touching V2?. My feeling is that such a major change is simply not backwards compatible with existing V2 implementations and therefore requires a major version change. Ideally V3 will formally allow protocol extensions (see #49), of which sharding may be considered one example.

From our perspective at scalableminds adding this to v3 would be fine. We're most interested in a broad agreement. Adding this as an optional extension to v2 would be nice-to-have, but not that important. At least for zarr-python adding basic functionality (without high efficiency) is already possible via a custom store.

For specific changes needed in the spec:

I'd guess at least some additional method that allows retrieving a byte range of an object would be needed.

Agreed! Additionally I'd propose to add methods for read/write/delete on multiple chunks, so that operations within a shard can be bundled. (This already used to be part of alimanfoo/zarrita#40 before I simplified it, see e.g. here & here).

We do still need to work out the process for developing and publishing specs. In the interim I'd suggest (1) any comments on the current v3 core protocol spec draft are captured as issues on this (zarr-specs) repo, and (2) we ask if anyone would be willing to volunteer to do a round of editing on the v3 core protocol spec to accommodate sharding into the protocol, which could be submitted as a PR on this repo.

So changes would basically be a PR against the core-protocol-v3.0-dev branch? I don't have experiences in writing formal specs, but I can try to provide a first draft.

I'll join the biweekly community tomorrow (Wednesday 7PM London time, see zarr-developers/community#1), happy to discuss this in person as well.

@DennisHeimbigner
Copy link

I want to make sure I understand the purpose of sharding. Correct me if I am confused.

From the discussion on the Zarr community meet on 2/9/22, I gather the primary purpose is performance. The performance improvement appears to come from reading multiple chunks (a shard) at one time into memory.

To a first approximation, this appears to be equivalent to defining a larger chunk size.
Here, for example zarr-developers/zarr-python#876, we see the set of contained chunks are equivalent to defining a larger chunk twice the size in each dimension.
They form a simple pattern: (0,0), (0,1), (1.0), (1,1). Presumably this makes it easy to map a chunk key to the containing shared.

The one differences appear to be WRT compression. Compression still operates at the chunk level as opposed to compressing the whole shared, thus allowing for un-compressing a part of a shard without having to un-compress the whole shard.

Do I understand this correctly?

@jbms
Copy link
Contributor

jbms commented Feb 10, 2022

The primary purpose is storage efficiency and write efficiency more than read efficiency/performance. Many storage systems that users may wish to use with zarr, such as S3, GCS, and various distributed filesystems, do not efficiently handle large numbers of small "files". Even local Linux filesystems are not especially efficient for many small files. For example, on some distributed filesystems, files smaller than ~16MB may still consume 16MB of space. And the large number of metadata operations required to write so many small files may consume an excessive amount of resources from the distributed filesystem, leading to throttling.

On GCS, for example, things work moderately okay even with a large number of small files. But you do pay higher costs for all of the class A operations writing all of those small files. And if you attempt to copy, move, or delete a zarr volume represented by millions or tens of millions of files, it can take an extraordinarily long time.

Of course you can view this as purely a concern at the storage layer that should not be addressed by zarr, e.g. you could use a custom store that implements the sharding and not make that part of zarr itself. But in order to allow interoperability between software, and to be able to concisely refer to the location of a zarr array, we need to standardize the sharding format and have a way of recording along with the array the information about how to access it. We want to be able to say: "go read the zarr3 array at this path", not "go read the zarr3 array at this path, and do it using this sharding format with these sharding parameters".

Simply increasing the chunk size to avoid small files helps with storage and write efficiency, but then read efficiency may greatly suffer, as you may have to read the data at much larger granularity than desired.

@normanrz
Copy link
Member

Here is an example of one of our average datasets to illustrate the need for sharding:
The dataset has a shape of (25000, 18000, 6000) and a dtype of uint8. This makes for a total size of 2.4TB. For efficient reads and fast streaming of the data to the webKnossos browser-based interface, we need chunk sizes of max. (64, 64, 64). For this dataset that would result in 10.3M files. Considering that our webKnossos installations have many of these datasets, the number of files becomes impractical for the reasons @jbms mentioned. With sharding, we would use shards of size (2048, 2048, 2048) which would each contain (32, 32, 32) chunks. That would reduce the number of files to ~300, while maintaining a practical write granularity.

@constantinpape
Copy link

Just my 2cents: I think sharding would be a great feature to have in zarr and help significantly with big data in the cloud (and on distributed file systems), for the reasons @jbms and @normanrz laid out. So I am all for having this in the v3 spec and I think it would be a very big incentive for adopting it for many communities working with large data.

@DennisHeimbigner
Copy link

Ok, the above argument #127 (comment) convinces me. Sorry if I sidetracked things.

@jstriebel
Copy link
Member Author

The first draft of the spec is ready for review in PR #134, as discussed in the community meeting two weeks ago. Sorry for the delay 🙏

@jstriebel
Copy link
Member Author

Since ZEP 2 is on it's way, I'm going to close this issue to narrow down the open threads about sharding. However, further feedback and comments are very welcome! Either

Thanks everyone enabling the ZEP process and helping to move this forward 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants