Review of the ZEP2 spec - Sharding storage transformer #152
Conversation
This is really great! Clear, concise, and technically sound.
I have a few very minor editorial comments and suggestions.
Co-authored-by: Ryan Abernathey <[email protected]>
This proposal is really great and would be perfect for GPU-driven compression and decompression as well. We have a very similar stream format in nvCOMP and would like to align on this shard format so it is easy to get much better compression and decompression performance with zarr. The main thing that seems to be missing is checksums for checking data integrity. Is this something we can look to add to the shard spec? |
Am I assuming correctly that you want checksums of each individual chunk, not the entire shard? Checksums of individual chunks would, I think, be better handled by a codec that adds a checksum. Note that zarr v3 now allows a list of codecs to be specified. By doing it as a codec, it can also be used equally well independently of sharding. |
Also note that numcodecs already has several checksum codecs: https://numcodecs.readthedocs.io/en/latest/checksum32.html Fletcher32 was also recently added: zarr-developers/numcodecs#412 @akshaysubr I'm curious how you handle checksums today in nvcomp. |
Yes, that's right. We mostly want checksums of each individual chunk. We also compute a checksum of checksums as a proxy for the checksum of the entire shard, but that's not as critical. @rabernat The main reason we wanted checksums is to be able to provide proper (de)compression status errors. It's really hard to detect any errors without an actual checksum, especially when we're using compression for network communication and want to be resilient to any errors. We also currently use CRC32 for our checksums, but with a small change for performance. Here is what our current stream format looks like. It would be great if we could align on the proposed shard format here so users can directly use nvCOMP's high level interface to decompress a shard, with all the performance optimizations hidden away in the implementation. We would also like users to be able to use nvCOMP independently of zarr in some use cases, like network communication where the data never goes to disk. To be able to support that use case, we would need some additional shard level metadata. One potential solution to that issue might be to include custom data fields between the end of the chunks and the start of the offset table. Something like this:
Would that work? It would still allow adding chunks later; you would just need to move both the offset table and the custom data fields to the end. Any zarr implementation can be completely oblivious to this: you read the offset table based on the end of the file and the number of chunks, you read each chunk based on its offset, and you never need to look at the custom data fields. |
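To make the "oblivious reader" point above concrete, here is a minimal sketch (an illustration, not an existing API) of reading the index purely from the end of a shard, assuming the ZEP2 layout of one little-endian uint64 (offset, nbytes) pair per chunk; anything stored between the last chunk and the index is simply never touched.

```python
import struct

def read_shard_index(shard_bytes: bytes, chunks_per_shard: int):
    """Read the (offset, nbytes) index from the end of a shard.

    Custom data placed between the last chunk and the index is never read,
    so readers that only follow the index stay oblivious to it.
    """
    index_size = chunks_per_shard * 16  # two little-endian uint64 per chunk
    index_bytes = shard_bytes[-index_size:]
    values = struct.unpack("<" + "Q" * (chunks_per_shard * 2), index_bytes)
    return [(values[2 * i], values[2 * i + 1]) for i in range(chunks_per_shard)]

def read_chunk(shard_bytes: bytes, index, chunk_number: int) -> bytes:
    offset, nbytes = index[chunk_number]
    return shard_bytes[offset : offset + nbytes]
```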
Do you have a more precise description of this stream data format? From a quick look at the nvCOMP website I see that you have an existing batch API that would basically let users use any container format they like. I don't know exactly how someone might want to integrate support for NVIDIA GPU compression with a zarr implementation, but in general it seems likely that they would prefer to use something like your existing batch API rather than a higher level API that has knowledge of the container format, unless for some reason having knowledge of the container format in nvCOMP directly provides a performance advantage. In addition, they might for example wish to read only part of a shard and still decompress it on an NVIDIA GPU. That would mean retrieving the headers and then a subset of the chunks within it (which would mean it would no longer be a single contiguous block in memory in the original container format), and then decoding just those chunks. That would seem to be relatively easy with your existing batch API but challenging with a higher-level container API. As far as checksum storage, I think it would more naturally fit with the decoupling in zarr between codec and sharding to store the checksums (both pre- and post-compression) adjacent to the compressed data. That would allow you to represent them in the zarr metadata as something like:

```json
{
  "codecs": [{"id": "crc32"}, {"id": "gzip"}, {"id": "crc32"}]
}
```

What do you see as the advantages of storing the checksums all together, separately from the data? |
Zarr supports CRC32 checksums via numcodecs:

```python
from numcodecs import CRC32
import numpy as np

crc32_codec = CRC32()
data = np.array([1, 2, 3])
encoded = crc32_codec.encode(data)
decoded = crc32_codec.decode(encoded)
np.testing.assert_equal(data, decoded.view(data.dtype))

# modify encoded data and show that the decoding fails
encoded[-1] = 1
crc32_codec.decode(encoded)  # -> RuntimeError
```

The checksum is prepended to the chunk data, and those bytes are passed to the next codec in the pipeline. You can see the implementation yourself here -- it's very simple. I can't tell from the information you provided whether that is compatible with your implementation, but that should be straightforward for you to check. Do you have a spec document for your stream format?
This is allowed by the sharding spec. As long as the index is at the end of the file with the expected size and format, it will work. If there is extra data in the file that is not described by the index, Zarr will just ignore it. |
Jeremy, we have been working with the NVIDIA team for a while on use cases around deep learning with climate data. See this blog post for some earlier work. This is part of a larger workflow. That workflow involves
This work to align on formats lies between 3 and 4. If nvCOMP and the NVIDIA toolchain can read Zarr directly (even if they are not using a Zarr API explicitly), that's a much more efficient and flexible workflow. I hope this gives some context for why we are interested in this. |
@jbms I agree, the ideal way to use nvCOMP would be to use the batched API to decompress only the chunks needed. This would be the end goal. There are some performance benefits to decompressing a full shard using the high level interface (HLIF), but the bigger reason to use the high level interface is that we can then have a GPU decompression prototype implemented much faster. As you might expect, designing python wrappers around the LLIF is much more challenging, but we already have wrappers for the HLIF here. Overall, these are the capabilities that would be good to support:
|
Storing the checksums with the chunk data should work in theory. The main reason to have them be separate is that we can access them more easily on the host to do a comparison when the whole buffer is on the GPU. That way, we don't have to dig through a whole GPU buffer and read 4 bytes at a time, but can read all the checksums at once. This is a relatively minor implementation detail though and a workaround should be possible. |
@JackKelly you might be interested in this conversation. |
Hi @akshaysubr, glad that this proposal fits your use-case!
IMO that's already allowed according to the specification, but there's no direct recommendation to add any API for writing or reading the custom data fields. If you were to write such shards via your own implementation, any other implementation that supports this spec should be able to read the chunk data. Would that be enough for your use-case?
I agree with Jeremy that adding the checksums to the chunk data would also allow other implementations to easily verify the checksums, as that fits the zarr data model at this point. Using the custom data fields would mean that only specific implementations could use this for now. However, another extension/spec might also introduce checksumming specifically for sharding (this could be built in a backwards-compatible manner on top of this spec). Since storing the information is already allowed via the current spec I would rather not add extra logic for checksumming into this ZEP to keep the scope narrow. Do you agree @akshaysubr? |
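To illustrate why such shards stay readable by any conforming implementation, here is a minimal sketch (an illustration, not part of the spec or any existing API) of assembling a shard with a custom data field tucked between the chunks and the index, assuming the ZEP2 layout of little-endian uint64 (offset, nbytes) pairs at the end of the shard.

```python
import struct

def build_shard(encoded_chunks, custom_data: bytes = b"") -> bytes:
    """Assemble a shard: chunks, then optional custom data, then the index.

    The custom data is not referenced by the index, so readers that only
    follow the index skip over it entirely.
    """
    body = bytearray()
    index = []
    for chunk in encoded_chunks:
        index.append((len(body), len(chunk)))  # (offset, nbytes)
        body += chunk
    body += custom_data  # invisible to index-driven readers
    for offset, nbytes in index:
        body += struct.pack("<QQ", offset, nbytes)
    return bytes(body)
```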
@jstriebel Yes, that works for our use case. The only remaining concern from our POV is that a shard under this spec is not a standalone container, in the sense that you cannot decompress a shard without additional metadata. We can always just add the required metadata to a custom data field, but I am wondering whether it would make sense to standardize this. I think a good addition to this spec would be some very lightweight shard level metadata: the compression codec used, the shard size, and the chunk size. One concrete use case for this, apart from the nvCOMP one, is that in some genomics datasets we would like different shards to use different compressors, since the structure of the data changes quite a bit. Adding some shard level metadata would help in this case. Without this metadata, it also becomes a bit cumbersome to get to the custom data fields: you might not know how many chunks are in the shard, so you'd have to read 8 byte words starting from the end and going backwards until you hit some magic number or something else that indicates the end of the offset table.
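For illustration only, a per-shard metadata blob stored in a custom data field might look like the following; the field names are hypothetical and not defined by ZEP2 or nvCOMP.

```python
import json

# Hypothetical per-shard metadata serialized into a custom data field
# between the last chunk and the offset table. None of these field names
# are part of the spec; they only illustrate the kind of lightweight
# information discussed above.
shard_metadata = {
    "codec": "zstd",              # compressor used for this shard's chunks
    "shard_shape": [1024, 1024],  # shard size in array elements
    "chunk_shape": [64, 64],      # inner chunk size
}
custom_data = json.dumps(shard_metadata).encode("utf-8")
```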
Yes, this makes total sense. Having the custom data field enables adding such features explicitly down the line, so it makes sense to leave it out of this ZEP. |
@akshaysubr Interesting! That's something zarr doesn't handle atm and would need additional logic. I could imagine this metadata not only for shards, but potentially also for chunks. To standardize this and allow other implementations to use it as well, I'd propose adding a ZEP for an extension that could handle such metadata in shards and/or chunks. Also, caterva and blosc2 might be worth a look: caterva defines a header for chunks (called blocks in caterva) and shards (called chunks), and the blosc2 chunks contain information about their codecs. It seems like caterva became a part of blosc2 recently. Related discussions are zarr-developers/zarr-python#713 and zarr-developers/numcodecs#413. |
@jstriebel I agree, we can maybe propose an extension down the line when we have an implementation to see if the per shard metadata can be standardized. For now, we can use the custom data field to store that information. Here's what we are currently considering for how nvCOMP can use this shard format:
This should allow any shard compressed using nvCOMP to be readable by any other zarr compatible reader. The only tricky thing here is that, given just the shard without the overall context, we don't know how to find out how many chunks are in the shard. We can fix that by adding a 16 byte nvcomp magic number comprised of two 8 byte values. It would be convenient to store the number of chunks at the end of the shard after the offset table. That way, any implementation wishing to put in any custom data can find it easily without having to do this backtracking search for a magic number. Something like this:
Thoughts? |
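A rough sketch of the footer being suggested (the trailing chunk count is the proposal under discussion, not something the current spec defines; the magic number bytes are omitted here):

```python
import struct

def write_footer(index_pairs) -> bytes:
    """Offset table followed by an 8-byte chunk count (proposed, not in the spec)."""
    footer = b"".join(struct.pack("<QQ", offset, nbytes) for offset, nbytes in index_pairs)
    return footer + struct.pack("<Q", len(index_pairs))  # trailing chunk count

def read_footer(shard_bytes: bytes):
    """Recover the index without outside knowledge of the chunk count."""
    (chunk_count,) = struct.unpack("<Q", shard_bytes[-8:])
    index_bytes = shard_bytes[-8 - 16 * chunk_count : -8]
    values = struct.unpack("<" + "QQ" * chunk_count, index_bytes)
    return [(values[2 * i], values[2 * i + 1]) for i in range(chunk_count)]
```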
If a chunk count field is useful, then it is a pretty minimal addition, and the extra 8 bytes is a pretty small cost to pay when not needed. But I do still have some questions more generally:
I think if you describe some use cases / user stories for decoding a zarr shard without knowledge of the zarr metadata, it would help to clarify these issues. |
It would be good to compare the chunk index to an HDF5 fixed array index, especially the data block. https://docs.hdfgroup.org/hdf5/develop/_f_m_t3.html#FixedArray The fixed array data block consists of a 14 byte header, followed by data block elements, followed by a 4 byte checksum (Jenkins' lookup3). Each data block element consists of an address, a chunk size, and a filter mask. This is very similar to the chunk index here, with an 8 byte address and an 8 byte nbytes field. There is an additional 4 byte filter mask field, which indicates whether filters (e.g. compression) have been applied or not. Besides the header structures and the checksum, the main difference here is the additional per chunk filter mask. Another potential difference is that the size field, nbytes here, can be smaller: its width is determined by the 28 byte fixed array header as the "entry size". The Jenkins' lookup3 checksum is described here: In summary, the main difference from the proposed chunk index is a 4 byte filter mask per chunk. Another difference is that the size of the nbytes field can be smaller than 8 bytes. Additionally, there is the continuing theme of using checksums to ensure data integrity. |
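For concreteness, a data block element as summarized above could be parsed roughly like this; the field order and the 8 byte size field are assumptions based on this summary rather than a full reading of the HDF5 spec.

```python
import struct

def parse_fadb_element(buf: bytes, size_field_nbytes: int = 8):
    """Parse one fixed array data block element: address, chunk size, filter mask.

    `size_field_nbytes` stands in for the "entry size" from the fixed array
    header; 8 bytes is assumed here to mirror the ZEP2 nbytes field.
    """
    address = struct.unpack_from("<Q", buf, 0)[0]
    nbytes = int.from_bytes(buf[8 : 8 + size_field_nbytes], "little")
    filter_mask = struct.unpack_from("<I", buf, 8 + size_field_nbytes)[0]
    return address, nbytes, filter_mask
```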
Thanks for this analysis @mkitti.
|
I'm not sure if each "sequential read" includes parsing or just grabbing the bytes. If we considered 64 bits to be the largest possible size of nbytes, we could always read enough of the footer assuming 64-bit nbytes, and then evaluate later whether the index may be smaller. That said, alternating 64-bit and, say, 32-bit integers is not as friendly to SIMD based processing as just a list of 64-bit integers. The main advantage of using a smaller field for nbytes is that the index could be smaller if the chunks are not very large. Another possible advantage of the HDF5 spec here is the ability to also declare a larger field size (128-bit or 256-bit integers) if needed for both addresses and the number of bytes in a chunk. This might be a good move to "future proof" the spec. Admittedly, data at this scale is difficult for me to imagine today. (Partially discussed at the community meeting.) |
I suppose there are lots of things to potentially optimize for. If the inner chunks are greater than 16KB then the space overhead of the current index will be less than 0.1%, and therefore it seemed unlikely to be particularly important to optimize it further. But I suppose if the chunks are very small, or many are not present, or the overall amount of data is huge and you care about large absolute savings even if they are proportionally tiny, then the size of the index becomes more of a concern. If using a variable-length index, there are a lot of ways to potentially optimize the index size:
But it is not clear that the reduced storage size would be worth the added complexity of these changes. One advantage of the current fixed layout of the index is that a zarr implementation could potentially read just a portion of the shard index, though it would then not be possible to verify the checksum, and likely it would be better to just use nested sharding rather than partial reading of the shard index.
Because we have a json representation of the codec metadata, it is easy to add additional attributes later to specify a different representation of the offset or size fields, so I don't think there is a need to add something now for future proofing. In general, I expect that once a zarr implementation has implemented sharding, additionally supporting other shard index formats would require minimal added effort. |
See #237 for a PR to add a checksum to the index. |
The main application is handling codec failures while writing shards from a volatile buffer. For example, a sensor is writing data to a circular buffer which is being chunked and compressed into a Zarr shard. We have a limited time to move the data out of the circular buffer before the data is overwritten. During this process the compression codec could fail. Possible codec failures include:
At the moment it is unclear how to handle a failure in chunk processing within a Zarr shard. If we have a fixed order of chunk indexes, this suggests that the index cannot omit a failing chunk entirely. We would also lose the primary data if we do not write the chunk to some location. If a codec fails to apply to one chunk, is the entire shard invalid? How do we differentiate an empty chunk from a chunk that failed codec processing? The simplest failure recovery is to write the raw chunk data in lieu of applying the codec. I propose we allocate a single bit per chunk to indicate a codec failure, meaning that the data represents the raw, unprocessed chunk data. This is significantly simpler than HDF5's 32 bits, one for each filter in a pipeline. The single bit simply indicates whether the chunk has been processed or not. This bit could be the highest bit of nbytes. A simple implementation would be to regard an nbytes value with that bit set as describing raw, unprocessed chunk data. We previously proposed a sentinel value for empty chunks; another option is to change the sentinel value for an empty chunk to a different value. Being able to mark some chunks as unprocessed would allow progressive processing of chunks in a shard: chunks could be individually compressed over time rather than requiring all chunks in a shard to be compressed at once. Additionally, allowing some chunks to be unprocessed could allow for performance optimization. There may be some chunks where the compression ratio is poor enough that the decompression time is not worthwhile. The alternative, as @jbms pointed out, is to use a separate codec that adds a header to the chunk. For example, for each codec, we create an "Optional" version of the codec that has a one byte header which indicates whether the codec was successfully applied or not. Now that the checksum has been proposed, one difference between the "raw bit" scheme and the "optional codec" is that the "raw bit" is incorporated into the checksum. The other obvious difference is that chunks would be limited to ~9 exabytes rather than ~18 exabytes. |
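A minimal sketch of the single-bit scheme described above; placing the flag in the high bit of nbytes is the proposal here, not something the current spec defines.

```python
RAW_BIT = 1 << 63  # high bit of the 64-bit nbytes field (proposed, not in the spec)

def pack_nbytes(nbytes: int, is_raw: bool) -> int:
    """Set the high bit when the chunk holds raw, unprocessed data."""
    return nbytes | RAW_BIT if is_raw else nbytes

def unpack_nbytes(value: int):
    """Return (nbytes, is_raw); this limits chunks to 2**63 - 1 bytes (~9 EB)."""
    return value & ~RAW_BIT, bool(value & RAW_BIT)
```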
@mkitti Thanks for the explanation. To be clear, the sentinel value in the index marks a chunk as missing, not as failed. In any case, it is true that, just as with plain zarr when not using the sharded format, there is no way to distinguish between a missing chunk that has simply not been written and a chunk for which a write was attempted but an error occurred so it could not be written. You would need a separate way to indicate that, unless it is expected that the array will be written densely and therefore any missing chunk indicates an error. The intention with sharding is to improve read and write efficiency but not to otherwise change the zarr data model. It seems to me that this issue of wanting to store some chunks uncompressed is orthogonal to sharding, and therefore it would be cleaner to support it as a separate codec rather than trying to integrate it into sharding. That way it can also be used when not using the sharded format. |
@ajelenak will be visiting @JaneliaSciComp on Tuesday July 11th, and I'm setting aside some time to look over this specification and discuss it. Preliminarily, I'm writing this in for some time between 10:30 am and noon ET on Tuesday July 11th. If anyone is interested in joining that discussion remotely, please let me know via email at [email protected]. |
I've done a proof-of-concept test that relocates an HDF5 file's Fixed Array Data Block (FADB) to the end of the file such that it can be reused as the chunk index table for a Zarr shard. This allows a file to be simultaneously an HDF5 file and a Zarr shard according to this specification, without duplicating the chunk offsets and nbyte sizes. The main matter to discuss at the end is the checksum. HDF5 chunk size fields are limited to 4 bytes (32 bits), meaning that chunks can only be as big as 4 GB. For small chunks, HDF5 might encode the chunk sizes in fewer bytes. In the general case, some byte padding may be needed. The table below is a key for the FADB bytes at the end of the HDF5 file that I created and manipulated.
Interpreted by the current ZEP2 specification with an added checksum, the offsets and nbyte sizes will be read correctly due to little endian byte order when the filter mask bits are all zero. |
The main outstanding matter is the four byte checksum, which uses Bob Jenkins' lookup3 hash function. The implementation is widely available and packaged across many languages. |
I can work on a pull request to have Jenkins' lookup3 added to numcodecs. Would it be possible to have the checksum specified as any 32-bit checksum codec, rather than just the crc32 used in #237? |
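For reference, verifying a trailing 32-bit checksum over the index bytes is simple regardless of which hash is chosen. This sketch uses zlib's CRC-32 purely as a stand-in (#237 uses a CRC-32 variant; HDF5 uses lookup3), assuming the checksum is appended to the index.

```python
import struct
import zlib

def verify_index_checksum(index_bytes: bytes) -> bool:
    """Check a 4-byte little-endian checksum appended to the index bytes.

    zlib's CRC-32 is a stand-in here; any 32-bit checksum codec would slot
    into the same trailing position.
    """
    payload = index_bytes[:-4]
    (stored,) = struct.unpack("<I", index_bytes[-4:])
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored
```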
@mkitti this is super cool! I love the idea that a Zarr shard can be a valid HDF5 file!
This seems easily resolvable. |
Would the hdf5 shard files need to have a particular file extension? |
I don't believe the hdf5 library requires a particular extension. As for supporting different checksums, I think it would be reasonable to say that the index is logically a uint64 array of a fixed shape, encoded via a codec chain specified in the metadata, e.g.:

```json
"index_codecs": [{"name": "endian", "endian": "little"}, {"name": "crc32c"}]
```

It would be required that the index_codecs chain produce a fixed size output, since the implementation needs to know how many bytes must be read from the end. This would address some other issues that have been raised, like storing the offsets and lengths as uint32 --- that could now be done with an array->array data type conversion codec. Still, we might want to consider whether this added complexity is worthwhile. |
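A sketch of how an implementation might decode the index under such a chain, using numcodecs' CRC32 as a stand-in for crc32c; note that numcodecs prepends its checksum rather than appending it, so this only illustrates the chain mechanics, not the exact byte layout.

```python
import numpy as np
from numcodecs import CRC32

def decode_index(index_bytes: bytes, chunks_per_shard: int) -> np.ndarray:
    """Apply the index_codecs chain in reverse: verify checksum, then decode endianness."""
    checked = CRC32().decode(index_bytes)        # raises RuntimeError on checksum mismatch
    index = np.frombuffer(checked, dtype="<u8")  # little-endian uint64
    return index.reshape(chunks_per_shard, 2)    # one (offset, nbytes) pair per chunk
```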
No. Most of the HDF5 files I have encountered in the wild use a different extension. For example,
The main thing needed for a valid HDF5 file is a 48 byte superblock at offset 0, 512, 1024, 2048, or some doubling of bytes thereof within the file. This primarily affects the beginning of the file and is not of direct consequence to the specification here, since chunk offsets do not need to start at 0. |
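For context, locating an HDF5 superblock follows exactly that doubling rule; a reader can probe for the 8-byte format signature like this (a sketch, not how the hdf5 library itself is implemented):

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # the 8-byte HDF5 format signature

def find_hdf5_superblock(data: bytes):
    """Return the superblock offset (0, 512, 1024, 2048, ...) or None if absent."""
    offset = 0
    while offset + len(HDF5_SIGNATURE) <= len(data):
        if data[offset : offset + len(HDF5_SIGNATURE)] == HDF5_SIGNATURE:
            return offset
        offset = 512 if offset == 0 else offset * 2
    return None
```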
@jbms brought up a good point that the HDF5 Jenkins' lookup3 checksum includes the fourteen prior bytes of the FADB header. |
I was looking through the HDF5 commit history and found this note from @qkoziol :
It appears that before introducing this change HDF5 used a mix of crc32 and fletcher32 depending on the size being hashed. |
I'm not an expert on hashing but my understanding is that crc32c has hardware support on x86_64 and armv8 and is therefore pretty fast on those platforms. It also seems to be the default checksum choice for a lot of purposes. But separate from the question of exactly which checksums to support, it may make sense to add support for an |
True, although there were no releases of HDF5 that used those algorithms to checksum file metadata. The algorithms were added to the library during development of features for the 1.8 releases and replaced with the lookup3 algorithm before the 1.8.0 release. |
Closing this review PR in favor of a new issue for the vote about ZEP2 @normanrz is creating later, to separate the voting from this long discussion phase. The initial proposal was changed in response to many comments here. |
The review and discussion will continue here: #254 |
This pull request includes the Zarr sharding extension specification as proposed in ZEP0002.
Reviews of this specification are welcome! If possible, please submit comments by using the normal GitHub code review functionality.
Technical note: This PR has been submitted against a "void" branch with no content to allow reviewing these files, even though they are partially already available on the main branch. Changes to this branch will be cherry-picked onto the main branch as well. The files in this PR reflect the status of PR #151, slightly deviating from the content on main.