Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review of the ZEP2 spec - Sharding storage transformer #152

Closed
wants to merge 3 commits into from

Conversation

jstriebel
Copy link
Member

@jstriebel jstriebel commented Aug 4, 2022

This pull request includes the Zarr sharding extension specification as proposed in ZEP0002.

Reviews of this specification are welcome! If possible, please submit comments by using the normal GitHub code review functionality.

Technical note: This PR has been submitted against a "void" branch with no content to allow reviewing these files, even though they are partially already available on the main branch. Changes to this branch will be cherry-picked onto the main branch as well. The files in this PR reflect the status of PR #151, slightly deviating from the content on main.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great! Clear, concise, and technically sound.

I have a few very minor editorial comments and suggestions.

@akshaysubr
Copy link

This proposal is really great and would be perfect for GPU driven compression and decompression as well. We have a very similar stream format in nvCOMP and would like to align on this shard format so it is easy to get much better compression and decompression performance with zarr. The main thing that seems to be missing is checksums for checking data integrity. Is this something we can look to add to the shard spec?

@jbms
Copy link
Contributor

jbms commented Jan 25, 2023

This proposal is really great and would be perfect for GPU driven compression and decompression as well. We have a very similar stream format in nvCOMP and would like to align on this shard format so it is easy to get much better compression and decompression performance with zarr. The main thing that seems to be missing is checksums for checking data integrity. Is this something we can look to add to the shard spec?

Am I assuming correctly that you want checksums of each individual chunk, not the entire shard?

Checksums of individual chunks would, I think, be better handled by a codec that adds a checksum. Note that zarr v3 now allows a list of codecs to be specified. By doing it as a codec, t can also be used equally well independent of sharding.

@rabernat
Copy link
Contributor

Checksums of individual chunks would, I think, be better handled by a codec that adds a checksum.

Also note that numcodecs already has several checksum codecs: https://numcodecs.readthedocs.io/en/latest/checksum32.html

Fletcher32 was also recently added: zarr-developers/numcodecs#412

@akshaysubr I'm curious how you handle checksums today in nvcomp.

@akshaysubr
Copy link

Am I assuming correctly that you want checksums of each individual chunk, not the entire shard?

Yes, that's right. We mostly want checksums of each individual chunk. We also compute a checksum of checksums as a proxy for the checksum of the entire shard, but that's not as critical.

@rabernat The main reason we wanted checksums is to be able to provide proper (de)compression status errors. It's really hard to detect any errors without an actual checksum especially when we're using compression for network communication and want to be resilient to any errors. We also currently use CRC32 for our checksums but with a small change for performance.

Here is what our current stream format looks like. It would be great if we can align on the proposed shard format here so users can directly use nvcomp's high level interface to decompress a shard with all the performance optimizations hidden away in the implementation.
Screen Shot 2023-02-01 at 10 39 22 AM

We also would like users to be able to use nvcomp independent of zarr in some use cases like network communication where the data never goes to disk. To be able to support that use case, we would need to have some additional shard level metadata. One potential solution to that issue might be to include custom data fields between the end of the chunks and the start of the offset table. Something like this:

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| ------------- optional custom data fields ------------------ |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |

Would that work? It will still allow for adding chunks later, just would need to move both the offset table and the custom data fields to the end. Any zarr implementation can be completely oblivious to this since you read the offset table based on the end of the file and number of chunks and you can read each chunk based on the offset and never need to look at the custom data fields.

@jbms
Copy link
Contributor

jbms commented Feb 1, 2023

Am I assuming correctly that you want checksums of each individual chunk, not the entire shard?

Yes, that's right. We mostly want checksums of each individual chunk. We also compute a checksum of checksums as a proxy for the checksum of the entire shard, but that's not as critical.

@rabernat The main reason we wanted checksums is to be able to provide proper (de)compression status errors. It's really hard to detect any errors without an actual checksum especially when we're using compression for network communication and want to be resilient to any errors. We also currently use CRC32 for our checksums but with a small change for performance.

Here is what our current stream format looks like. It would be great if we can align on the proposed shard format here so users can directly use nvcomp's high level interface to decompress a shard with all the performance optimizations hidden away in the implementation. Screen Shot 2023-02-01 at 10 39 22 AM

We also would like users to be able to use nvcomp independent of zarr in some use cases like network communication where the data never goes to disk. To be able to support that use case, we would need to have some additional shard level metadata. One potential solution to that issue might be to include custom data fields between the end of the chunks and the start of the offset table. Something like this:

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| ------------- optional custom data fields ------------------ |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |

Would that work? It will still allow for adding chunks later, just would need to move both the offset table and the custom data fields to the end. Any zarr implementation can be completely oblivious to this since you read the offset table based on the end of the file and number of chunks and you can read each chunk based on the offset and never need to look at the custom data fields.

Do you have a more precise description of this stream data format?

From a quick look at the nvCOMP website I see that you have an existing batch API that would basically let users use any container format they like. I don't know exactly how someone might want to integrate support for NVIDIA GPU compression with a zarr implementation, but in general it seems likely that they would prefer to use something like your existing batch API rather than a higher level API that has knowledge of the container format, unless for some reason having knowledge of the container format in nvCOMP directly provides a performance advantage.

In addition, they might for example wish to read only part of a shard and still decompress it on an NVIDIA GPU. That would mean retrieving the headers, and then a subset of the chunks within it (which would mean it would no longer be a single contiguous block in memory in the original container format) and then decoding just those chunks. That would seem to be relatively easy with your existing batch API but challenging with a higher-level container API.

As far as checksum storage,I think it would more naturally fit with the decoupling in zarr between codec and sharding to store the checksums (both pre and post compression) adjacent to the compressed data. That would allow you to represent them in the zarr metadata as something like:

{
  "codecs": [{"id": "crc32"}, {"id": "gzip"}, {"id": "crc32"}]
}

What do you see as the advantages to storing the checksums all together separately from the data?

@rabernat
Copy link
Contributor

rabernat commented Feb 1, 2023

We also currently use CRC32 for our checksums but with a small change for performance.

Zarr supports CRC32 checksums via numcodecs

from numcodecs import CRC32
import numpy as np

crc32_codec = CRC32()

data = np.array([1, 2, 3])
encoded = crc32_codec.encode(data)
decoded = crc32_codec.decode(encoded)

np.testing.assert_equal(data, decoded.view(data.dtype))

# modify encoded data and show that the decoding fails
encoded[-1] = 1
crc32_codec.decode(encoded)  # -> RuntimeError:

The checksum is prepended to the chunk data, and those bytes are passed to the next codec in the pipeline. You can see the implementation yourself here--it's very simple:

https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/checksum32.py#L16-L30

I can't tell from the information you provided whether that is compatible with your implementation. But that should be straightforward for you to check. Do you have a spec document for your stream format?

One potential solution to that issue might be to include custom data fields between the end of the chunks and the start of the offset table

This is allowed by the sharding spec. As long as the index is at the end of the file with the expected size and format, it will work. If there is extra data in the file that is not described by the index, Zarr will just ignore it.

@rabernat
Copy link
Contributor

rabernat commented Feb 1, 2023

From a quick look at the nvCOMP website I see that you have an existing batch API that would basically let users use any container format they like. I don't know exactly how someone might want to integrate support for NVIDIA GPU compression with a zarr implementation...

Jeremey, we have been working with the NVIDIA team for a while on use cases around deep learning with climate data. See this blog post for some earlier work.

This is part of a larger workflow. That workflow involves

  1. reading satellite or climate model data with Xarray
  2. performing some transformations
  3. writing to object storage in Zarr format
  4. streaming the Zarr data into GPUs for deep learning model training

This work to align on formats lies between 3 and 4. If nvCOMP and the NVIDIA toolchain can read Zarr directly (even if they are not using a Zarr API explicitly), that's a much more efficient and flexible workflow.

I hope this gives some context for why we are interested in this.

@akshaysubr
Copy link

From a quick look at the nvCOMP website I see that you have an existing batch API that would basically let users use any container format they like. I don't know exactly how someone might want to integrate support for NVIDIA GPU compression with a zarr implementation, but in general it seems likely that they would prefer to use something like your existing batch API rather than a higher level API that has knowledge of the container format, unless for some reason having knowledge of the container format in nvCOMP directly provides a performance advantage.

@jbms I agree, the ideal way to use nvCOMP would be to use the batched API to decompress only the chunks needed. This would be the end goal. There are some performance benefits to decompressing a full shard using the high level interface (HLIF), but the bigger reason to use the high level interface is that we can then have a GPU decompression prototype implemented much faster. As you might expect, designing python wrappers around the LLIF is much more challenging but we already have wrappers for the HLIF here.

Overall, these are the capabilities that would be good to support

  1. Be able to decompress any zarr shard that was compressed on the CPU but with a format that is supported by nvCOMP (this means no custom nvCOMP codec, just an LZ4 or a deflate/zlib codec)
  2. Be able to compress and store a zarr shard using nvCOMP and decompress that with the corresponding CPU codec
  3. Be able to decompress an nvCOMP compressed zarr shard using nvCOMP without needing to read the higher level header. This is strictly an nvCOMP requirement but would allow for this zarr shard format to be a standalone container format if necessary in other use cases and potentially if you want to have each shard compressed with a different codec. In some ways, I'm thinking of this similar to how xarray works with zarr in that xarray imposes certain additional constraints on the zarr store without imposing it on all other use cases for zarr.

@akshaysubr
Copy link

As far as checksum storage,I think it would more naturally fit with the decoupling in zarr between codec and sharding to store the checksums (both pre and post compression) adjacent to the compressed data. That would allow you to represent them in the zarr metadata as something like:
{
"codecs": [{"id": "crc32"}, {"id": "gzip"}, {"id": "crc32"}]
}
What do you see as the advantages to storing the checksums all together separately from the data?

Storing the checksums with the chunk data should work in theory. The main reason to have them be separate is that we can access them more easily on the host to do a comparison when the whole buffer is on the GPU. That way, we don't have to dig through a whole GPU buffer and read 4 bytes at a time, but can read all the checksums at once. This is a relatively minor implementation detail though and a workaround should be possible.

@rabernat
Copy link
Contributor

rabernat commented Feb 2, 2023

@JackKelly you might be interested in this converation.

@jstriebel
Copy link
Member Author

Hi @akshaysubr, glad that this proposal fits your use-case!

One potential solution to that issue might be to include custom data fields between the end of the chunks and the start of the offset table.

IMO that's already allowed according to the specification, but there's not a direct recommendation to add any API for writing or reading the custom data fields. If you would write such shards via your own implementation any other implementation that supports this spec should be able to read the chunk data. Would that be enough for your use-case?

As far as checksum storage,I think it would more naturally fit with the decoupling in zarr between codec and sharding to store the checksums (both pre and post compression) adjacent to the compressed data. That would allow you to represent them in the zarr metadata as something like:
{
"codecs": [{"id": "crc32"}, {"id": "gzip"}, {"id": "crc32"}]
}
What do you see as the advantages to storing the checksums all together separately from the data?

Storing the checksums with the chunk data should work in theory. The main reason to have them be separate is that we can access them more easily on the host to do a comparison when the whole buffer is on the GPU. That way, we don't have to dig through a whole GPU buffer and read 4 bytes at a time, but can read all the checksums at once. This is a relatively minor implementation detail though and a workaround should be possible.

I agree with Jeremy that adding the checksums to the chunk data would also allow other implementations to easily verify the checksums, as that fits the zarr data model at this point. Using the custom data fields would mean that only specific implementations could use this for now. However, another extension/spec might also introduce checksumming specifically for sharding (this could be built in a backwards-compatible manner on top of this spec). Since storing the information is already allowed via the current spec I would rather not add extra logic for checksumming into this ZEP to keep the scope narrow. Do you agree @akshaysubr?

@akshaysubr
Copy link

IMO that's already allowed according to the specification, but there's not a direct recommendation to add any API for writing or reading the custom data fields. If you would write such shards via your own implementation any other implementation that supports this spec should be able to read the chunk data. Would that be enough for your use-case?

@jstriebel Yes, that works for our use case. The only remaining concern from our POV would be that a shard with this spec is not a standalone container in the sense that you cannot decompress a shard without additional metadata. We can always just add the required metadata to a custom data field, but wondering if it would make sense to standardize this. I think a good addition to this spec would be to add some very lightweight shard level metadata: compression codec used, shard size and chunk size. One concrete use case for this, apart from the nvCOMP one, is that in some genomics datasets, we would like different shards to use different compressors since the structure of the data changes quite a bit. Addition of some shard level metadata would help in this case.

Without this metadata, it also becomes a bit cumbersome to get to the custom data fields since you might not know how many chunks are in the shard and so you'd have to read 8 byte words starting from the end and going back until you hit some magic number or something that indicates the end of the offset table.

Since storing the information is already allowed via the current spec I would rather not add extra logic for checksumming into this ZEP to keep the scope narrow.

Yes, this makes total sense. Having the custom data field enables adding such features explicitly down the line, so makes sense to leave it out of this ZEP.

@jstriebel
Copy link
Member Author

The only remaining concern from our POV would be that a shard with this spec is not a standalone container in the sense that you cannot decompress a shard without additional metadata. We can always just add the required metadata to a custom data field, but wondering if it would make sense to standardize this. I think a good addition to this spec would be to add some very lightweight shard level metadata: compression codec used, shard size and chunk size.

@akshaysubr Interesting! That's something zarr doesn't handle atm, and would need additional logic. I could imagine this metadata not only for shards, but potentially also chunks. To standardize this and allow other implementations to use it as well I'd propose to add a ZEP for an extension that could handle such metadata in shards and/or chunks.

Also, caterva and blosc2 might be worth a look, caterva defines a header for chunks (called blocks in caterva) and shards (called chunks), and the blosc2 chunks contain information about their codecs. It seems like caterva became a part of blosc2 recently. Related discussions are zarr-developers/zarr-python#713 and zarr-developers/numcodecs#413.

@akshaysubr
Copy link

@jstriebel I agree, we can maybe propose an extension down the line when we have an implementation to see if the per shard metadata can be standardized. For now, we can use the custom data field to store that information.

Here's what we are currently considering for how nvCOMP can use this shard format:

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| metadata | metadata size |
| nvcomp magic number |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |

This should allow any shard comprerssed using nvCOMP to be readable by any other zarr compatible reader. The only tricky thing here is that given just the shard without the overall context, we don't know how to find out how many chunks are in the shard. We can fix that by adding a 16 byte nvcomp magic number comprised of two 8 byte values offset and nbytes that are larger than the size of the buffer. That way, we can search for an invalid (offset, nbytes) starting from the end of the stream going back to discover how many chunks there are and also to find the metadata.

It would be convenient to store the number of chunk at the end of the shard after the offset table. That way, any implementation wishing to put any custom data can find it easily without having to do this backtracking search for a magic number. Something like this

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| metadata | metadata size |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |
| nchunks |
| uint64  |

Thoughts?

@jbms
Copy link
Contributor

jbms commented Feb 17, 2023

If a chunk count field is useful, then it is a pretty minimal addition, and the extra 8 bytes is a pretty small cost to pay when not needed. But I do still have some questions more generally:

  • Even if we add the chunk count field, nvcomp will still need extra metadata to specify the compression format (and perhaps other information?). Therefore, while other zarr implementations will be able to read data produced by nvcomp, nvcomp will not be able to read data produced by other zarr implementations, since it will lack the extra metadata. This seems like it may be confusing to users. Additionally, I recall that you said in the meeting on 2023-02-15 that reading with nvcomp was more the focus than writing with nvcomp; but reading will not in general be supported, only if the data was produced by nvcomp.
  • Even if nvcomp is able to get the metadata it needs from the snard file, as discussed in the meeting on 2023-02-15, it still will just decode it as a list of byte strings, because that is its data model. But that is unlikely to be useful on its own --- a consumer of a zarr shard will almost surely want to deal with it as an array, and will therefore need to know the dimensions of the per-shard chunk grid, the dimensions of each chunk, the data type, the fill value, etc. Given that this information must be known to the consumer of the zarr shard (presumably by reading the zarr metadata), the compression format and number of chunks per shard should also be readily available, and could easily be supplied to nvcomp.

I think if you describe some use cases / user stories for decoding a zarr shard without knowledge of the zarr metadata, it would help to clarify these issues.

@mkitti
Copy link
Contributor

mkitti commented May 3, 2023

It would be good to compare the chunk index to a HDF5 fixed array index, especially the data block.

https://docs.hdfgroup.org/hdf5/develop/_f_m_t3.html#FixedArray

Screenshot_20230503_115726_Chrome.jpg

The fixed array data block consists of a 14 byte header, followed by data block elements, followed by a 4 byte checksum (Jenkin's lookup3).

Each data block element consists of an address, chunk size, and filter mask. This is very similar to the chunk index here with an 8 byte address and an 8 nbytes field. There is an additional 4 byte filter mask field. The filter mask indicates whether filters (e.g. compression) have been applied or not.

Screenshot_20230503_115911_Chrome.jpg

The besides the header structures and the checksum, the main difference here is the additonal per chunk filter mask. Another potential difference is that the size field, nbytes here, can be smaller.

The size of the size, nbytes, field is determined by a 28 header as the "entry size".

Screenshot_20230503_121412_Chrome.jpg

The Jenkin's lookup3 checksum is described here:
http://www.burtleburtle.net/bob/hash/doobs.html

In summary, the main difference between the proposed chunk index is a 4 byte filter mask per chunk. Another difference is that the size of the nbytes field can be smaller than 8 bytes. Additionally, there is the continuing theme of using checksums to ensure data integrity.

@jbms
Copy link
Contributor

jbms commented May 3, 2023

Thanks for this analysis @mkitti.

  • Regarding variable size encoding of lengths (and offsets): if we use variable-length encoding, then the entire index becomes variable size, which means that we would have to first read the length of the index somewhere, which means that reading the data if the shard index is not already cached would require 3 sequential reads, rather than just 2.
  • Regarding checksums: adding a checksum of the shard index actually seems like a pretty good idea, particularly given that there doesn't seem to be an easy way to add it in separately. In contrast, checksums of the individual chunks within the shard could easily be accommodated by an extra checksum codec, which would be equally useful both with and without sharding. To me the natural choice would be adding a 4-byte crc32c checksum to the end of the shard index.
  • Regarding the filter mask: the use cases for this are a bit unclear to me. If partial reads are not needed, then this could just be handled by another separate codec, that basically adds a header next to the encoded data to indicate which other codecs have been applied.

@mkitti
Copy link
Contributor

mkitti commented May 3, 2023

  • Regarding variable size encoding of lengths (and offsets): if we use variable-length encoding, then the entire index becomes variable size, which means that we would have to first read the length of the index somewhere, which means that reading the data if the shard index is not already cached would require 3 sequential reads, rather than just 2.

I'm not sure if each "sequential read" includes parsing or just grabbing the bytes. If we considered 64-bits to be the largest possible of nbytes, we could always read enough of the footer assuming 64-bit nbytes, and then evaluate later if the index may be smaller. That said alternating 64-bit and say 32-bit integers is not as friendly to SIMD based processing as just a list of 64-bit integers.

The main advantage of using a smaller field for nbytes is that the index could be smaller in size if the chunks are not very large.

Another possible advantage of the HDF5 spec here is the ability to also declare a larger field size (128-bit or 256-bit integer sizes) if needed for both addreses and the number of bytes in a chunk. This might be a good move to "future" proof the spec. Admittedly data at this scale is difficult for me to imagine today.

partially discussed at the community meeting

@jbms
Copy link
Contributor

jbms commented May 3, 2023

I'm not sure if each "sequential read" includes parsing or just grabbing the bytes. If we considered 64-bits to be the largest possible of nbytes, we could always read enough of the footer assuming 64-bit nbytes, and then evaluate later if the index may be smaller. That said alternating 64-bit and say 32-bit integers is not as friendly to SIMD based processing as just a list of 64-bit integers.

I suppose there are lots of things to potentially optimize for. If the inner chunks are greater than 16KB then the space overhead of the current index will be less than 0.1%, and therefore it seemed unlikely to be particularly important to optimize it further. But I suppose if the chunks are very small, or many are not present, or the overall amount of data is huge and you care about large absolute savings even if they are proportionally tiny, then the size of the index becomes more of a concern.

If using a variable-length index, there are a lot of ways to potentially optimize the index size:

  • Compress with standard compression algorithm like zstd
  • Use delta encoding of offsets and/or sizes
  • Use varint encoding of offsets and/or sizes

But it is not clear that the reduced storage size would be worth the added complexity of these changes.

One advantage of the current fixed layout of the index is that a zarr implementation could potentially read just a portion of the shard index, though it would then not be possible to verify the checksum, and likely it would be better to just use nested sharding rather than partial reading of the shard index.

The main advantage of using a smaller field for nbytes is that the index could be smaller in size if the chunks are not very large.

Another possible advantage of the HDF5 spec here is the ability to also declare a larger field size (128-bit or 256-bit integer sizes) if needed for both addreses and the number of bytes in a chunk. This might be a good move to "future" proof the spec. Admittedly data at this scale is difficult for me to imagine today.

Because we have a json representation of the codec metadata, it is easy to add additional attributes later to specify a different representation of the offset or size fields, so I don't think there is a need to add something now for future proofing.

In general I expect that once a zarr implementation has implemented sharding, additional supporting other shard index formats would require minimal added effort.

@jstriebel jstriebel mentioned this pull request May 6, 2023
@jbms
Copy link
Contributor

jbms commented May 9, 2023

See #237 for a PR to add a checksum to the index.

@mkitti
Copy link
Contributor

mkitti commented Jun 7, 2023

  • Regarding the filter mask: the use cases for this are a bit unclear to me. If partial reads are not needed, then this could just be handled by another separate codec, that basically adds a header next to the encoded data to indicate which other codecs have been applied.

The main application is handling codec failures while writing shards from a volatile buffer. For example, a sensor is writing data to a circular buffer which is being chunked and compressed into a Zarr shard. We have a limited time to move the data out of the circular buffer before the data is overwritten. During this process the compression codec could fail.

Possible codec failures include:

  1. Data corruption. The filtered (compressed) data cannot be reverisbly unfiltered (decompressed) correctly.
    i) An example of this is the rare Zstandard corruption bug corrected in version 1.5.5: https://github.com/facebook/zstd/releases/tag/v1.5.5
  2. Insufficient memory. There is insufficient memory to allocate buffers for the compressor.
  3. The codec becomes unavailable. For example, the codec is GPU accelerated but the GPU device has failed.
  4. The compressed data is larger than the uncompressed data. For example, see the hdf5-blosc filter:
    i) https://github.com/Blosc/hdf5-blosc/blob/9683f7d82fdc5a8010ff43acb458169a0df9c0ef/src/blosc_filter.c#L197-L201

At the moment it is unclear how to handle a failure in chunk processing within a Zarr shard. If we have a fixed order chunk indexes, this suggests that the index cannot omit a failing chunk entirely. We would also lose the primary data if we do not write the chunk to some location.

If a codec fails to apply to one chunk, is the entire shard invalid? How do we differentiate an empty chunk from a chunk that failed codec processing?

The simplest failure recovery is to write the raw chunk data in lieu of applying the codec.

I propose we allocate a single bit per chunk to indicate a codec failure and that the data represents the raw, unprocessed chunk data. This is significantly simpler than HDF5's 32 bits, one for each filter in a pipeline. The single bit simply indicates whether the chunk has been processed or not. This bit could be the highest bit of nbytes.

A simple implementation would be to regard nbytes as an int64. If nbytes is positive, then the value represents the number of bytes of the chunk processed by the given codec(s). If nbytes is negative, then the additive inverse of nbytes represents the number of bytes of the unprocessed chunk without the codec applied.

We previously proposed a value of nbytes of 2^64 - 1, or 0xffffffffffffffff to indicate an empty chunk. Interpreted as an int64 those bits would be read as -1. To keep the bits the same, a revised implementation would be to regard the number of the number of bytes as -(nbytes + 1) when nbytes < 0. Thus 0xffffffffffffffff maps to an empty chunk.

Another option is to change the sentinel value for an empty chunk to different value.

Being able to mark some chunks as unprocessed would allow progressive processing of chunks in a shard. This would allow chunks to be individually compressed over time rather than requiring all chunks in a shard to be compressed at once.

Additionally, allowing some chunks to be unprocessed could allow for performance optimization. There may be some chunks where the compression ratio is poor enough that the decompression time is not worthwhile.

The alternative as @jbms pointed out is to use a separate codec that adds a header to the chunk. For example, for each codec, we create an "Optional" version of the codec that has a one byte header which indicates if the codec was successfully applied or not.

Now that the checksum has been proposed one difference between the "raw bit" scheme and the "optional codec" is that the "raw bit" is incorporated into the checksum. The other obvious difference is that chunks would be limited to ~9 exabytes rather than ~18 exabytes.

@jbms
Copy link
Contributor

jbms commented Jun 7, 2023

@mkitti Thanks for the explanation.

To be clear, the sentinel value of 2^64-1 for both offset and length in the shard index indicates a missing chunk, not an empty (i.e. present, but zero-length) encoded chunk. An empty (zero length) chunk would be indicated by a length of 0 and an arbitrary offset, but depending on the codec that may not be valid.

In any case, it is true that just as with plain zarr when not using the sharded format, there is no way to distinguish between a missing chunk that has simply not been written, and a chunk for which an attempt to write it was made, but an error occurred and it could not be written. You would need a separate way to indicate that, unless it is expected that the array will be written densely and therefore any missing chunk indicates an error.

The intention with sharding is to improve read and write efficiency but not to otherwise change the zarr data model. It seems to me that this issue of wanting to store some chunks uncompressed is orthogonal to sharding, and therefore it would be cleaner to support it as a separate codec rather than trying to integrate it into sharding. That way it can also be used when not using the sharded format.

@mkitti
Copy link
Contributor

mkitti commented Jul 5, 2023

@ajelenak will be visiting @JaneliaSciComp on Tuesday July 11th, and I'm settting aside some time to look over this specification and discuss it.

Preliminarily, I'm writing this in for some time between 10:30 am and noon ET on Tuesday July 11th. If anyone is interested in joining that discussion remotely, please let me know via email at [email protected].

@mkitti
Copy link
Contributor

mkitti commented Jul 12, 2023

I've done a proof-of-concept test that relocates a HDF5 file Fixed Address Data Block (FADB) to the end of the file such that it can be reused as the chunk index table for a Zarr shard:
https://github.com/mkitti/hdf5_zarr_shard_demo/blob/main/hdf5_zarr_shard_demo.py

This allows a file to be simultaneously a HDF5 file and a Zarr shard according to this specification without duplicating the chunk offsets and nbyte sizes. The main matter to discuss at the end is the checksum.

HDF5 chunk sizes are limited to 4 bytes (32-bits), meaning that chunks can only be as big as 4 GB. For small chunks, HDF5 might encode the chunk sizes in a fewer number of bytes. In the general case, some byte padding may be needed.

The table below is a key for the FADB bytes at the end of the HDF5 file that I created and manipulated.

FADB version client header chunk(0,0) chunk(0,1) ...
sig offset offset nbytes mask offset nbytes mask ... checksum
uint32 uint8 uint8 uint64 uint64 uint32 uint32 uint64 uint32 uint32 ... uint32
$ python hdf5_zarr_shard_demo.py 
Data via h5py:
[[ 1  1  1 ...  4  4  4]
 [ 1  1  1 ...  4  4  4]
 [ 1  1  1 ...  4  4  4]
 ...
 [13 13 13 ... 16 16 16]
 [13 13 13 ... 16 16 16]
 [13 13 13 ... 16 16 16]]

Zarr Shard Chunk Offset Nbyte Pairs:
[(2048, 4087), (6135, 4087), (10222, 4087), (14309, 4087), (18396, 4087), (22483, 4087), (26570, 4087), (30657, 4087), (34744, 4087), (38831, 4087), (42918, 4087), (47005, 4087), (51092, 4087), (55179, 4087), (59266, 4087), (63353, 4087)]

$ hd hdf5_zarr_shard_demo.h5 | tail -n 19
00010770  46 41 44 42 00 01 cf 01  00 00 00 00 00 00 00 08  |FADB............|
00010780  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 f7 17  |................|
00010790  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 ee 27  |...............'|
000107a0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 e5 37  |...............7|
000107b0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 dc 47  |...............G|
000107c0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 d3 57  |...............W|
000107d0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 ca 67  |...............g|
000107e0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 c1 77  |...............w|
000107f0  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 b8 87  |................|
00010800  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 af 97  |................|
00010810  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 a6 a7  |................|
00010820  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 9d b7  |................|
00010830  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 94 c7  |................|
00010840  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 8b d7  |................|
00010850  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 82 e7  |................|
00010860  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 79 f7  |..............y.|
00010870  00 00 00 00 00 00 f7 0f  00 00 00 00 00 00 6e 96  |..............n.|
00010880  07 85                                             |..|
00010882
Hex Value Explanation
46 41 44 42 FADB Signature of HDF5 Fixed Address Data Block
00 0 Version 0 (currently the only FADB version)
01 1 Filtered dataset chunks ( as opposed to non-filtered dataset chunks )
cf 01 00 00 00 00 00 00 0x1cf (436) Byte offset of the HDF5 Fixed Address Header Block
00 08 00 00 00 00 00 00 0x800 (2048) Chunk (0,0) offset. The default HDF5 meta block size is 2 kilobytes
f7 0f 00 00 0xff7 (4087) Chunk (0,0) nbytes. The 4096x4096 uint8 chunk is gzip compressed
00 00 00 00 0b0...0 Chunk (0,0) filter mask. All filters were applied
f7 17 00 00 00 00 00 00 0x17f7 (6135) Chunk (0,1) offset. 6135 = 2048+4087
f7 0f 00 00 0xff7 (4087) Chunk (0,1) nbytes. The 4096x4096 uint8 chunk is gzip compressed
00 00 00 00 0b0...0 Chunk (0,1) filter mask. All filters were applied
ee 27 00 00 00 00 00 00 0x27ee (10222) Chunk (0,2) offset. 10222 = 2048+4087+4087
f7 0f 00 00 0xff7 (4087) Chunk (0,2) nbytes. The 4096x4096 uint8 chunk is gzip compressed
00 00 00 00 0b0...0 Chunk (0,2) filter mask. All filters were applied
... ... ...
79 f7 00 00 00 00 00 00 0xf779 (63353) Chunk (3,3) offset. 63353 = 2048+4087*15
f7 0f 00 00 0xff7 (4087) Chunk (3,3) nbytes. The 4096x4096 uint8 chunk is gzip compressed
00 00 00 00 0b0...0 Chunk (3,3) filter mask. All filters were applied
6e 96 07 85 0x8507966e Jenkin's lookup3 hash, via jenkins-cffi (pypi)

Interpreted by the current ZEP2 specification with added checksum, the offsets and nbyte sizes will be interpretted correctly due to little endian order when the filter mask bits are all 0 (all filters applied).

Hex Value Explanation
00 08 00 00 00 00 00 00 0x800 (2048) Chunk (0,0) offset. The default HDF5 meta block size is 2 kilobytes
f7 0f 00 00 00 00 00 00 0xff7 (4087) Chunk (0,0) nbytes. mask is included as most significant bytes
f7 17 00 00 00 00 00 00 0x17f7 (6135) Chunk (0,1) offset. 6135 = 2048+4087
f7 0f 00 00 00 00 00 00 0xff7 (4087) Chunk (0,1) nbytes. mask is included as most significant bytes
ee 27 00 00 00 00 00 00 0x27ee (10222) Chunk (0,2) offset. 10222 = 2048+4087+4087
f7 0f 00 00 00 00 00 00 0xff7 (4087) Chunk (0,2) nbytes. mask is included as most significant bytes
... ... ...
79 f7 00 00 00 00 00 00 0xf779 (63353) Chunk (3,3) offset. 63353 = 2048+4087*15
f7 0f 00 00 00 00 00 00 0xff7 (4087) Chunk (3,3) nbytes. mask is included as most significant bytes
6e 96 07 85 0x8507966e Jenkin's lookup3 hash, via jenkins-cffi (pypi)

The main outstanding matter is the four byte checksum which uses Bob Jenkin's lookup3 hash function. The implementation is widely available and packaged across many languages.

I can work on a pull request to have Jenkin's lookup3 added to numcodecs.

Would it be possible to have the checksum specified as any 32-bit checksum codec rather than just crc32 as in #237 ?

@rabernat
Copy link
Contributor

@mkitti this is super cool! I love the idea that a Zarr shard can be a valid HDF file!

The main outstanding matter is the four byte checksum which uses Bob Jenkin's lookup3 hash function.

This seems easily resolvable.

@normanrz
Copy link
Member

Would the hdf5 shard files need to have an .h5 suffix? That would also need to go into the Zarr metadata. Probably into chunk_key_encoding.

@jbms
Copy link
Contributor

jbms commented Jul 12, 2023

I don't believe the hdf5 library requires a particular extension.

As for supporting different checksums, I think it would be reasonable to say that the index is logically a uint64 array of shape chunk_shape+[2] where the last dimension indexes over offset/length, and is encoded using a separate codec chain specified by "index_codecs". The existing format would be equivalent to:

"index_codecs": [{"name":"endian", "endian":"little"},{"name":"crc32c"}]

It would be required that the index_codecs chain produce a fixed size output since the implementation needs to know how many bytes must be read from the end. This would address some other issues that have been raised, like storing the offsets and lengths as uint32 --- that could now be done with an array->array data type conversion codec.

Still, we might want to consider if this added complexity is worthwhile.

@mkitti
Copy link
Contributor

mkitti commented Jul 12, 2023

Would the hdf5 shard files need to have an .h5 suffix?

No. Most of the HDF5 files I have encountered in the wild use a different extension. For example,

  • .ims for Imaris files
  • .lsm for old Zeiss files
  • .nc for netCDF4 files
  • .nx5 for NEXUS files

The main thing needed for a valid HDF5 file is a 48 byte superblock at offset 0, 512, 1024, 2048, or some doubling of bytes thereof within the file. This primarily affects the beginning of the file and is not of direct consequence to the specification here since offsets do not need to start at 0.

@mkitti
Copy link
Contributor

mkitti commented Jul 12, 2023

@jbms brought up a good point that the HDF5 jenkin's lookup3 checksum includes the fourteen prior bytes b'FADB\x00\x01'. So we would need a specialized codec that reads those fourteen bytes.

@mkitti
Copy link
Contributor

mkitti commented Jul 12, 2023

I was looking through the HDF5 commit history and found this note from @qkoziol :

Add 'loookup3' checksum routine and switch to using it for metadata checksums - it's just as "strong" as the CRC32 and about 40% faster in general (with some compiler optimizations, it's nearly as fast as the fletcher-32 algorithm).

It appears that before introducing this change HDF5 used a mix of crc32 and fletcher32 depending on the size being hashed.

@jbms
Copy link
Contributor

jbms commented Jul 12, 2023

I'm not an expert on hashing but my understanding is that crc32c has hardware support on x86_64 and armv8 and is therefore pretty fast on those platforms. It also seems to be the default checksum choice for a lot of purposes. But separate from the question of exactly which checksums to support, it may make sense to add support for an index_codecs option, though it would help to better understand possible use cases beyond HDF5 compatibility.

@qkoziol
Copy link

qkoziol commented Jul 13, 2023

I was looking through the HDF5 commit history and found this note from @qkoziol :

Add 'loookup3' checksum routine and switch to using it for metadata checksums - it's just as "strong" as the CRC32 and about 40% faster in general (with some compiler optimizations, it's nearly as fast as the fletcher-32 algorithm).

It appears that before introducing this change HDF5 used a mix of crc32 and fletcher32 depending on the size being hashed.

True, although there were no releases of HDF5 that used those algorithms to checksum file metadata. The algorithms were added to the library during development of features for the 1.8 releases and replaced with the lookup3 algorithm before the 1.8.0 release.

@jstriebel
Copy link
Member Author

Closing this review PR in favor of a new issue for the vote about ZEP2 @normanrz is creating later, to separate the voting from this long discussion phase. The initial proposal was changed in response to many comments here.

@jstriebel jstriebel closed this Jul 27, 2023
@normanrz
Copy link
Member

The review and discussion will continue here: #254

@normanrz normanrz deleted the ZEP0002-spec branch July 27, 2023 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants