codec specification in v3 #293
Comments
I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers. A zarr implementation does not need to support every codec to be conformant, but spec'ing codecs and supporting them across more than just one implementation is essential to move forward and increase adoption. What better place to put zarr codec specs than alongside the zarr spec?
A codec does not have to start with a spec; it can start with an experimental implementation. That is basically what most of the codecs in numcodecs are. Similarly, I have multiple experimental Zarr V3 codecs implemented in zarrs that I plan to put forward once the new ZEP process has been figured out. |
I agree with this completely. My concern here is not whether we should standardize codecs; it's whether we should standardize codecs inside the Zarr specification document, or in a separate specification document.
I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.
That's a good idea, but technically your codecs cannot start with an experimental implementation. According to the text of the spec, your experimental codec is only valid when it is defined in a separate specification, and you give your codec a URI that resolves to a human-readable specification of the codec. Personally I don't think this is a reasonable requirement for experimental codecs. |
Just copying my response from the zarr-python thread here:
|
@normanrz could you elaborate on these points a bit? Do you think the spec should require or merely suggest that implementations support a fixed set of codecs? If you want this to be a requirement, how would we enforce it? Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs? What is the process for converting a "non-standard" codec to a "standard codec", or vice versa? |
Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the
I like to think that enforcement of the Zarr spec comes through validation from multiple implementations. When opening an array or group, implementations parse the metadata and therefore implicitly or explicitly validate the metadata.
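As a rough illustration of that kind of implicit validation, here is a minimal sketch of an implementation checking the codecs listed in array metadata against the ones it supports. The `SUPPORTED_CODECS` set, the `unsupported_codecs` helper, and the `my_experimental_codec` name are hypothetical; the `name` / `configuration` layout is assumed to follow how v3 array metadata writes codecs.

```python
import json

# Hypothetical: the codec names this particular implementation supports.
SUPPORTED_CODECS = {"bytes", "gzip"}


def unsupported_codecs(array_metadata: dict) -> list[str]:
    """Return the names of codecs in the metadata that are not supported."""
    return [
        codec["name"]
        for codec in array_metadata.get("codecs", [])
        if codec["name"] not in SUPPORTED_CODECS
    ]


# Illustrative, incomplete array metadata; only the codec list matters here.
metadata = json.loads("""
{
  "zarr_format": 3,
  "node_type": "array",
  "codecs": [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "my_experimental_codec", "configuration": {}}
  ]
}
""")

missing = unsupported_codecs(metadata)
if missing:
    print(f"cannot open array, unsupported codecs: {missing}")
```

An implementation that refuses to open the array at this point is, in effect, enforcing its own reading of the spec.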
"Standard" codec get a short name assigned by the Zarr spec (e.g.
I think we can use the ZEP process for that. Implementations that support non-standard codecs might need to support both names once a codec becomes standardized.
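Supporting both names could be as simple as an alias table that is consulted before codec lookup. This is only a sketch: the URI, the short name `foo`, and the `canonical_codec_name` helper are all made up for illustration.

```python
# Hypothetical alias table: pre-standardization name -> standardized short name.
CODEC_ALIASES = {
    "https://example.com/codecs/foo": "foo",
}


def canonical_codec_name(name: str) -> str:
    """Map a legacy codec name to its standardized name, if one exists."""
    return CODEC_ALIASES.get(name, name)


print(canonical_codec_name("https://example.com/codecs/foo"))
print(canonical_codec_name("gzip"))
```

Names without an alias pass through unchanged, so existing metadata that already uses the short name keeps working.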
From a theoretical pov, I can see that splitting the codec spec from Zarr might make sense. From a practical pov, I don't see how that would make anything easier or facilitate interoperability among the Zarr impls. I think it is best to keep the codec spec in the Zarr spec. |
Is the current set of codecs inside the zarr spec? I think this is actually the root of my concern. |
Given that the Zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of? |
I think they are.
I think it is unfortunate that the paragraph you cite did not get updated during the v3 spec process (a quick git blame shows that). I agree that it is inconsistent because the spec actually lists codecs. Most implementations have implemented this list of codecs. We should certainly revise this paragraph. |
I agree with this from two points of view:
So I am in favour of having a finite set of codecs included in the zarr spec that implementations must support. To come back on some of the concerns above:
I'm not sure this is true - most of (all?) the codecs currently used by
Supporting sharding is not essential for users who don't want sharded data, but it is a useful enough feature for enough people that it's worth mandating it as part of the spec, so that users who want to use it know it is guaranteed to be supported. I think the same argument holds for a list of standard codecs - I might not want to use all of them, but I want to be guaranteed that the one I do use is supported by all implementations.
Well, there's no 'enforcement mechanism' for any of the spec, but if someone wants to claim they have written an implementation, then they have to implement the whole spec. I'm not sure why codecs would be any different here? |
So it seems like most people in this conversation believe that the v3 spec should specify a set of codecs that Zarr implementations must support. This is at variance with the language of the spec today:
To make the spec document match the general opinion expressed in this issue (i.e., that the spec should list a required set of codecs), we need to make the following changes:
Do these changes seem sufficient? If so, we can start writing up a ZEP. |
Regarding the |
I will summarize a few concerns I have about the way codecs are handled in the v3 spec, and propose some changes that I think could improve this situation.
the codec problem space
We need Zarr implementations across multiple languages to agree on standard JSON serialization for different codecs. This protects users from fragmentation, e.g. a situation where we end up with multiple flavors of JSON serialization for the same popular codec. At the same time, we want to make it easy for users to experiment with and create new codecs; this enables users to get the most from Zarr.
Also, codecs are generally useful for users outside of Zarr. There are plenty of non-Zarr use cases for compressing / rearranging array data. So I think the codec standardization should support these non-Zarr use cases.
concerns with codecs in the v3 spec
zarr-python
.zarr-specs
, nobody would ever write a new codec.
Software cannot check if a URI dereferences to a human-readable document. If we want Zarr v3 hierarchies to be validated by software, we must remove this requirement.
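To make the last point concrete, here is a sketch of roughly the most a validator can do with a URI-style codec name: check that it is syntactically a URI. The `looks_like_uri` helper and the example URI are hypothetical. Even fetching the URI would only show that something is served there, not that it is a human-readable specification of the codec.

```python
from urllib.parse import urlparse


def looks_like_uri(name: str) -> bool:
    """Syntactic check only: does the name have a scheme and a network location?"""
    parsed = urlparse(name)
    return bool(parsed.scheme) and bool(parsed.netloc)


print(looks_like_uri("https://example.com/codecs/foo"))
print(looks_like_uri("gzip"))
```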
how to resolve these concerns
I don't think naming a closed set of "official codecs" in the spec is realistic. There is no enforcement mechanism, and ultimately users don't care if an implementation doesn't support a codec they don't use. That is, if an implementation doesn't support codec X, and none of the users of that implementation use codec X, then IMO this is fine.
To express this differently, I think the Zarr spec should not enumerate the features / behavior an implementation must have. The Zarr spec should just describe the Zarr format, and we leave it to implementations to choose how they implement that format.
Extending this logic, the Zarr format is actually agnostic with respect to particular codecs. So specific codecs should not appear in the Zarr spec! I actually think codecs should be defined entirely in another spec, and we refer to this spec in the Zarr spec, e.g. "codecs is a JSON array of JSON objects that implement the Numcodecs spec (link to the numcodecs spec)" (we can choose a different name for the codecs spec, but it shouldn't refer to zarr).
Recall that in Zarr v2, codecs were basically standardized by the behavior of the numcodecs Python library, which was a stand-alone library with no Zarr dependency. I think this illustrates the right relationship between codecs and the Zarr format, but we shouldn't rely on a Python library to define a standard for a cross-language concern. Zarr v3 tries to fix the latter problem by folding codec definition inside the spec itself, but as I have argued, this introduces a different set of problems. The solution is to define codecs separately, and make the Zarr spec depend on that codec spec. The codec specification can manage a registry of codecs, etc., thereby abstracting the current behavior of numcodecs in a language-agnostic way.
Another advantage of a separate spec for codecs is that this spec could be used by any project that wants to compress arrays in a standard way. There is nothing Zarr-specific about serializing GZip parameters to JSON, so let's reflect this in the structure of the specification document.
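As a small sketch of how little Zarr is involved, the following uses only the Python standard library to serialize GZip parameters to JSON and then apply them. The helper names `gzip_codec_to_json` and `encode` are hypothetical, and the `name` / `configuration` / `level` layout is assumed to mirror how v3 metadata writes the gzip codec.

```python
import gzip
import json


def gzip_codec_to_json(level: int) -> str:
    """Serialize GZip parameters as a JSON codec description."""
    return json.dumps({"name": "gzip", "configuration": {"level": level}})


def encode(data: bytes, codec_json: str) -> bytes:
    """Compress data using the parameters read back from the JSON description."""
    config = json.loads(codec_json)["configuration"]
    return gzip.compress(data, compresslevel=config["level"])


codec_json = gzip_codec_to_json(5)
raw = bytes(range(16)) * 4
compressed = encode(raw, codec_json)

# Any consumer that understands the JSON description can reverse the step.
assert gzip.decompress(compressed) == raw
```

Nothing in this round trip refers to arrays, chunks, or stores, which is the point: the JSON description could live in a codec spec that Zarr merely references.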
tl;dr: I think the list of codecs in v3 is trying to solve a problem (a language-agnostic list of codecs) that we can solve in a better way: by migrating the codec specification from Zarr v3 into its own spec.
is this too much churn in the spec
I know it sucks to hear complaints about the spec after it's been finalized. Sorry. But I want zarr v3 to be really good, and I think the way we do codecs in v3 right now is very problematic; if my concerns are valid, then we owe it to users to get this resolved as soon as possible.