Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add wrappers for numcodecs in v3 #1839

Closed
wants to merge 7 commits into from
Closed

Add wrappers for numcodecs in v3 #1839

wants to merge 7 commits into from

Conversation

normanrz
Copy link
Member

@normanrz normanrz commented May 5, 2024

The Zarr v3 specification only lists a few codecs that are officially supported. However, it is desirable to expose the codecs in numcodecs for use with v3 arrays as well. This PR adds wrapper classes for numcodecs support.

The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec. Also, there is a warning that numcodecs codecs are not officially supported and will likely not work in any other Zarr implementation.

Most array-to-array ("filters") and bytes-to-bytes codecs are supported. Absent are the variable-length codecs as well as json, msgpack and pickle.

Here is an example of the persisted configuration:

{"name": "https://zarr.dev/numcodecs/fixedoffsetscale", "configuration": {"offset": 0, "scale": 51, "astype": "uint16"}}

This PR adds a fairly close coupling with the contents of numcodecs. It is required to accommodate the new codec API in zarr-python. However, this coupling is unfortunate. Ideally, this code could be moved to the numcodecs package. Due to the use of the Codec ABCs and helper functions, this would introduce a circular dependency. Not sure if it would be possible to make it a conditional module within numcodecs. Happy to hear any thoughts on this.

Use of numcodecs in v2 arrays is not affected.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@normanrz normanrz added the V3 label May 5, 2024
@normanrz normanrz self-assigned this May 5, 2024
@pep8speaks
Copy link

pep8speaks commented May 5, 2024

Hello @normanrz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-05-06 19:56:45 UTC

@d-v-b
Copy link
Contributor

d-v-b commented May 6, 2024

Thanks @normanrz, personally I think these changes are good, and thanks for working to bridge numcodecs into v3.

zarr / numcodecs relationship

As you note, there is a circular dependency issue if we wanted to put zarr v3 data structures in numcodecs. On the other hand, it would be unfortunate if we needed to keep numcodecs-specific adapter code around in the main release, given that we have complete control over numcodecs and can do whatever we need to do with the numcodecs apis. I see the following options:

  • zarr-python implements v3 adapters for numcodecs (this PR)
  • duplicate codec data structures in zarr-python and numcodecs (bad because code duplication, and it ties numcodecs to zarr v3)
  • zarr-python imports codec data structures from numcodecs (bad because it ties numcodecs to zarr v3)
  • we specify the codec data structures in a completely separate specification, reference that codec spec in the zarr v3 spec, and refactor numcodecs as an implementation of the codec spec. Then zarr-python just imports from numcodecs without any adapters.

The last option is the most work, but the cleanest result in my opinion. curious to hear if there are any other ideas.

issues with the v3 codec spec

I think this PR exposes some issues with how codecs are defined in the v3 spec.

@normanrz wrote:

The Zarr v3 specification only lists a few codecs that are officially supported.

Does the spec really only list a few official codecs? the v3 spec states (emphasis mine)

To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification. In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec. A codec specification must declare the codec identifier, and describe (or cite documents that describe) the encoding and decoding algorithms and the format of the encoded data.

If the true intention of the spec is to list a few officially supported codecs, then we must change the quoted text from the spec, because right now it explicitly says that the spec does not define a list of codecs.

Also, I'm quite confused about requiring that the codec identifier be a URI. In your example, "https://zarr.dev/numcodecs/fixedoffsetscale" does not resolve to a human readable specification of the codec, so it's not spec-compliant.

Indeed If we want zarr-python or any other implementation to be strict about the spec (and I think we do), then those libraries must check that every codec has a URI that resolves to a human-readable specification of the codec. This is simply not possible -- zarr-python cannot check that a URI resolves to a human readable document. So this is a part of the spec that an implementation cannot check, which to me is a big problem for the spec from a validation POV.

We should strive to make zarr hierarchy validation robust -- it should be possible to validate a zarr hierarchy without an internet connection. The language about URIs makes this impossible right now. And I think if the spec contains language that encourages people to make fake URIs that don't actually point to anything, then that's another sign of a correctable defect in the spec. I would be very interested in amending the spec to fix this problem.

@jeromekelleher
Copy link
Member

FWIW I've been confused about these issues also, and agree with @d-v-b

From my perspective, an awful lot of what people do is covered by Blosc, so giving it special status in the spec might be pragmatic?

@normanrz
Copy link
Member Author

normanrz commented May 6, 2024

The Zarr v3 specification only lists a few codecs that are officially supported.

Does the spec really only list a few official codecs? the v3 spec states (emphasis mine)

To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification. In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec. A codec specification must declare the codec identifier, and describe (or cite documents that describe) the encoding and decoding algorithms and the format of the encoded data.

If the true intention of the spec is to list a few officially supported codecs, then we must change the quoted text from the spec, because right now it explicitly says that the spec does not define a list of codecs.

I agree that the spec could use some clarification in that regard. @d-v-b I am not sure whether you are advocating for a closed list of codec (i.e. blosc, gzip etc.) or an entirely open set? I think a closed list that are expected to be implemented by all libraries is desirable. An open set would make validation and interoperability a lot more complicated.

Providing flexibility for experimenting with other codecs through clearly namespaced codecs seems like a good compromise, though. Whether the URI needs to point to a human-readable spec or just needs to be a unique URI is debatable. I assume you would like to see that URI resolve to a JSON schema or similar for validation purposes.

Also, I'm quite confused about requiring that the codec identifier be a URI. In your example, "https://zarr.dev/numcodecs/fixedoffsetscale" does not resolve to a human readable specification of the codec, so it's not spec-compliant.

That is correct. We could easily add a redirect at that URI that points to the numcodecs docs, though. I wanted to pick a URI prefix that is owned by the Zarr community (ie. not https://numcodecs.readthedocs.io).

Regarding the numcodecs <> zarr relationship, I am leaning towards moving the zarr.codecs.numcodecs_ module to numcodecs.zarr3. I think it could work, if the entire module is wrapped in a try/except to catch the ModuleNotFoundError when zarr is not present. That would mean that the module would only work when zarr is present in the environment. numcodecs should then also register the codecs via entry points.

@jhamman
Copy link
Member

jhamman commented May 6, 2024

Due to the use of the Codec ABCs and helper functions, this would introduce a circular dependency. Not sure if it would be possible to make it a conditional module within numcodecs. Happy to hear any thoughts on this.

Would it make sense for us to switch from ABCs to Protocols for the Codecs? I believe that would help us avoid making Zarr a dependency of Numcodecs.

@normanrz
Copy link
Member Author

normanrz commented May 6, 2024

I made a prototype what the integration within numcodecs could look like: zarr-developers/numcodecs#524
Actually not too bad, imho.

@normanrz
Copy link
Member Author

normanrz commented May 6, 2024

Due to the use of the Codec ABCs and helper functions, this would introduce a circular dependency. Not sure if it would be possible to make it a conditional module within numcodecs. Happy to hear any thoughts on this.

Would it make sense for us to switch from ABCs to Protocols for the Codecs? I believe that would help us avoid making Zarr a dependency of Numcodecs.

There is quite a bit of implementation in the ABCs which makes it harder to switch them to Protocols.

@d-v-b
Copy link
Contributor

d-v-b commented May 6, 2024

I agree that the spec could use some clarification in that regard. @d-v-b I am not sure whether you are advocating for a closed list of codec (i.e. blosc, gzip etc.) or an entirely open set? I think a closed list that are expected to be implemented by all libraries is desirable. An open set would make validation and interoperability a lot more complicated.

If the zarr v3 spec defines a closed set of codecs, then we can't develop new codecs without changing the spec. That seems very undesirable. So I think the alternative (the zarr v3 spec does not define a closed set of codecs) is our only option.

That doesn't mean we shouldn't standardize codecs, simply that the list of zarr codecs should not be part of the zarr spec. I think a list of codecs is actually very useful for implementations, especially if they are specified in a language-agnostic way. But I'm not convinced that there's value in forcing implementations to implement N different codecs in order to be "real" zarr implementations. We should just have a list of codec specifications, and zarr implementations (or non-zarr applications that just want a nice JSON serialization of compression routines) can implement the codecs that are useful for them.

@mkitti
Copy link

mkitti commented May 6, 2024

From my perspective, an awful lot of what people do is covered by Blosc, so giving it special status in the spec might be pragmatic?

A particular problem here is there is much less maintenance effort on Blosc1 compared to Blosc2. I recommend that we view the Blosc1 frame format as deprecated.

@normanrz
Copy link
Member Author

normanrz commented May 6, 2024

I agree that the spec could use some clarification in that regard. @d-v-b I am not sure whether you are advocating for a closed list of codec (i.e. blosc, gzip etc.) or an entirely open set? I think a closed list that are expected to be implemented by all libraries is desirable. An open set would make validation and interoperability a lot more complicated.

If the zarr v3 spec defines a closed set of codecs, then we can't develop new codecs without changing the spec. That seems very undesirable. So I think the alternative (the zarr v3 spec does not define a closed set of codecs) is our only option.

That doesn't mean we shouldn't standardize codecs, simply that the list of zarr codecs should not be part of the zarr spec. I think a list of codecs is actually very useful for implementations, especially if they are specified in a language-agnostic way. But I'm not convinced that there's value in forcing implementations to implement N different codecs in order to be "real" zarr implementations. We should just have a list of codec specifications, and zarr implementations (or non-zarr applications that just want a nice JSON serialization of compression routines) can implement the codecs that are useful for them.

We should probably move this discussion to zarr-specs because it is not really about the zarr-python impl.
Anyways, I think it is useful to have minimal set of codecs that we expect any zarr impl to support (e.g. bytes, transpose, blosc). Other codecs can be optional. I think the zarr specification is a actually good place to list available codec specs.

I feel quite strongly, that non-standard codecs need to be labeled as such (e.g. through URI-style naming instead of short names). Having multiple codecs (even if the encoded format is only slightly different) with the same name would be a desaster. Perhaps zarr-python should even enforce that (ie. don't allow short names for non-standard codecs).

@d-v-b
Copy link
Contributor

d-v-b commented May 7, 2024

We should probably move this discussion to zarr-specs because it is not really about the zarr-python impl.

Agreed, I will write up my concerns in a zarr-specs issue.

@d-v-b
Copy link
Contributor

d-v-b commented May 7, 2024

see zarr-developers/zarr-specs#293 for a discussion of codecs in the v3 spec

@normanrz
Copy link
Member Author

normanrz commented May 7, 2024

Circling back to the original discussion of this PR.

I made a prototype what the integration within numcodecs could look like: zarr-developers/numcodecs#524 Actually not too bad, imho.

I am actually quite pleased how the implementation on the numcodecs side looks like. I am inclined to close this PR in favor of zarr-developers/numcodecs#524. What do you all think?

@normanrz
Copy link
Member Author

normanrz commented May 8, 2024

Closing this in favor of zarr-developers/numcodecs#524

@normanrz normanrz closed this May 8, 2024
@normanrz normanrz deleted the v3-numcodecs branch May 8, 2024 18:49
@jhamman jhamman added this to the 3.0.0.alpha milestone May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants