Lessons to learn from STAC's extensibility #316
Comments
Worth noting that the first and third sentences are blatantly contradictory! 🙃
💯 this sounds like a great idea. I think requiring a ZEP for every extension is a headache, and the end result will be that nobody does it. I'd be happy adjusting #312 along the lines of a separate […].
Thanks for sharing this Tom. It has been great to have you spending time on Zarr recently and bringing a fresh perspective to long-standing discussions. FWIW, I'm on record in multiple conversations as citing STAC as a good example for Zarr to emulate.

I do think that Zarr, as an actual file format (as opposed to a catalog format), may need a somewhat more conservative attitude than STAC regarding backwards compatibility, interoperability, etc. It must be very clear to data producers, for example, how to create data that will be widely readable for a long period of time without any need to update the metadata.

However, I agree that our current approach to extensions basically doesn't work and is effectively preventing development. It's not even possible for Zarr Python to reach feature parity with Zarr V2 without multiple non-existent extensions (e.g. strings), let alone innovating in new directions. So I am fully in favor of what is proposed here.

One concept that may be very useful for Zarr is the notion of extension maturity: https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extension-maturity. This would guide data providers on how "risky" it would be to adopt a specific extension, and could be seen as a more nuanced version of the "must understand" True / False flag. I think this concept would also make obsolete my stalled proposal for Zarr "conventions": #262.

I'm also strongly in favor of adopting JSON schema for metadata conformance validation.

What do we need to do to move this forward? I suppose we need a ZEP proposing an update to the spec to redefine how extensions work. 😵💫 I'd be happy to lead that effort if it would be helpful.
Yeah, that's the sticking point. We need some way to break the current logjam.

Thinking a bit more, I guess the addition of […]. Take consolidated metadata as an example: regardless of whether […], with `zarr_extensions` it might look like:

```json
{
  "zarr_format": 3,
  // ...
  "consolidated_metadata": {
    "must_understand": false,
    "name": ...,
    ...
  },
  "zarr_extensions": ["https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"]
}
```

Or without `zarr_extensions`:

```json
{
  "zarr_format": 3,
  // ...
  "consolidated_metadata": {
    "must_understand": false,
    "version": "1.0.0",
    ...,
  }
}
```

The advantage of `zarr_extensions` […]
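As a rough sketch of how a reader might treat the two layouts above (this is purely illustrative and not from the thread or any spec; the set of "known" core keys and the fallback logic are assumptions), the decision could look something like:

```python
def can_open(group_metadata: dict, supported_extensions: set[str]) -> bool:
    """Decide whether a reader can safely open a node, per the two layouts sketched above."""
    # Layout 1: a `zarr_extensions` list advertises every extension in use up front.
    for url in group_metadata.get("zarr_extensions", []):
        if url not in supported_extensions:
            return False  # an extension we don't implement is declared
    # Layout 2: no such list, so inspect unknown top-level objects for `must_understand`.
    known_keys = {"zarr_format", "node_type", "attributes", "zarr_extensions"}
    for key, value in group_metadata.items():
        if key in known_keys:
            continue
        if isinstance(value, dict) and value.get("must_understand", True):
            return False  # an unknown object that we are not allowed to ignore
    return True
```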
Thanks for sharing this, @TomAugspurger. I went through STAC's extension README, and I like how they've decoupled the extensions from the core. The ability to work on extensions without the involvement of the core specification authors or, in our case, the ZSC/ZIC could prove useful. Going back to conversations I had with @alimanfoo in 2022, I think Alistair envisioned something similar for extensions: the community working on their extensions without restriction.

I also like how the STAC extensions webpage neatly lists the extensions. We could work on a similar repository/organisation for authors who would like to host their extensions under zarr-developers, while also having the option to host their extensions outside of the zarr-developers GitHub.

We worked on the ZEP process when the Zarr community needed a mechanism to solicit feedback and move forward in a structured manner. It worked well and helped us to finalise two proposals (ZEP1 and ZEP2), but if it's proving to be a roadblock for further development, then we should make changes to it. I'm curious to hear @joshmoore and @jakirkham's thoughts.

My thoughts on moving this forward: I have a PR, zarr-developers/zeps#59, which will revise the existing ZEP process. Among other changes, my PR removes the requirement of a ZEP proposal for extensions. Please check and review. 🙏🏻 I'm also happy to write or collaborate with @rabernat on a ZEP proposal outlining the new process for extensions.
I'm not particularly (at all) familiar with the design decisions of STAC, so a question: what are the trade-offs of having the new JSON object (here: `consolidated_metadata`) […]?

Assuming embedding it under something like "extensions" is viable, it occurs to me that we could resurrect that field (which was previously in v3) by making use of […]. Tom's example from above might look like this:

```json
{
  "zarr_format": 3,
  "extensions": {
    "must_understand": true,
    "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json": {
      "must_understand": false,
      "name": "..."
    },
    "https://github.com/zarr-extensions/something-else/schema.json": {
      "must_understand": true
    }
  }
}
```

(If multiple objects of the same extension are needed, then this could be a list of dicts rather than a dict.)

The benefits would be: […]
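To illustrate how a reader might consume a nested "extensions" object like the one above (again only a sketch; the iteration order and the treatment of the container-level `must_understand` flag are assumptions, not settled semantics):

```python
def blocking_extensions(group_metadata: dict, supported: set[str]) -> list[str]:
    """Return extension URLs under `extensions` that a reader cannot safely ignore."""
    blocking = []
    extensions = group_metadata.get("extensions", {})
    for key, body in extensions.items():
        if key == "must_understand":
            continue  # flag on the container itself, not an extension entry
        if key in supported:
            continue
        # An unsupported extension only blocks reading if it insists on being understood.
        if isinstance(body, dict) and body.get("must_understand", True):
            blocking.append(key)
    return blocking
```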
In STAC, […]. Where in the document the fields defined by an extension go (top level or under […]) […]. Requiring that extensions place their additional fields under […].
I agree that a separate […]. I do think it is valuable to avoid name collisions --- but I think we can accomplish that by using suitably unambiguous names at the top level just as well as by using such names within a nested […].

If the goal is to define and implement extensions without any central review, then, to avoid collisions, we should use a naming scheme for any top-level metadata fields added by extensions that avoids the possibility of collisions without relying on central review. The simplest solution is to use a domain name / URL prefix under the control of the extension author. For example, you could use:

```json
{
  "zarr_format": 3,
  "https://github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
    ...
  }
}
```

or

```json
{
  "zarr_format": 3,
  "github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
    ...
  }
}
```

Using […]
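A small sketch of how a reader could pick out such domain-prefixed keys without any central registry (the list of core keys and the "looks like a URL" heuristic below are my assumptions, not part of the proposal):

```python
# Core Zarr v3 array metadata keys; anything else at the top level is a candidate extension.
CORE_KEYS = {
    "zarr_format", "node_type", "shape", "data_type", "chunk_grid",
    "chunk_key_encoding", "codecs", "fill_value", "attributes",
    "dimension_names", "storage_transformers",
}

def extension_blocks(metadata: dict) -> dict[str, dict]:
    """Collect top-level keys that look like URL/domain-prefixed extension objects."""
    found = {}
    for key, value in metadata.items():
        if key in CORE_KEYS or not isinstance(value, dict):
            continue
        # Under this proposal the key embeds a domain the author controls,
        # so checking for a dot and a slash is enough to spot one.
        if "." in key and "/" in key:
            found[key] = value
    return found
```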
FWIW, name collisions haven't been a problem in STAC. The convention to include a prefix in your newly defined keys (e.g. `proj:`) […]
As mentioned in #309, I ran across some challenges with how the Zarr v3 spec does extensions. I think that we might be able to learn some lessons from how STAC handles extensions.
tl/dr: I think Zarr would benefit from a better extension story that removes the need to have any involvement from anyone other than the extension author and any tooling wishing to use that extension. JSON schema + a `zarr_extensions` field on `Group` and `Array` would get us most of the way there. The current requirements of `must_understand: false` and `name: URL` in the extension objects feel like a weaker version of this.

**How STAC does extensibility**
STAC is a JSON-based format for cataloging geospatial assets. https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#overview lays out how STAC allows itself to be extended, but there are a few key components:

- STAC objects (`Collection`, `Item`, etc.) include a `stac_version` field.
- STAC objects (`Collection`, `Item`) include a `stac_extensions` array with a list of URLs to JSON Schema definitions that can be used for validation.

Together, these are sufficient to allow extensions to extend basically any part of STAC without any involvement from the core of STAC. Tooling built around STAC coordinates through `stac_extensions`. For example, a validator can load the JSON schema definitions for the core metadata (using the `stac_version` field) and all extensions (using the URLs in `stac_extensions`) and validate a document against those schemas. Libraries wishing to use some feature can check for the presence of a specific `stac_extensions` URL.

You also get the ability to version things separately. The core metadata can be at `1.0.0`, while the `proj` extension is at `2.0.0` without issue.
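To make that coordination mechanism concrete, here is a minimal sketch of what such a validator does, using the Python `jsonschema` and `requests` libraries; the exact core-schema URL layout is an assumption based on how versioned STAC schemas are published:

```python
import jsonschema
import requests

def validate_item(item: dict) -> None:
    """Validate a STAC item against its core schema and every declared extension schema."""
    # The core schema is selected from the declared STAC version.
    core_schema_url = (
        f"https://schemas.stacspec.org/v{item['stac_version']}"
        "/item-spec/json-schema/item.json"
    )
    # Extension schemas are simply the URLs listed in `stac_extensions`.
    for url in [core_schema_url, *item.get("stac_extensions", [])]:
        schema = requests.get(url, timeout=30).json()
        jsonschema.validate(instance=item, schema=schema)  # raises ValidationError on failure
```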
**How that might apply to Zarr**

Two immediate reactions to the thought of applying that to Zarr:

- […]
- A `zarr_extensions` field would live on the `Group` and `Array` definitions (and possibly other fields within; STAC does this as well for, e.g., `Assets`, which live inside an `Item`).

**How does this relate to what zarr has today?**

I'm not sure. I was confused about some things reading https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points. The spec seems overly prescriptive about putting keys in the top level of the metadata: […]

STAC / JSON schema takes the opposite approach to their metadata documents. Any extra fields are allowed and ignored by default, but schemas (core or extension) can define required fields.

Having a central place to advertise extensions is great. But to me, having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a `zarr-extensions` repository / organization for commonly used extensions / shared maintenance.