
Lessons to learn from STAC's extensibility #316

Open
TomAugspurger opened this issue Oct 11, 2024 · 8 comments

@TomAugspurger

As mentioned in #309, I ran across some challenges with how the Zarr v3 spec does extensions. I think that we might be able to learn some lessons from how STAC handles extensions.


tl;dr: I think Zarr would benefit from a better extension story that removes the need for involvement from anyone other than the extension author and any tooling wishing to use that extension. JSON Schema plus a zarr_extensions field on Group and Array would get us most of the way there. The current requirements of must_understand: false and name: URL in the extension objects feel like a weaker version of this.


How STAC does extensibility

STAC is a JSON-based format for cataloging geospatial assets. https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#overview lays out how STAC allows itself to be extended, but there are a few key components:

  1. STAC uses JSON Schema to define schemas for both the core metadata and extensions.
  2. All STAC objects (Collection, Item, etc.) include a stac_version field.
  3. All STAC objects (Collection, Item) include a stac_extensions array with a list of URLs to JSON Schema definitions that can be used for validation.

Together, these are sufficient to allow extensions to extend basically any part of STAC without any involvement from the core of STAC. Tooling built around STAC coordinates through stac_extensions. For example, a validator can load the JSON Schema definitions for the core metadata (using the stac_version field) and all extensions (using the URLs in stac_extensions) and validate a document against those schemas. Libraries wishing to use some feature can check for the presence of a specific stac_extensions URL.
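That validator workflow can be sketched in a few lines of Python (stdlib only; the field names follow the STAC spec, but the core-schema URL layout and the feature check are illustrative assumptions, and the actual schema loading/validation step is elided):

```python
import json

def schema_urls(stac_object: dict) -> list[str]:
    """Collect every JSON Schema URL a validator would need to check this
    object: the core schema (derived from stac_version) plus any extension
    schemas listed in stac_extensions."""
    version = stac_object["stac_version"]
    # Core schema location inferred from the version; the exact URL layout
    # is an assumption for illustration.
    core = f"https://schemas.stacspec.org/v{version}/item-spec/json-schema/item.json"
    return [core] + stac_object.get("stac_extensions", [])

item = json.loads("""{
  "stac_version": "1.0.0",
  "stac_extensions": ["https://stac-extensions.github.io/projection/v1.1.0/schema.json"]
}""")

urls = schema_urls(item)
# A library checking for a specific feature just tests membership:
has_proj = any("projection" in u for u in urls)
```

A real validator would then fetch each URL and validate the document against every schema in the list; the membership test is all a feature-detecting library needs.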

You also get the ability to version things separately. The core metadata can be at 1.0.0 while the proj extension is at 2.0.0, without issue.

How that might apply to Zarr

Two immediate reactions to the thought of applying that to Zarr:

  1. Zarr does have JSON documents for describing the metadata of nodes in a Zarr hierarchy. We could pretty easily take the same concepts and apply them more or less directly to the Group and Array definitions (and possibly other fields within; STAC does this as well for, e.g. Assets which live inside an Item).
  2. STAC is entirely JSON-based, while much of Zarr concerns how binary blobs are stored, transformed, etc. While portions of these extension points might be configured (and validated by JSON schema) in the metadata document, much of it will lie outside.

How does this relate to what zarr has today?

I'm not sure. I was confused about some things reading https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points. The spec seems overly prescriptive about putting keys in the top level of the metadata:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

STAC / JSON Schema takes the opposite approach with its metadata documents: any extra fields are allowed and ignored by default, but schemas (core or extension) can define required fields.
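The contrast can be shown with a toy check (a hand-rolled, stdlib-only stand-in for a JSON Schema validator; the field names below are illustrative): unknown fields pass, missing required fields fail.

```python
def validate(document: dict, schema: dict) -> list[str]:
    """Minimal stand-in for a JSON Schema validator: enforce required
    fields and ignore anything the schema doesn't mention (the default
    behavior when additionalProperties is not set to false)."""
    errors = []
    for field in schema.get("required", []):
        if field not in document:
            errors.append(f"missing required field: {field}")
    return errors

# Hypothetical core schema requiring two fields.
core_schema = {"required": ["zarr_format", "node_type"]}

# An extra, extension-defined field is simply ignored by the core schema.
doc = {"zarr_format": 3, "node_type": "group", "proj:shape": [512, 512]}
ok = validate(doc, core_schema)
bad = validate({"zarr_format": 3}, core_schema)
```

The Zarr v3 text quoted above inverts this default: unknown fields are fatal unless opted out, rather than ignored unless opted in.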

Specifications for new extensions are recommended to be published in the zarr-developers/zarr-specs repository via the ZEP process. If a specification is published decentralized (e.g. for initial experimentation or due to a very specialized scope), it must use a URL in the name key of its metadata, which identifies the publishing organization or individual, and should point to the specification of the extension.

Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.

@d-v-b
Contributor

d-v-b commented Oct 11, 2024

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

Worth noting that the first and third sentences are blatantly contradictory! 🙃

Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.

💯 this sounds like a great idea. I think requiring a ZEP for every extension is a headache and the end result will be that nobody does it. I'd be happy adjusting #312 along the lines of a separate zarr-extensions repo if people generally think that's a good idea.

@rabernat
Contributor

Thanks for sharing this Tom. It has been great to have you spending time on Zarr recently and bringing a fresh perspective to long-standing discussions. FWIW, I'm on record in multiple conversations as citing STAC as a good example for Zarr to emulate.

I do think that Zarr, as an actual file format (as opposed to a catalog format), may need a somewhat more conservative attitude than STAC regarding backwards compatibility, interoperability, etc. It must be very clear to data producers, for example, how to create data that will be widely readable for a long period of time without any need to update the metadata.

However, I agree that our current approach to extensions basically doesn't work and is effectively preventing development. It's not even possible for Zarr Python to reach feature parity with Zarr V2 without multiple non-existent extensions (e.g. strings)--let alone innovating in new directions. So I am fully in favor of what is proposed here.

One concept that may be very useful for Zarr is the notion of extension maturity: https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extension-maturity. This would guide data providers on how "risky" it would be to adopt a specific extension, and could be seen as a more nuanced alternative to must_understand: true / false.

I think this concept would also make obsolete my stalled proposal for Zarr "conventions": #262.

I'm also strongly in favor of adopting JSON schema for metadata conformance validation.


What do we need to do to move this forward? I suppose we need a ZEP proposing an update to the spec to redefine how extensions work. 😵‍💫 I'd be happy to lead that effort if it would be helpful.

@TomAugspurger
Author

I suppose we need a ZEP proposing an update to the spec to redefine how extensions work

Yeah, that's the sticking point. We need some way to break the current logjam.

Thinking a bit more, I guess the addition of a zarr_extensions array is only necessary if we also intend to use JSON Schema for validation of both the core metadata and extensions. I think the main thing to figure out is how the different fields that make up the final object are versioned (and potentially validated against a schema).

Take consolidated metadata as an example: regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group. For example, with zarr_extensions:

{
  "zarr_format": 3,
  // ...
  "consolidated_metadata": {
    "must_understand": false,
    "name": ...,
    ...
  },
  "zarr_extensions": ["https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"]
}

Or without zarr_extensions, with the version of the consolidated metadata extension inlined:

{
  "zarr_format": 3,
  // ...
  "consolidated_metadata": {
    "must_understand": false,
    "version": "1.0.0",
    ...
  }
}

The advantage of zarr_extensions is a uniform way for tools to validate the contents of core and extension metadata. Whether it's worth trying to introduce something like that at this stage of Zarr v3, I'm not sure.
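Either layout leaves tooling with the same question: which version of an extension governs this document? A sketch of both lookups (the URLs and field names come from the examples above, not from any spec; the version-from-URL parsing is an assumption about how such URLs would be laid out):

```python
def consolidated_metadata_version(group: dict):
    """Return the consolidated-metadata extension version under either of
    the two layouts sketched above, or None if the extension is absent."""
    # Layout 1: the version is carried in the zarr_extensions schema URL,
    # e.g. ".../consolidated-metadata/v1.0.0/schema.json".
    for url in group.get("zarr_extensions", []):
        if "consolidated-metadata" in url:
            return url.rsplit("/", 2)[-2].lstrip("v")
    # Layout 2: the version is inlined in the extension object itself.
    ext = group.get("consolidated_metadata")
    if ext is not None:
        return ext.get("version")
    return None

inlined = {"zarr_format": 3, "consolidated_metadata": {"version": "1.0.0"}}
via_list = {
    "zarr_format": 3,
    "zarr_extensions": [
        "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"
    ],
}
```

Both layouts answer the question; only the zarr_extensions layout also hands the tool a schema to validate against.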

@MSanKeys963
Member

Thanks for sharing this, @TomAugspurger. I went through STAC's extension README, and I like how they've decoupled the extensions from the core. The ability to work on extensions without the involvement of the core specification authors or, in our case, the ZSC/ZIC could prove useful.

Going back to conversations I had with @alimanfoo in 2022, I think Alistair envisioned something similar for extensions: the community working on their extensions without restriction.

I also like how the STAC extensions webpage neatly lists the extensions. We could work on a similar repository/organisation for authors who would like to host their extensions under zarr-developers while also having the option to host their extensions outside of zarr-developers GitHub.

We worked on the ZEP process when the Zarr community needed a mechanism to solicit feedback and move forward in a structured manner. It worked well and helped us to finalise two proposals (ZEP1 and ZEP2), but if it's proving to be a roadblock for further development, then we should make changes to it.

I'm curious to hear @joshmoore and @jakirkham's thoughts.


My thoughts on moving this forward: I have a PR, zarr-developers/zeps#59, which will revise the existing ZEP process. Among other changes, my PR removes the requirement of a ZEP proposal for extensions. Please check and review. 🙏🏻

I'm also happy to write or collaborate with @rabernat on a ZEP proposal outlining the new process for extensions.

@joshmoore
Member

regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group.

I'm not particularly (at all) familiar with the design decisions of STAC so a question: what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?

Assuming embedding it under something like "extensions" is viable, it occurs to me that we could resurrect that field (which was previously in v3) by making use of must_understand recursively. The field "extensions" would make use of the extension (no quotes) mechanism itself. Further extensions (if that's too confusing, then another name like plugins, etc.) could be embedded in that object. Each would in turn have a "must_understand" field, and if ANY of those is true, then the top-level one is true as well.

Tom's example from above might look like this:

{
  "zarr_format": 3,
  "extensions": {
    "must_understand": true,
    "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json": {
      "must_understand": false,
      "name": "..."
    },
    "https://github.com/zarr-extensions/something-else/schema.json": {
      "must_understand": true
    }
  }
}

(If multiple objects of the same extension are needed, then this could be a list of dicts rather than a dict)

The benefits would be:

  • we introduce a clear place for all extensions
  • we make use of the existing v3 must_understand logic to not break
  • No namespace collisions (e.g., from two extensions which define the same name)

@TomAugspurger
Author

what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?

In STAC, stac_extensions is an array (of URLs to jsonschema definitions), not an object.

Where in the document the fields defined by an extension go (top level or under extensions) doesn't matter from the point of view of json schema: you just need to ensure that the definition matches the usage.

Requiring that extensions place their additional fields under extensions only helps with namespace collisions between an extension's field and the core spec (including future versions of the spec). It doesn't help with collisions between extensions, at least not at the json schema level. You could require by convention that all extensions use a namespace, but that's just a convention.

@jbms
Contributor

jbms commented Nov 14, 2024

I agree that a separate extensions object doesn't necessarily help --- I argued against that previously because I don't see a strong benefit in distinguishing between what was in the first version of the core spec and what is added in subsequent versions.

I do think it is valuable to avoid name collisions --- but I think we can accomplish that by using suitable unambiguous names in the top level equally as well as using such names within a nested extensions object.

If the goal is to define and implement extensions without any central review, then to avoid collisions we should use a naming scheme for top-level metadata fields added by extensions that does not rely on central review. The simplest solution is to use a domain name / URL prefix under the control of the extension author. For example, you could use:

{
  "zarr_format": 3,
  "https://github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
    ...
  }
}

or

{
  "zarr_format": 3,
  "github.com/TomAugspurger/consolidated-metadata": {
    "must_understand": false,
    ...
  }
}

Using https://github.com/zarr-extensions/... would imply at least the approval of whoever is managing that github organization. Maybe the barrier for that could be extremely low, e.g. first come, first serve. But it is probably simpler to avoid even that level of central review for extensions intended not to be centrally reviewed.

@TomAugspurger
Author

FWIW, name collisions haven't been a problem in STAC. The convention to include a prefix in your newly defined keys (proj:shape, for the shape field defined by the projection extension) is widely followed.
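Part of why the convention works is that it's mechanically checkable. A sketch (the proj: prefix is real STAC; the helper function and core field list are hypothetical):

```python
def unprefixed_extension_keys(properties: dict, core_fields: set) -> list:
    """Flag keys that are neither core fields nor namespaced with an
    'ext:'-style prefix, per the STAC naming convention."""
    return [
        key
        for key in properties
        if key not in core_fields and ":" not in key
    ]

# Hypothetical core field list for illustration.
core = {"datetime", "title"}
props = {
    "datetime": "2024-10-11T00:00:00Z",
    "proj:shape": [512, 512],  # namespaced, fine
    "shape": [512, 512],       # no namespace: a future collision risk
}
flagged = unprefixed_extension_keys(props, core)
```

A linter running this check in CI is enough to keep the convention honest without any central review.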
