Non-Zarr Plugin project (seeking scope advice/recs) #220
Replies: 2 comments 3 replies
-
> Mainly thinking about file metadata, going to continue pondering how that might interact with kerchunk

Hmm, each of those formats has different native ways of encoding their metadata? In that case they probably should have different, but related paths.

To keep those within the same plugin, you could try importing each format's library while building the router and skip any route whose optional dependency raises an `ImportError`:

```python
# plugin.py
from fastapi import APIRouter, Depends
from xpublish import Dependencies, Plugin, hookimpl


class FileMetadata(Plugin):
    ...

    @hookimpl
    def dataset_router(self, deps: Dependencies):
        router = APIRouter(prefix=self.dataset_router_prefix, tags=self.dataset_router_tags)

        try:
            # import at registration time, so the route is only added
            # when the optional netCDF4 dependency is installed
            import netCDF4  # noqa: F401

            @router.get("/netcdf")
            def netcdf_metadata(dataset=Depends(deps.dataset)):
                ...
        except ImportError:
            pass

        ...
        return router
```

**Mini plugin system**

The more flexible way would be to make your own plugins using the same entry point system that Xpublish plugins use. I've experimented with that a bit in xpublish-edr to make it possible for others to provide different output formats, but a similar thing could be done for new routes.

```python
# plugin.py
import pkg_resources

from fastapi import APIRouter
from xpublish import Dependencies, Plugin, hookimpl


class FileMetadata(Plugin):
    ...

    @hookimpl
    def dataset_router(self, deps: Dependencies):
        router = APIRouter(prefix=self.dataset_router_prefix, tags=self.dataset_router_tags)

        # discover any installed 'sub-plugins' that registered routes for this group
        for entry_point in pkg_resources.iter_entry_points("xpublish_file_metadata"):
            try:
                route_fn = entry_point.load()
                route_fn(router, deps)
            except ImportError:
                pass

        ...
        return router
```

Then in a 'sub-plugin' (`xpublish_file_metadata_netcdf`):

```python
# netcdf_metadata.py
from fastapi import APIRouter, Depends
from xpublish import Dependencies


def netcdf_routes(router: APIRouter, deps: Dependencies):
    @router.get("/netcdf")
    def metadata(dataset=Depends(deps.dataset)):
        ...
```

```toml
# pyproject.toml
[project.entry-points.xpublish_file_metadata]
netcdf = "xpublish_file_metadata_netcdf.netcdf_metadata:netcdf_routes"
```
-
So I have somewhat changed the behavior I was aiming for. My new idea is the following:

- To hide attributes, one can instantiate the plugin class and pass the instance directly to xpublish: either pass a list of attribute names to hide regardless of which file format is being read, or provide a dictionary.
- A key difference is that the file-format-specific metadata-grabbing functions are NOT routers. Rather, they are plain functions whose signature is checked against a `typing.Protocol`. The routes are defined only on the plugin, and the underlying function is called based on the identified file format.
- One can `pip install` using the "extras" syntax, or install the dev group to get all of the optional dependencies. Poetry is still the packaging tool.
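Roughly, the shape I have in mind looks like the sketch below; the names (`MetadataGrabber`, `FileMetadataPlugin`, `hide`) and the suffix-based dispatch are placeholders for illustration, not the final API.

```python
# sketch only -- names below are placeholders, not the plugin's real API
from typing import Protocol

import xarray as xr
from fastapi import APIRouter, Depends
from xpublish import Dependencies, Plugin, Rest, hookimpl


class MetadataGrabber(Protocol):
    """Signature every format-specific metadata function has to match."""

    def __call__(self, path: str, hide: list[str]) -> dict: ...


def netcdf_metadata(path: str, hide: list[str]) -> dict:
    import netCDF4  # optional dependency, imported only when a .nc file shows up

    with netCDF4.Dataset(path) as nc:
        # global attributes straight from the file, minus anything hidden
        return {k: v for k, v in nc.__dict__.items() if k not in hide}


# format detection keys off the source file's suffix
GRABBERS: dict[str, MetadataGrabber] = {".nc": netcdf_metadata}


class FileMetadataPlugin(Plugin):
    name: str = "file-metadata"
    dataset_router_prefix: str = "/file-metadata"
    hide: list[str] = []  # attribute names to hide regardless of file format

    @hookimpl
    def dataset_router(self, deps: Dependencies):
        router = APIRouter(prefix=self.dataset_router_prefix)

        @router.get("/")
        def file_metadata(dataset: xr.Dataset = Depends(deps.dataset)):
            source = dataset.encoding.get("source", "")
            suffix = "." + source.rsplit(".", 1)[-1] if "." in source else ""
            grabber = GRABBERS.get(suffix)
            if grabber is None:
                return {"detail": f"no metadata reader registered for {suffix!r}"}
            return grabber(source, self.hide)

        return router


# the configured instance is handed directly to xpublish (here with only this plugin loaded)
rest = Rest(
    {"my-dataset": xr.Dataset()},
    plugins={"file-metadata": FileMetadataPlugin(hide=["secret_attr"])},
)
```

Because the routes are defined once on the plugin, supporting another format should only mean registering another grabber function, without touching the router itself.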
-
Hi all.
I am in the process of writing a simple dataset router plugin for NetCDF source files. Essentially, it would check whether the referenced `xarray.Dataset.encoding['source']` is a `.nc` file, and if it is, it would use `netCDF4` to provide file metadata that isn't necessarily available via xarray attributes (a rough sketch is below). The user could choose to hide certain attributes for security reasons if they pleased.

This was originally going to be paired with a `kerchunk`-powered dataset provider, where paths to NetCDF files can initialize a server and JSON can be used to customize chunking schemes for whatever the use case may be. That said, after exploring kerchunk further I realized it works just as well with other compressed formats (HDF5, GRIB, GeoTIFF). Additionally, each of those formats has a Python library for reading its file metadata as well.
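For concreteness, here is a rough sketch of the check I mean; the route path, response shape, and the hidden-attribute set are placeholders, not a settled API:

```python
from fastapi import APIRouter, Depends
from xarray import Dataset
from xpublish import Dependencies

HIDDEN_ATTRS = {"institution_contact"}  # placeholder: attributes the operator wants hidden


def netcdf_file_metadata(router: APIRouter, deps: Dependencies):
    @router.get("/netcdf-metadata")
    def metadata(dataset: Dataset = Depends(deps.dataset)):
        source = dataset.encoding.get("source", "")
        if not source.endswith(".nc"):
            return {"detail": "dataset was not opened from a NetCDF file"}

        import netCDF4  # only needed once we know the source really is a .nc file

        with netCDF4.Dataset(source) as nc:
            # global attributes straight from the file, minus anything hidden;
            # str() keeps the response JSON-serializable for odd attribute types
            attrs = {k: str(nc.getncattr(k)) for k in nc.ncattrs() if k not in HIDDEN_ATTRS}
            return {
                "attrs": attrs,
                "file_format": nc.file_format,  # e.g. NETCDF4 vs NETCDF3_CLASSIC
                "disk_format": nc.disk_format,  # e.g. HDF5
            }
```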
My question is about plugin scope: what would be a desirable grouping of these capabilities? Originally I was thinking the NetCDF provider and router plugins could be included in a single `xpublish-netcdf` package, but now that seems silly, since a catalog could point to multiple file types, and having `/netcdf`, `/grib`, `/tiff`, etc. paths in the documentation would be confusing given that only one would actually work for a given dataset.

I began thinking an `xpublish-file-metadata` plugin could handle the dataset router while being agnostic to file type, and a separate `xpublish-kerchunk` package could house the file-type-agnostic dataset provider functionality. I still like this idea, but a drawback is redundant dependencies. For example, the package would need the `grib`, `hdf5`, etc. libraries installed even if the user is only dealing with a single file type.

Any thoughts on organizing these functionalities?
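(One possible way around the redundant-dependency drawback would be optional extras, so users only install the readers they need. Package, extra, and dependency names below are placeholders, not a published layout:)

```toml
# pyproject.toml of a hypothetical xpublish-file-metadata package
[project]
name = "xpublish-file-metadata"
dependencies = ["xpublish"]

[project.optional-dependencies]
netcdf = ["netCDF4"]
hdf5 = ["h5py"]
grib = ["cfgrib"]
tiff = ["rioxarray"]
# `pip install xpublish-file-metadata[netcdf]` pulls in only the NetCDF reader
```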