Pydantic model for deployment config #211

Draft · wants to merge 2 commits into main
Conversation

VeckoTheGecko
Contributor

Adding a Pydantic model for ingesting and validating the deployment config.

This is a first draft so that maintainers can provide feedback and insight into how the fields are used. We can use Pydantic to provide custom validation.

Feel free to push to this branch; you have a better idea of what the configs should be. I highly recommend that we use the objects provided in the YAML specification as much as possible (e.g., for dates we can provide date objects rather than strings; see "Example 2.22 Timestamps" of the specification) to streamline ingestion.
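As a small illustration (not code from this PR; the key names are invented), PyYAML's safe loader already returns `datetime` objects for spec-compliant timestamps:

```python
# Sketch only: key names are hypothetical, not from pyglider's config.
import yaml

doc = """
deployment_start: 2024-12-01            # YAML timestamp -> datetime.date
deployment_end: 2024-12-20T14:30:00Z    # YAML timestamp -> datetime.datetime
"""

config = yaml.safe_load(doc)
print(type(config["deployment_start"]))  # <class 'datetime.date'>
print(type(config["deployment_end"]))    # <class 'datetime.datetime'>
```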

@callumrollo @jklymak

@VeckoTheGecko
Contributor Author

Pydantic has a rich system for validation, so I recommend we work within it: https://docs.pydantic.dev/latest/concepts/validators/#before-after-wrap-and-plain-validators.
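For example, a minimal sketch of a "before" validator in Pydantic v2, following the linked docs (the model and field names here are hypothetical, not from this PR):

```python
from pydantic import BaseModel, field_validator

class DeploymentConfig(BaseModel):
    glider_serial: str  # hypothetical field

    @field_validator("glider_serial", mode="before")
    @classmethod
    def coerce_serial(cls, v):
        # Runs before standard validation: accept a bare integer from the
        # YAML (e.g. `glider_serial: 70`) and normalise it to a string.
        return str(v)
```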

@jklymak
Member

jklymak commented Dec 12, 2024

So... I don't think this is in the spirit of how pyglider is meant to work. We are not in the business of enforcing the metadata that gets put into netcdf files. Indeed, we want to be as flexible as possible to allow users to write metadata to suit whatever purpose they have.

Perhaps I'm misunderstanding, but this PR seems to be enforcing the metadata standard we currently have an example of. I think pyglider can and should provide examples of metadata-compliant YAML files, but I don't think we want to be another place where compliance is checked, or to try to keep that compliance up to date with changes to other compliance checkers.

@VeckoTheGecko
Contributor Author

VeckoTheGecko commented Dec 13, 2024

I should have explained better in the original description. It was close to the end of the day when I made this, so it was a bit rushed :).

Most of the models in this PR were populated by an LLM based on the example config; they are only meant as a starting point for discussion. I thought it would be good to open this PR for the tests, the config loading, and as a place to discuss the models.

We are not in the business of enforcing the metadata that gets put into netcdf files. ... Indeed, we want to be as flexible as possible to allow users to write metadata to suit whatever purpose they have.

I 100% agree.

The motivation for this PR isn't to validate the metadata, but rather to document and act as a single source of truth for defining the deployment config schema (and data types), so that users and developers can look at one file and see the following (a sketch illustrating these follows the list):

  1. which parts of the schema are free form key: value pairs
  2. which parts have optional keys
  3. which parts have expected keys
  4. which parts have expected keys and types for those values, and
  5. which parts have expected keys, but also values that have to meet certain criteria (e.g., being an integer)
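To make (1)–(5) concrete, here is a hypothetical sketch; all field names are invented for illustration, not taken from the PR:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Metadata(BaseModel, extra="allow"):
    # (1) free-form key: value pairs are accepted via extra="allow"
    platform_id: str                 # (3)/(4) expected key with an expected type
    comment: Optional[str] = None    # (2) optional key
    num_profiles: int = Field(ge=0)  # (5) value must meet extra criteria
```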

By having a single source of truth, we:

  • Separate the loading and type casting from the use of the data.
  • Make the schema explicit rather than implicit, which prevents breaking changes (e.g., a variable deep in the codebase suddenly becoming required because it is referenced directly).

The aim is to give users more intuitive error messages (e.g., "Your YAML file is incorrect. Missing field ...") rather than a key error, value error, or type error at runtime. If, for example in the case of (5), a date needed to be read in for data processing purposes and is expected in the form "dd/mm/yyyy", it would be good to error out at the point of config loading instead of at runtime if that date is improperly provided (e.g., as yyyy-mm-dd or "yyyy-mm-dd").
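As a sketch of the difference (the field name is hypothetical, and the exact error text is whatever Pydantic produces):

```python
import datetime
from pydantic import BaseModel, ValidationError

class Deployment(BaseModel):
    deployment_start: datetime.date  # hypothetical field

try:
    Deployment(deployment_start="01/12/2024")  # not ISO 8601
except ValidationError as err:
    # Fails at load time with a readable message naming the offending
    # field, instead of a KeyError/TypeError later in processing.
    print(err)
```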

I'm not familiar with this project, so I need help defining the models/schema.

Hopefully that makes sense. I don't think we'd need to use the full validation capabilities of Pydantic (i.e., (5)), since I expect the only data used in actual data processing would be calibration numbers, with the rest mainly metadata, but I do think it's handy for clearly defining the schema.
