
Add config obj replace schemas #717

Draft
wants to merge 9 commits into base: add-config-obj

Conversation

irm-codebase (Contributor) commented Nov 21, 2024

Complements PR #704 by extending pydantic validation to other sections of model .yaml files.

Summary of changes in this pull request

So far

  • Adds data_tables schema to ensure user configurations fit our setup

Reviewer checklist

  • Test(s) added to cover contribution
  • Documentation updated
  • Changelog updated
  • Coverage maintained or improved

irm-codebase (Contributor, Author) commented Nov 21, 2024

@brynpickering before continuing on this, we need to ensure we have a common vision on where / how stuff is checked, and the pros/cons of certain approaches. Particularly for places where our setup conflicts with pydantic's typing-based approach.

A key assumption, based on previous discussions, is that we are attempting to remove all schema .yaml files in favour of a pydantic approach. This would lead to a final CalliopeModel composed of several pydantic structures, each replacing one of our separate schemas (data_table_schema.yaml, math_schema.yaml...).

The final object would look like this, sort of:

class CalliopeModel(BaseModel):
    """Calliope model class."""

    model_config = {"title": "calliope model"}
    config: CalliopeConfig
    techs: dict[str, Tech]
    nodes: dict[str, Node]
    data_tables: dict[str, DataTable] | None = None
    parameters: Parameter | None = None  # this might go in Math later, here for now
    math: Math | None = None  # inserted by us using build.mode???
    overrides: None | dict[str, Self] = None

Open question 1: The role of pydantic in checking user model files

From a pure software / data validation perspective: should we strive to catch invalid cases with pydantic as much as possible? To what extent should pydantic be used to ensure that the data going into later stages is 'good'? What is the effect of going either way?

Some examples of trivial checks with pydantic (the first is sketched after the list below):

  • Ensure all techs used in nodes.techs are defined in the default or in overrides.
  • Run tests to ensure that all dimensions given in dims: sections (e.g., in techs, data_tables, etc) are present in a dimensions dictionary section or in overrides of said section.
  • Ensuring that all parameter names used in techs or nodes sections are present in our definition or in user provided parameter definitions.
  • Ensuring that parameters that go 'into' the optimisation have numeric values.
  • etc
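A minimal sketch of the first check, assuming hypothetical, heavily simplified Node/tech structures rather than the real Calliope classes:

from pydantic import BaseModel, model_validator

class Node(BaseModel):
    # hypothetical minimal node definition: only the tech references matter here
    techs: dict[str, dict] | None = None

class CalliopeModel(BaseModel):
    techs: dict[str, dict]
    nodes: dict[str, Node]

    @model_validator(mode="after")
    def check_node_techs_are_defined(self):
        # every tech referenced at a node must exist in the top-level `techs` section
        for node_name, node in self.nodes.items():
            undefined = set(node.techs or {}) - set(self.techs)
            if undefined:
                raise ValueError(f"node '{node_name}' uses undefined techs: {sorted(undefined)}")
        return self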

Open question 2: Where and when to introduce our basic parameter math and user-extended math

At the moment, a good chunk of our parameter defaults / checks is in model_def_schema.yaml, which we cram into model._data.attrs. pydantic might change this.

  • Should all these defaults be in a pydantic model?
  • How flexible should this be? Some values can be either parameters or variables depending on the mode (e.g., flow_cap).
  • Hard-defining all these parameters within the pydantic model might prove too inflexible. Should we do so? An alternative would be just having our modes insert them into the math section of our configuration, with pydantic only checking that they are numeric (i.e., specific parameter names being fields in the pydantic model vs. just keys in the dictionary that uses them). Both options are contrasted in the sketch below.
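A rough contrast of the two options (field and parameter names here are purely illustrative):

from pydantic import BaseModel

# Option 1: defaults hard-defined as pydantic fields (strong checks, but rigid across modes)
class HardDefinedParams(BaseModel):
    flow_cap_max: float = float("inf")
    lifetime: float | None = None

# Option 2: modes insert parameter defaults into the math section; pydantic only checks numeric types
class Math(BaseModel):
    parameters: dict[str, int | float] = {}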

Open question 3: How to support user-provided 'passthrough' settings without weakening validation

This stems from issue #709. Since pydantic is type based, how we group items in the dictionary matters quite a bit for how strong or weak our validation becomes. Our current flat structure has advantages when it comes to user simplicity, but at a possibly heavy cost of validation strength.

As an example, our current techs definition:

supply_gas:
    name: "Natural gas import"
    color: "#C98AAD"
    carrier_out: gas
    source_use_max: .inf
    flow_cap_max: 2000
    lifetime: 25
    cost_flow_cap:
      data: 1
      index: monetary
      dims: costs
    cost_flow_in:
      data: 0.025 # 2.5p/kWh gas price #ppt
      index: monetary
      dims: costs

This would require a Tech model structured like the one below. The big issue here is that you need to allow extras at the same level as our hard-defined attributes! I worry this will get messy fast. A less flat structure, which separates math parameters, code-bound values (e.g., carrier_out), and user 'extras', would be stronger, but less user friendly if done poorly.

class Tech(BaseModel):
    model_config = {
        "extra": "allow"
    }
    name: str
    color: str | None = None
    carrier_out: str
    flow_cap_max: PositiveReal  = float('inf')  # Assume we define PositiveReal some place else
    ...
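
As the comment above notes, PositiveReal is assumed to be defined elsewhere; a minimal sketch of such a type (using pydantic v2's Annotated constraints) could be:

from typing import Annotated

from pydantic import Field

# strictly positive float; pydantic accepts infinity for floats by default,
# which matches YAML settings such as `source_use_max: .inf`
PositiveReal = Annotated[float, Field(gt=0)]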

@irm-codebase irm-codebase marked this pull request as draft November 21, 2024 11:55
@irm-codebase irm-codebase changed the base branch from main to add-config-obj November 21, 2024 13:58
irm-codebase (Contributor, Author) commented Nov 21, 2024

We can also keep all of the elements in CalliopeModel in separate places. It's only an illustration.

brynpickering (Member) commented:

Thanks @irm-codebase - this is what I had in mind w.r.t. a move to pydantic! One thing I think it will offer us is the ability to remove AttrDict entirely 👀... We still need to be able to load our flavour of YAML, but that could be something we do with less maintenance overhead.

RE techs, we have a few weird parameters, but keep in mind that none of them are really "hard" requirements in terms of YAML definitions. You can define all that info in data tables and then have almost nothing defined in YAML. At that point pydantic's magic is useless to us anyway, hence why I wouldn't put too much focus on maximising the value of pydantic; we'll only ever apply it to a subset of the data when data tables are used. Data validation should wait a bit longer and be applied once YAML and data table info has been merged (i.e. on the xarray arrays). This means that no, parameter settings should not be in the pydantic model (as they shouldn't be in a YAML schema). Instead, they should be in a YAML definition (e.g. as given in #712) whose structure is described by a pydantic model.

Still, we have to define some things in pydantic under tech defs since they are weird parameter structures that we let through (namely, carrier_[in/out/export]) to make the user journey easier. Otherwise, we say that a tech setting/option/parameter can be any single value or an indexed parameter. These kinds of things can be part of the pydantic model and have some fancier cross-validation done if possible, at least in terms of the dimensions that have been defined?

RE the config options we include. We should consider how best to handle overrides and templates - should they be validated separately or should the pydantic model only be generated based on the resolved dict (all overrides and template data applied)?

irm-codebase (Contributor, Author) commented Nov 22, 2024

@brynpickering some replies (it's a bit difficult to follow which section you are referring to, but I'll try my best here).

RE techs, we have a few weird parameters, but keep in mind that none of them are really "hard" requirements in terms of YAML definitions. You can define all that info in data tables and then have almost nothing defined in YAML. At that point pydantic's magic is useless to us anyway, hence why I wouldn't put too much focus on maximising the value of pydantic; we'll only ever apply it to a subset of the data when data tables are used.

Understood. One important caveat of this is that we cannot easily offer some features that users might want (e.g., setting max/min ranges for a numeric parameter) without having to write this code ourselves. This is fine as long as it is understood.

If we ever wish to, we could re-arrange our parsing to happen after templates and data tables are loaded into the dictionary (file -> our yaml loading -> complex pydantic -> backend). In this setup we would be able to rely on pydantic more.

For now, I will assume we'll go with the simplistic pydantic model.

Data validation should wait a bit longer and be applied once YAML and data table info has been merged (i.e. on the xarray arrays). This means that no, parameter settings should not be in the pydantic model (as they shouldn't be in a YAML schema). Instead, they should be in a YAML definition (e.g. as given in #712) whose structure is described by a pydantic model.

I think we pretty much agree on how this should be done, then.

Still, we have to define some things in pydantic under tech defs since they are weird parameter structures that we let through (namely, carrier_[in/out/export]) to make the user journey easier. Otherwise, we say that a tech setting/option/parameter can be any single value or an indexed parameter. These kinds of things can be part of the pydantic model and have some fancier cross-validation done if possible, at least in terms of the dimensions that have been defined?

This is the hard one. To answer the question(?) at the end: yes, but it depends. This comes back to my point on our current flat structure being troublesome. Checking certain things gets really hard when you have them all in a "salad" dictionary, even more so when it can be extended by users.

As far as I can tell, this is the problem:

class CalliopeModel(BaseModel):
    """Calliope model class."""

    model_config = {"title": "calliope model"}
    config: CalliopeConfig
    techs: dict[str, Tech]
    ...

class TechFlat(BaseModel):
    model_config = {
        "extra": "allow"
    }
    # ANY other value will pass without checks! If we do not define it, we do not check it!
    # User 'extras' are a superset of what we could check, meaning that we cannot check non-specified items
    # We essentially do not check the **structure** of anything beyond hard-coded stuff (e.g., name, carrier_in)
    name: str
    carrier_in: str | None = None
    carrier_out: str | None = None
    carrier_export: str | None = None

class TechDivided(BaseModel):
    model_config = {
        "extra": "forbid"
    }
    # May have anything as long as it's single values, to keep our IO simple
    extra: dict[str, float | str] | None = None
    # Structure is checked!
    params: dict[str, float]
    # Structure is checked!
    lookups: dict[str, str]
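
To make the trade-off concrete, a hypothetical comparison using the two models above:

# TechFlat: a typo'd key is silently kept as a user 'extra'
TechFlat(name="supply_gas", carrier_out="gas", flow_cap_maxx=2000)

# TechDivided: extra='forbid' rejects unknown top-level keys, so the same typo
# raises a ValidationError instead of slipping through
TechDivided(params={"flow_cap_max": 2000}, lookups={}, flow_cap_maxx=2000)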

I am struggling a bit with how to properly set up a flat structure with passthrough values that does not end up feeling pointless due to the superset nature of user extras. Basically, any value can be there, even values that would not be used by some combination of our settings, with the extra ambiguity of not knowing whether a user 'extra' is accidentally named the same as something important.

RE the config options we include. We should consider how best to handle overrides and templates - should they be validated separately or should the pydantic model only be generated based on the resolved dict (all overrides and template data applied)?

The answer depends on what we want to do.

  • If we want to just validate the file, with no loss of information, then both overrides and templates need to be accounted for in pydantic. In this approach, we never mess with the resulting dictionary before pydantic functions are done.
  • If we want to validate the structure before it enters calliope, then we resolve these before running pydantic.

Both have pros / cons. I think the second is the easiest in our current setup.

I'd say second... solved by our yaml reading function?

brynpickering (Member) commented:

Understood. One important caveat of this is that we cannot easily offer some features that users might want (e.g., setting max/min ranges for a numeric parameter) without having to write this code ourselves. This is fine as long as it is understood.

That would be dealt with by the parameter config being introduced in #712. We can have min/max or a more general way to define validation rules that will act on the xarray dataset (using the format defined in https://github.com/calliope-project/calliope/blob/959fea7096ce47dfaf6886be1e1fa9445b92bc69/src/calliope/config/model_data_checks.yaml). This is what I mean by data validation that is beyond the scope of pydantic: it is no longer working on a dictionary of data, but on arrays in an xarray dataset.

This comes back to my point on our current flat structure being troublesome.

I don't think it's the flat structure that causes the problem so much as allowing for user-defined keys. One option would be to consider dynamic fields. For instance, you can set type annotations for extra fields such that they have to at least follow a specific structure (as we define with our patternproperties currently). So the flat structure for a tech could be:

class TechFlat(BaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[RegexStrWithExclusions, IndexedParam | bool | int | float | str] = Field(union_mode='left_to_right')
    base_tech: Literal[...]
    carrier_in: str | list[str] | None = None
    carrier_out: str | list[str] | None = None
    carrier_export: str | list[str] | None = None
    # + computed fields for to/from based on `base_tech`?
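
(RegexStrWithExclusions is not defined in this thread; one way to sketch it, mirroring what patternproperties does in the current schemas, would be a constrained string key. The pattern below is an assumption, and excluding reserved names such as base_tech would still need a custom validator on top.)

from typing import Annotated

from pydantic import StringConstraints

RegexStrWithExclusions = Annotated[str, StringConstraints(pattern=r"^[a-z][a-z0-9_]*$")]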

brynpickering (Member) commented Nov 28, 2024

To update on the idea of using __pydantic_extra__, you could do this:

from typing import Literal, Self

from pydantic import BaseModel, Field

DATA_T = bool | int | float | str

class ConfigBaseModel(BaseModel):
    """A base class for creating pydantic models for Calliope configuration options."""

    _kwargs: dict = {}

    def update(self, update_dict: dict, deep: bool = False) -> Self:
        """Return a new iteration of the model with updated fields.

        Updates are validated and stored in the parent class in the `_kwargs` key.

        Args:
            update_dict (dict): Dictionary with which to update the base model.
            deep (bool, optional): Set to True to make a deep copy of the model. Defaults to False.

        Returns:
            BaseModel: New model instance.
        """
        new_dict: dict = {}
        # Iterate through dict to be updated and convert any sub-dicts into their respective pydantic model objects
        for key, val in update_dict.items():
            key_class = getattr(self, key)
            if isinstance(key_class, ConfigBaseModel):
                new_dict[key] = key_class._update(val)
            else:
                new_dict[key] = val
        updated = self.model_copy(update=new_dict, deep=deep)
        updated = self.model_validate(updated.model_dump())
        self._kwargs = update_dict
        return updated

    def _update(self, val):
        updated = self.update(val)
        self._kwargs = val
        return updated

class Param(ConfigBaseModel):
    """Uniform dictionairy for parameters."""
    data: DATA_T | list[bool | int | float | str]
    index: list[list[str]] = [[]]
    dims: list[str] = []

    def _update(self, val):
        if isinstance(val, dict):
            updated = self.update(val)
            self._kwargs = val
            return updated
        else:
            return val
        

class Tech(ConfigBaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[str, Param | DATA_T] = Field(union_mode='left_to_right')
    base_tech: Literal["supply"]  # etc. 
    carrier_in: str | list[str] | None = None
    carrier_out: str | list[str] | None = None
    carrier_export: str | list[str] | None = None


class Techs(ConfigBaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[str, Tech]


class Model(ConfigBaseModel):
    techs: Techs

Which would work to manage a dict like:

{"techs": {"foo": {"bar": {"data": 1}, "baz": "foobar", "base_tech": "supply", "carrier_in": ["elec", "heat"]}}}

And would give you appropriate dot notation afterwards (e.g. techs.foo.bar.carrier_in)
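
A quick usage sketch with the dict above (hypothetical, assuming all the models are defined as shown):

model = Model(**{"techs": {"foo": {"bar": {"data": 1}, "baz": "foobar", "base_tech": "supply", "carrier_in": ["elec", "heat"]}}})
model.techs.foo.carrier_in  # ['elec', 'heat']
model.techs.foo.bar.data    # 1, with `bar` validated into a Param instance
model.techs.foo.baz         # 'foobar', kept as a plain passthrough value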

brynpickering (Member) commented Nov 28, 2024

@irm-codebase I've edited my previous comment based on playing around with the update method to handle the fact that an update to the data might add/remove a Param class from a technology (I left in the _kwargs stuff but realise you probably have an updated approach to this).

irm-codebase (Contributor, Author) commented:

@brynpickering thanks for the update!
It is similar to the design I had in mind, but I was not aware of __pydantic_extra__...

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

brynpickering (Member) commented:

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

I found that it just generated subdicts in my updated pydantic model, not nested pydantic models. If there's a way to do it without the extra lines I've added, then that's great!

irm-codebase (Contributor, Author) commented Dec 2, 2024

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

I found that it just generated subdicts in my updated pydantic model, not nested pydantic models. If there's a way to do it without the extra lines I've added, then that's great!

@brynpickering seems like this can be achieved with revalidate_instances? See here

I'll see if I can get this to work on my updates.
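
A sketch of what that could look like (assuming pydantic v2's ConfigDict; whether it fully removes the need for the manual sub-dict conversion above is untested here):

from pydantic import BaseModel, ConfigDict

class ConfigBaseModel(BaseModel):
    # re-validate nested model instances rather than trusting them as-is, so
    # updates are rebuilt into their pydantic models instead of staying as dicts
    model_config = ConfigDict(revalidate_instances="always")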
