
Add config obj replace schemas #717

Draft
wants to merge 9 commits into base: add-config-obj

Conversation

irm-codebase (Contributor) commented Nov 21, 2024

Complements PR #704 by extending pydantic validation to other sections of model .yaml files.

Summary of changes in this pull request

So far

  • Adds data_tables schema to ensure user configurations fit our setup

Reviewer checklist

  • Test(s) added to cover contribution
  • Documentation updated
  • Changelog updated
  • Coverage maintained or improved

irm-codebase (Contributor, Author) commented Nov 21, 2024

@brynpickering before continuing on this, we need to ensure we have a common vision on where / how stuff is checked, and the pros/cons of certain approaches. Particularly for places where our setup conflicts with pydantic's typing-based approach.

A key assumption, based on previous discussions, is that we are attempting to remove all schema .yaml files in favour of a pydantic approach. This would lead to a final CalliopeModel composed of several pydantic structures, each replacing one of our separate schemas (data_table_schema.yaml, math_schema.yaml...).

The final object would look like this, sort of:

class CalliopeModel(BaseModel):
    """Calliope model class."""

    model_config = {"title": "calliope model"}
    config: CalliopeConfig
    techs: dict[str, Tech]
    nodes: dict[str, Node]
    data_tables: dict[str, DataTable] | None = None
    parameters: Parameter | None = None  # this might go in Math later, here for now
    math: Math | None = None  # inserted by us using build.mode???
    overrides: None | dict[str, Self] = None

Open question 1: The role of pydantic in checking user model files

From a pure software / data validation perspective: should we strive to catch invalid cases with pydantic as much as possible? To what extent should pydantic be used to ensure that the data going into later stages is 'good'? What is the effect of going either way?

Some examples of trivial checks with pydantic (the first is sketched after the list below):

  • Ensure all techs used in nodes.techs are defined in the default or in overrides.
  • Run tests to ensure that all dimensions given in dims: sections (e.g., in techs, data_tables, etc) are present in a dimensions dictionary section or in overrides of said section.
  • Ensuring that all parameter names used in techs or nodes sections are present in our definition or in user provided parameter definitions.
  • Ensuring that parameters that go 'into' the optimisation have numeric values.
  • etc
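A minimal sketch of the first check, assuming hypothetical, heavily simplified Node/tech structures rather than the real Calliope classes:

from pydantic import BaseModel, model_validator

class Node(BaseModel):
    # hypothetical minimal node definition: only the tech references matter here
    techs: dict[str, dict] | None = None

class CalliopeModel(BaseModel):
    techs: dict[str, dict]
    nodes: dict[str, Node]

    @model_validator(mode="after")
    def check_node_techs_are_defined(self):
        # every tech referenced at a node must exist in the top-level `techs` section
        for node_name, node in self.nodes.items():
            undefined = set(node.techs or {}) - set(self.techs)
            if undefined:
                raise ValueError(f"node '{node_name}' uses undefined techs: {sorted(undefined)}")
        return self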

Open question 2: Where and when to introduce our basic parameter math and user-extended math

At the moment, a good chunk of our parameter defaults / checks is in model_def_schema.yaml, which we cram into model._data.attrs. pydantic might change this.

  • Should all these defaults be in a pydantic model?
  • How flexible should this be? Some values can be either parameters or variables depending on the mode (e.g., flow_cap).
  • Hard-defining all these parameters within the pydantic model might prove too inflexible. Should we do so? An alternative would be just having our modes insert them into the math section of our configuration, with pydantic only checking that they are numeric (i.e., specific parameter names being fields in the pydantic model vs. just keys in the dictionary that uses them). Both options are contrasted in the sketch below.
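A rough contrast of the two options (field and parameter names here are purely illustrative):

from pydantic import BaseModel

# Option 1: defaults hard-defined as pydantic fields (strong checks, but rigid across modes)
class HardDefinedParams(BaseModel):
    flow_cap_max: float = float("inf")
    lifetime: float | None = None

# Option 2: modes insert parameter defaults into the math section; pydantic only checks numeric types
class Math(BaseModel):
    parameters: dict[str, int | float] = {}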

Open question 3: How to support user-provided 'passthrough' settings without weakening validation

This stems from issue #709. Since pydantic is type based, how we group items in the dictionary matters quite a bit for how strong or weak our validation becomes. Our current flat structure has advantages when it comes to user simplicity, but at a possibly heavy cost of validation strength.

As an example, our current techs definition:

supply_gas:
    name: "Natural gas import"
    color: "#C98AAD"
    carrier_out: gas
    source_use_max: .inf
    flow_cap_max: 2000
    lifetime: 25
    cost_flow_cap:
      data: 1
      index: monetary
      dims: costs
    cost_flow_in:
      data: 0.025 # 2.5p/kWh gas price #ppt
      index: monetary
      dims: costs

This would require a Tech model structured like the one below. The big issue here is that you need to allow extras at the same level as our hard-defined attributes! I worry this will get messy fast. A less flat structure, which separates math parameters, code-bound values (e.g., carrier_out), and user 'extras', would be stronger, but less user friendly if done poorly.

class Tech(BaseModel):
    model_config = {
        "extra": "allow"
    }
    name: str
    color: str | None = None
    carrier_out: str
    flow_cap_max: PositiveReal  = float('inf')  # Assume we define PositiveReal some place else
    ...
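
As the comment above notes, PositiveReal is assumed to be defined elsewhere; a minimal sketch of such a type (using pydantic v2's Annotated constraints) could be:

from typing import Annotated

from pydantic import Field

# strictly positive float; pydantic accepts infinity for floats by default,
# which matches YAML settings such as `source_use_max: .inf`
PositiveReal = Annotated[float, Field(gt=0)]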

@irm-codebase irm-codebase marked this pull request as draft November 21, 2024 11:55
@irm-codebase irm-codebase changed the base branch from main to add-config-obj November 21, 2024 13:58
irm-codebase (Contributor, Author) commented Nov 21, 2024

We can also keep all of the elements in CalliopeModel in separate places. It's only an illustration.

brynpickering (Member) commented:

Thanks @irm-codebase - this is what I had in mind w.r.t. a move to pydantic! One thing I think it will offer us is the ability to remove AttrDict entirely 👀... We still need to be able to load our flavour of YAML, but that could be something we do with less maintenance overhead.

RE techs, we have a few weird parameters, but keep in mind that none of them are really "hard" requirements in terms of YAML definitions. You can define all that info in data tables and then have almost nothing defined in YAML. At that point pydantic's magic is useless to us anyway, hence why I wouldn't put too much focus on maximising the value of pydantic; we'll only ever apply it to a subset of the data when data tables are used. Data validation should wait a bit longer and be applied once YAML and data table info has been merged (i.e. on the xarray arrays). This means that no, parameter settings should not be in the pydantic model (as they shouldn't be in a YAML schema). Instead, they should be in a YAML definition (e.g. as given in #712) whose structure is described by a pydantic model.

Still, we have to define some things in pydantic under tech defs since they are weird parameter structures that we let through (namely, carrier_[in/out/export]) to make the user journey easier. Otherwise, we say that a tech setting/option/parameter can be any single value or an indexed parameter. These kinds of things can be part of the pydantic model and have some fancier cross-validation done if possible, at least in terms of the dimensions that have been defined?

RE the config options we include. We should consider how best to handle overrides and templates - should they be validated separately or should the pydantic model only be generated based on the resolved dict (all overrides and template data applied)?

irm-codebase (Contributor, Author) commented Nov 22, 2024

@brynpickering some replies (it's a bit difficult to follow which section you are referring to, but I'll try my best here).

RE techs, we have a few weird parameters, but keep in mind that none of them are really "hard" requirements in terms of YAML definitions. You can define all that info in data tables and then have almost nothing defined in YAML. At that point pydantic's magic is useless to us anyway, hence why I wouldn't put too much focus on maximising the value of pydantic; we'll only ever apply it to a subset of the data when data tables are used.

Understood. One important caveat of this is that we cannot easily offer some features that users might want (e.g., setting max/min ranges for a numeric parameter) without having to write this code ourselves. This is fine as long as it is understood.

If we ever wish to, we could re-arrange our parsing to happen after templates and data tables are loaded into the dictionary (file -> our yaml loading -> complex pydantic -> backend). In this setup we would be able to rely on pydantic more.

For now, I will assume we'll go with the simplistic pydantic model.

Data validation should wait a bit longer and be applied once YAML and data table info has been merged (i.e. on the xarray arrays). This means that no, parameter settings should not be in the pydantic model (as they shouldn't be in a YAML schema). Instead, they should be in a YAML definition (e.g. as given in #712) whose structure is described by a pydantic model.

I think we pretty much agree on how this should be done, then.

Still, we have to define some things in pydantic under tech defs since they are weird parameter structures that we let through (namely, carrier_[in/out/export]) to make the user journey easier. Otherwise, we say that a tech setting/option/parameter can be any single value or an indexed parameter. These kinds of things can be part of the pydantic model and have some fancier cross-validation done if possible, at least in terms of the dimensions that have been defined?

This is the hard one. To answer the question(?) at the end: yes, but it depends. This comes back to my point on our current flat structure being troublesome. Checking certain things gets really hard when you have them all in a "salad" dictionary, even more so when it can be extended by users.

As far as I can tell, this is the problem:

class CalliopeModel(BaseModel):
    """Calliope model class."""

    model_config = {"title": "calliope model"}
    config: CalliopeConfig
    techs: dict[str, Tech]
    ...

class TechFlat(BaseModel):
    model_config = {
        "extra": "allow"
    }
    # ANY other value will pass without checks! If we do not define it, we do not check it!
    # User 'extras' are a superset of what we could check, meaning that we cannot check non-specified items
    # We essentially do not check the **structure** of anything beyond hard-coded stuff (e.g., name, carrier_in)
    name: str
    carrier_in: str | None = None
    carrier_out: str | None = None
    carrier_export: str | None = None

class TechDivided(BaseModel):
    model_config = {
        "extra": "forbid"
    }
    # May have anything as long as it's single values, to keep our IO simple
    extra: dict[str, float | str] | None = None
    # Structure is checked!
    params: dict[str, float]
    # Structure is checked!
    lookups: dict[str, str]
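
To make the trade-off concrete, a hypothetical comparison using the two models above:

# TechFlat: a typo'd key is silently kept as a user 'extra'
TechFlat(name="supply_gas", carrier_out="gas", flow_cap_maxx=2000)

# TechDivided: extra='forbid' rejects unknown top-level keys, so the same typo
# raises a ValidationError instead of slipping through
TechDivided(params={"flow_cap_max": 2000}, lookups={}, flow_cap_maxx=2000)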

I am struggling a bit with how to properly set up a flat structure with passthrough values that does not end up feeling pointless due to the superset nature of user extras. Basically, any value can be there, even values that would not be used by some combination of our settings, with the extra ambiguity of not knowing whether a user 'extra' is accidentally named the same as something important.

RE the config options we include. We should consider how best to handle overrides and templates - should they be validated separately or should the pydantic model only be generated based on the resolved dict (all overrides and template data applied)?

The answer depends on what we want to do.

  • If we want to just validate the file, with no loss of information, then both overrides and templates need to be accounted for in pydantic. In this approach, we never mess with the resulting dictionary before pydantic functions are done.
  • If we want to validate the structure before it enters calliope, then we resolve these before running pydantic.

Both have pros / cons. I think the second is the easiest in our current setup.

I'd say second... solved by our yaml reading function?

brynpickering (Member) commented:

Understood. One important caveat of this is that we cannot easily offer some features that users might want (e.g., setting max/min ranges for a numeric parameter) without having to write this code ourselves. This is fine as long as it is understood.

That would be dealt with by the parameter config being introduced in #712. We can have min/max or a more general way to define validation rules that will act on the xarray dataset (using the format defined in https://github.com/calliope-project/calliope/blob/959fea7096ce47dfaf6886be1e1fa9445b92bc69/src/calliope/config/model_data_checks.yaml). This is what I mean by data validation that is beyond the scope of pydantic: it is no longer working on a dictionary of data, but on arrays in an xarray dataset.

This comes back to my point on our current flat structure being troublesome.

I don't think it's the flat structure that causes the problem so much as allowing for user-defined keys. One option would be to consider dynamic fields. For instance, you can set type annotations for extra fields such that they have to at least follow a specific structure (as we define with our patternproperties currently). So the flat structure for a tech could be:

class TechFlat(BaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[RegexStrWithExclusions, IndexedParam | bool | int | float | str] = Field(union_mode='left_to_right')
    base_tech: Literal[...]
    carrier_in: str | list[str] | None = None
    carrier_out: str | list[str] | None = None
    carrier_export: str | list[str] | None = None
    # + computed fields for to/from based on `base_tech`?
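
(RegexStrWithExclusions is not defined in this thread; one way to sketch it, mirroring what patternproperties does in the current schemas, would be a constrained string key. The pattern below is an assumption, and excluding reserved names such as base_tech would still need a custom validator on top.)

from typing import Annotated

from pydantic import StringConstraints

RegexStrWithExclusions = Annotated[str, StringConstraints(pattern=r"^[a-z][a-z0-9_]*$")]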

brynpickering (Member) commented Nov 28, 2024

To update on the idea of using __pydantic_extra__, you could do this:

from typing import Literal, Self

from pydantic import BaseModel, Field

DATA_T = bool | int | float | str

class ConfigBaseModel(BaseModel):
    """A base class for creating pydantic models for Calliope configuration options."""

    _kwargs: dict = {}

    def update(self, update_dict: dict, deep: bool = False) -> Self:
        """Return a new iteration of the model with updated fields.

        Updates are validated and stored in the parent class in the `_kwargs` key.

        Args:
            update_dict (dict): Dictionary with which to update the base model.
            deep (bool, optional): Set to True to make a deep copy of the model. Defaults to False.

        Returns:
            BaseModel: New model instance.
        """
        new_dict: dict = {}
        # Iterate through dict to be updated and convert any sub-dicts into their respective pydantic model objects
        for key, val in update_dict.items():
            key_class = getattr(self, key)
            if isinstance(key_class, ConfigBaseModel):
                new_dict[key] = key_class._update(val)
            else:
                new_dict[key] = val
        updated = self.model_copy(update=new_dict, deep=deep)
        updated = self.model_validate(updated.model_dump())
        self._kwargs = update_dict
        return updated

    def _update(self, val):
        updated = self.update(val)
        self._kwargs = val
        return updated

class Param(ConfigBaseModel):
    """Uniform dictionairy for parameters."""
    data: DATA_T | list[bool | int | float | str]
    index: list[list[str]] = [[]]
    dims: list[str] = []

    def _update(self, val):
        if isinstance(val, dict):
            updated = self.update(val)
            self._kwargs = val
            return updated
        else:
            return val
        

class Tech(ConfigBaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[str, Param | DATA_T] = Field(union_mode='left_to_right')
    base_tech: Literal["supply"]  # etc. 
    carrier_in: str | list[str] | None = None
    carrier_out: str | list[str] | None = None
    carrier_export: str | list[str] | None = None


class Techs(ConfigBaseModel):
    model_config = {
        "extra": "allow"
    }
    __pydantic_extra__: dict[str, Tech]


class Model(ConfigBaseModel):
    techs: Techs

Which would work to manage a dict like:

{"techs": {"foo": {"bar": {"data": 1}, "baz": "foobar", "base_tech": "supply", "carrier_in": ["elec", "heat"]}}}

And would give you appropriate dot notation afterwards (e.g. techs.foo.bar.carrier_in)
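
A quick usage sketch with the dict above (hypothetical, assuming all the models are defined as shown):

model = Model(**{"techs": {"foo": {"bar": {"data": 1}, "baz": "foobar", "base_tech": "supply", "carrier_in": ["elec", "heat"]}}})
model.techs.foo.carrier_in  # ['elec', 'heat']
model.techs.foo.bar.data    # 1, with `bar` validated into a Param instance
model.techs.foo.baz         # 'foobar', kept as a plain passthrough value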

brynpickering (Member) commented Nov 28, 2024

@irm-codebase I've edited my previous comment based on playing around with the update method to handle the fact that an update to the data might add/remove a Param class from a technology (I left in the _kwargs stuff but realise you probably have an updated approach to this).

irm-codebase (Contributor, Author) commented:

@brynpickering thanks for the update!
It is similar to the design I had in mind, but I was not aware of __pydantic_extra__...

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

brynpickering (Member) commented:

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

I found that it just generated subdicts in my updated pydantic model, not nested pydantic models. If there's a way to do it without the extra lines I've added, then that's great!

irm-codebase (Contributor, Author) commented Dec 2, 2024

Is forcing subdicts to update to their pydantic model necessary, though? I thought that was done automatically (or it seemed to be during my tests).

I found that it just generated subdicts in my updated pydantic model, not nested pydantic models. If there's a way to do it without the extra lines I've added, then that's great!

@brynpickering seems like this can be achieved with revalidate_instances? See here

I'll see if I can get this to work on my updates.
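
A sketch of what that could look like (assuming pydantic v2's ConfigDict; whether it fully removes the need for the manual sub-dict conversion above is untested here):

from pydantic import BaseModel, ConfigDict

class ConfigBaseModel(BaseModel):
    # re-validate nested model instances rather than trusting them as-is, so
    # updates are rebuilt into their pydantic models instead of staying as dicts
    model_config = ConfigDict(revalidate_instances="always")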
