Idea: Save Dataset Creation Config as YAML Attribute for Improved Reproducibility
Description:
To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.
Currently, the user only gets a warning that no new datastore was created and that the existing one will be used instead.
Key points:
Serialize the dataset creation config (self._config) to YAML format
Save the serialized YAML config as an attribute of the dataset (self._ds)
This provides a record of the exact settings used to generate the dataset
Enables detecting mismatches between dataset and config
Improves reproducibility by allowing datasets to be recreated from the saved config
Implementation:
In the MDPDatastore constructor, serialize the dataset creation config (self._config) to YAML:
    import yaml

    # Serialize config to YAML string
    config_yaml = yaml.dump(self._config)
Save the YAML config string as a dataset attribute, e.g. "creation_config":
    self._ds.attrs["creation_config"] = config_yaml
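The two steps above can be sketched as a self-contained example. This is only an illustration under stated assumptions: the config is represented as a plain dict (a real config object may first need converting to a dict, since `yaml.safe_dump` only handles basic Python types), `FakeDataset` is a hypothetical stand-in that models nothing but the `attrs` dict of an xarray dataset, and the config keys are made up.

```python
import yaml


class FakeDataset:
    """Hypothetical stand-in for a dataset; only the `attrs` dict is modeled."""

    def __init__(self):
        self.attrs = {}


def attach_creation_config(ds, config: dict) -> str:
    """Serialize `config` to YAML and store it in ds.attrs["creation_config"]."""
    # sort_keys=True gives a stable key order, so the same config
    # always produces the same YAML string.
    config_yaml = yaml.safe_dump(config, sort_keys=True)
    ds.attrs["creation_config"] = config_yaml
    return config_yaml


# Illustrative config only; these keys are not the real mllam-data-prep schema.
example_config = {"inputs": {"source": "example.zarr"}, "chunk_size": 10}

ds = FakeDataset()
attach_creation_config(ds, example_config)
```

Storing the config as a single string keeps the attribute compatible with formats that only accept simple attribute values, at the cost of needing a parse step on load.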
When loading datasets (e.g. in get_dataarray), check for the presence of the "creation_config" attribute
If present, deserialize the YAML back to a config object:
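A minimal sketch of this load-time check, assuming the stored config is a YAML string of basic Python types and that the current config is available as a plain dict. The function name `config_matches_dataset` is hypothetical, and treating a missing attribute as a mismatch is one possible policy, not something the proposal specifies.

```python
import yaml


def config_matches_dataset(ds_attrs: dict, current_config: dict) -> bool:
    """Return True if the config stored in the attributes equals current_config."""
    stored_yaml = ds_attrs.get("creation_config")
    if stored_yaml is None:
        # The dataset was written without the attribute, so it cannot be verified.
        return False
    # Compare parsed structures rather than raw strings, so that
    # formatting or key-order differences do not count as a mismatch.
    stored_config = yaml.safe_load(stored_yaml)
    return stored_config == current_config


# Illustrative attributes; the config keys are made up.
attrs = {"creation_config": yaml.safe_dump({"chunk_size": 10})}

same = config_matches_dataset(attrs, {"chunk_size": 10})
changed = config_matches_dataset(attrs, {"chunk_size": 20})
missing = config_matches_dataset({}, {"chunk_size": 10})
```

A caller could use the result to decide whether to reuse the existing dataset or to recreate it, instead of only emitting a warning.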
To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.
I completely agree with this! I just wasn't sure which format to use, and I was wondering whether plain YAML would somehow be a "dirty" approach 😆 I would propose that serialising the config into the dataset should be added to mllam-data-prep, and that the logic for deciding whether to recreate the dataset should maybe reside in nl (that's what the cool kids call neural-lam these days, I have heard).