Updates for datastore examples and neural-lam config
Adds a number of interconnected changes:

- Updated README with (almost complete) instructions for using neural-lam with
  datastores and the dataclasses-based configs. Includes instructions on how to
  make neural-lam and datastore configs for the MEPS dataset released with
  v0.1.0 of neural-lam.
- Moved the DANRA datastore test example so that the directory structure
  follows `tests/datastore_examples/<datastore-shortname>/<example-name>` for
  both the DANRA and MEPS examples.
- Finalised the configuration for loss-function feature weighting.
- Updated the npyfilesmeps datastore test-case dataset to match the changes in
  neural-lam for supporting datastores (this includes changes to the
  npyfilesmeps config). The actual file isn't yet updated in the
  `mllam-testdata` bucket on AWS S3 (that will follow in a separate commit
  with a URL update).
leifdenby committed Nov 12, 2024
1 parent 8cc6c3d commit 731910f
Showing 14 changed files with 242 additions and 137 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -79,6 +79,7 @@ tags

# macos
.DS_Store
+__MACOSX

# pdm (https://pdm-project.org/en/stable/)
.pdm-python
123 changes: 91 additions & 32 deletions README.md
@@ -150,7 +150,10 @@ training:
      v100m: 1.0
```
-For now the neural-lam config only defines two things: 1) the kind of data store and the path to its config, and 2) the weighting of different features in the loss function.
+For now the neural-lam config only defines two things: 1) the kind of data
+store and the path to its config, and 2) the weighting of different features in
+the loss function. If you don't define the state feature weighting, it will
+default to weighting all features equally.
(This example is taken from the `tests/datastore_examples/mdp` directory.)
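As a rough sketch of what this gives you at the Python level: `load_config_and_datastore` appears in the `neural_lam/config.py` diff further down, while the `config.training.state_feature_weighting` attribute path is an assumption for illustration, inferred from the dataclasses in that diff.

```python
# A sketch, not the official API surface: load a neural-lam config and
# inspect the manual feature weights. `load_config_and_datastore` and
# `ManualStateFeatureWeighting` are shown in the config.py diff below;
# the attribute path on the training section is an assumption.
from neural_lam.config import load_config_and_datastore

config, datastore = load_config_and_datastore(config_path="config.yaml")
weighting = config.training.state_feature_weighting
print(weighting.weights)  # e.g. {"u100m": 1.0, "v100m": 1.0}
```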

@@ -270,11 +273,88 @@ Graphs used in the initial paper are also available for download at the same link
Note that this is far too little data to train any useful models, but all pre-processing and training steps can be run with it.
It should thus be useful to make sure that your python environment is set up correctly and that all the code can be run without any issues.

+The following datastore configuration works with the MEPS dataset:

+```yaml
+# meps.datastore.yaml
+dataset:
+  name: meps_example
+  num_forcing_features: 16
+  var_longnames:
+  - pres_heightAboveGround_0_instant
+  - pres_heightAboveSea_0_instant
+  - nlwrs_heightAboveGround_0_accum
+  - nswrs_heightAboveGround_0_accum
+  - r_heightAboveGround_2_instant
+  - r_hybrid_65_instant
+  - t_heightAboveGround_2_instant
+  - t_hybrid_65_instant
+  - t_isobaricInhPa_500_instant
+  - t_isobaricInhPa_850_instant
+  - u_hybrid_65_instant
+  - u_isobaricInhPa_850_instant
+  - v_hybrid_65_instant
+  - v_isobaricInhPa_850_instant
+  - wvint_entireAtmosphere_0_instant
+  - z_isobaricInhPa_1000_instant
+  - z_isobaricInhPa_500_instant
+  var_names:
+  - pres_0g
+  - pres_0s
+  - nlwrs_0
+  - nswrs_0
+  - r_2
+  - r_65
+  - t_2
+  - t_65
+  - t_500
+  - t_850
+  - u_65
+  - u_850
+  - v_65
+  - v_850
+  - wvint_0
+  - z_1000
+  - z_500
+  var_units:
+  - Pa
+  - Pa
+  - W/m\textsuperscript{2}
+  - W/m\textsuperscript{2}
+  - "-"
+  - "-"
+  - K
+  - K
+  - K
+  - K
+  - m/s
+  - m/s
+  - m/s
+  - m/s
+  - kg/m\textsuperscript{2}
+  - m\textsuperscript{2}/s\textsuperscript{2}
+  - m\textsuperscript{2}/s\textsuperscript{2}
+  num_timesteps: 65
+  num_ensemble_members: 2
+  step_length: 3
+  remove_state_features_with_index: [15]
+grid_shape_state:
+- 268
+- 238
+projection:
+  class_name: LambertConformal
+  kwargs:
+    central_latitude: 63.3
+    central_longitude: 15.0
+    standard_parallels:
+    - 63.3
+    - 63.3
+```
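The `projection` section reads like a class name plus keyword arguments for `cartopy`; a minimal sketch of that interpretation (an assumption about intent, since the datastore's actual lookup code is not part of this diff):

```python
# A sketch of one plausible reading of the projection config: resolve
# class_name in cartopy.crs and pass kwargs through unchanged. How the
# npyfilesmeps datastore actually does this is not shown in this diff.
import cartopy.crs as ccrs

projection_config = {
    "class_name": "LambertConformal",
    "kwargs": {
        "central_latitude": 63.3,
        "central_longitude": 15.0,
        "standard_parallels": [63.3, 63.3],
    },
}

ProjectionClass = getattr(ccrs, projection_config["class_name"])
projection = ProjectionClass(**projection_config["kwargs"])
print(projection)  # a Lambert conformal CRS covering the MEPS domain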

+You can then use this in a neural-lam configuration file like this:

+```yaml
+# config.yaml
+datastore:
+  kind: npyfilesmeps
+  config_path: meps.datastore.yaml
Expand All @@ -286,43 +366,23 @@ training:
v100m: 1.0
```
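Note that `config_path` here is resolved relative to the neural-lam config file itself, as the `neural_lam/config.py` diff below spells out; a small sketch of that resolution:

```python
# A sketch mirroring the path handling in load_config_and_datastore
# (see the neural_lam/config.py diff below): the datastore config is
# looked up relative to the directory of the neural-lam config file.
from pathlib import Path

config_path = "data/myexperiment/config.yaml"  # hypothetical location
datastore_config_path = Path(config_path).parent / "meps.datastore.yaml"
print(datastore_config_path)  # data/myexperiment/meps.datastore.yaml
```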

## Pre-processing

There are two main steps in the pre-processing pipeline: creating the graph and creating additional features/normalisation/boundary-masks.

The amount of pre-processing required will depend on what kind of datastore you will be using for training.

### Additional inputs

-#### MultiZarr Datastore

-* `python -m neural_lam.create_boundary_mask`
-* `python -m neural_lam.create_datetime_forcings`
-* `python -m neural_lam.create_norm`
+For npy-file based datastores you must separately run the command that creates the variables used for standardization:

-#### NpyFiles Datastore

-#### MDP (mllam-data-prep) Datastore
+```bash
+python -m neural_lam.datastore.npyfilesmeps.compute_standardization_stats <path-to-datastore-config>
+```
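Conceptually, the standardization variables are per-feature means and standard deviations over the training data; a toy sketch of the idea (the script's internals are not part of this diff, so shapes and names here are illustrative):

```python
# A toy sketch of per-feature standardization statistics; illustrative
# only, since compute_standardization_stats itself is not shown here.
import numpy as np

rng = np.random.default_rng(seed=0)
state = rng.normal(size=(100, 17))   # stand-in for (time, feature) data

mean = state.mean(axis=0)            # one mean per feature
std = state.std(axis=0)              # one standard deviation per feature
standardized = (state - mean) / std  # what the model is trained on
```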

-An overview of how the different pre-processing steps, training and files depend on each other is given in this figure:
-<p align="middle">
-  <img src="figures/component_dependencies.png"/>
-</p>
-In order to start training models at least three pre-processing steps have to be run:
+### Graph creation

-### Create graph
Run `python -m neural_lam.create_mesh` with suitable options to generate the graph you want to use (see `python -m neural_lam.create_mesh --help` for a list of options).
The graphs used for the different models in the [paper](#graph-based-neural-weather-prediction-for-limited-area-modeling) can be created as:

-* **GC-LAM**: `python -m neural_lam.create_mesh --graph multiscale`
-* **Hi-LAM**: `python -m neural_lam.create_mesh --graph hierarchical --hierarchical` (also works for Hi-LAM-Parallel)
-* **L1-LAM**: `python -m neural_lam.create_mesh --graph 1level --levels 1`
+* **GC-LAM**: `python -m neural_lam.create_mesh <neural-lam-config-path> --graph multiscale`
+* **Hi-LAM**: `python -m neural_lam.create_mesh <neural-lam-config-path> --graph hierarchical --hierarchical` (also works for Hi-LAM-Parallel)
+* **L1-LAM**: `python -m neural_lam.create_mesh <neural-lam-config-path> --graph 1level --levels 1`

The graph-related files are stored in a directory called `graphs`.

-### Create remaining static features
-To create the remaining static files run `python -m neural_lam.create_grid_features` and `python -m neural_lam.create_parameter_weights`.

## Weights & Biases Integration
The project is fully integrated with [Weights & Biases](https://www.wandb.ai/) (W&B) for logging and visualization, but can just as easily be used without it.
When W&B is used, training configuration, training/test statistics and plots are sent to the W&B servers and made available in an interactive web interface.
@@ -340,12 +400,11 @@
Expand All @@ -340,12 +400,11 @@ wandb off
```

## Train Models
-Models can be trained using `python -m neural_lam.train_model <datastore_type> <datastore_config_path>`.
+Models can be trained using `python -m neural_lam.train_model <config_path>`.
Run `python -m neural_lam.train_model --help` for a full list of training options.
A few of the key ones are outlined below:

-* `<datastore_type>`: The kind of datastore that you are using (should be one of `npyfiles`, `multizarr` or `mllam`)
-* `<datastore_config_path>`: Path to the data store configuration file
+* `<config_path>`: Path to the configuration for neural-lam (for example in `data/myexperiment/config.yaml`).
* `--model`: Which model to train
* `--graph`: Which graph to use with the model
* `--epochs`: Number of epochs to train for
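As a hedged sketch of wiring these options together from a script rather than a shell (the `graph_lam` model name is an assumption based on the models in the repository, not something stated in this diff):

```python
# A sketch only: invoke the training entry point with the options
# listed above. The positional config path and the flags mirror the
# README text; "graph_lam" as a --model value is an assumption.
import subprocess

subprocess.run(
    [
        "python", "-m", "neural_lam.train_model",
        "data/myexperiment/config.yaml",
        "--model", "graph_lam",
        "--graph", "multiscale",
        "--epochs", "10",
    ],
    check=True,
)
```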
23 changes: 20 additions & 3 deletions neural_lam/config.py
@@ -51,11 +51,11 @@ class ManualStateFeatureWeighting:
    Attributes
    ----------
-    values : Dict[str, float]
+    weights : Dict[str, float]
        Manual weights for the state features.
    """

-    values: Dict[str, float]
+    weights: Dict[str, float]


@dataclasses.dataclass
@@ -123,6 +123,17 @@ class _(dataclass_wizard.JSONWizard.Meta):

        tag_key = "__config_class__"
        auto_assign_tags = True
+        # ensure that all parts of the loaded configuration match the
+        # dataclasses used
+        # TODO: this should be enabled once
+        # https://github.com/rnag/dataclass-wizard/issues/137 is fixed, but
+        # currently cannot be used together with `auto_assign_tags` due to a
+        # bug it seems
+        # raise_on_unknown_json_key = True


+class InvalidConfigError(Exception):
+    pass


def load_config_and_datastore(
@@ -142,7 +153,13 @@ def load_config_and_datastore(
    tuple[NeuralLAMConfig, Union[MDPDatastore, NpyFilesDatastoreMEPS]]
        The Neural-LAM configuration and the loaded datastore.
    """
-    config = NeuralLAMConfig.from_yaml_file(config_path)
+    try:
+        config = NeuralLAMConfig.from_yaml_file(config_path)
+    except dataclass_wizard.errors.UnknownJSONKey as ex:
+        raise InvalidConfigError(
+            "There was an error loading the configuration file at "
+            f"{config_path}. "
+        ) from ex
    # datastore config is assumed to be relative to the config file
    datastore_config_path = (
        Path(config_path).parent / config.datastore.config_path
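Taken together, the `__config_class__` tag key selects which weighting dataclass to instantiate, and unknown keys in the YAML now surface as `InvalidConfigError` instead of a raw `dataclass_wizard` exception. A minimal sketch of how a caller might lean on that, using only names visible in this diff:

```python
# A sketch using only names from this diff: load_config_and_datastore
# and InvalidConfigError both live in neural_lam.config.
from neural_lam.config import InvalidConfigError, load_config_and_datastore

try:
    config, datastore = load_config_and_datastore(config_path="config.yaml")
except InvalidConfigError as err:
    # raised when the YAML contains keys the config dataclasses don't define
    raise SystemExit(f"Invalid neural-lam configuration: {err}")
```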