Simple interfaces for describing weather machine learning datasets
The purpose of mllam_data_interface is to provide a simple way to:
- describe a specification for what variables and attributes a specific torch.utils.data.Dataset expects a zarr-based training dataset to contain
- check a zarr training dataset against a specification
- provide storage for a collection of dataset specifications (these are in specs/ with path format specs/{spec_name}/{spec_version}.yaml)
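As a rough illustration of the path format, the sketch below maps a spec identifier such as neural_lam:v0.1.0 to the file it would be stored in. The helper name spec_path, its specs_root parameter, and the assumption that identifiers always take the form {spec_name}:{spec_version} are hypothetical and not part of the package's documented API.

```python
from pathlib import Path


def spec_path(spec_identifier: str, specs_root: str = "specs") -> Path:
    """Map "{spec_name}:{spec_version}" to specs/{spec_name}/{spec_version}.yaml.

    Hypothetical helper: the identifier format is an assumption inferred
    from the "neural_lam:v0.1.0" example, not a documented API.
    """
    name, sep, version = spec_identifier.partition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name:version', got {spec_identifier!r}")
    return Path(specs_root) / name / f"{version}.yaml"


print(spec_path("neural_lam:v0.1.0").as_posix())  # specs/neural_lam/v0.1.0.yaml
```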
The intention with this package is that you can use it both when a) creating a dataset and b) loading a dataset to check that the dataset contains what it needs to. For convenience you can also run it from the command line.
E.g. when loading a dataset:
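import torch.utils.data
import xarray as xr

import mllam_data_interface as mdi


class MyWeatherDataset(torch.utils.data.Dataset):
    # Spec this dataset class expects, written as "{spec_name}:{spec_version}"
    mllam_spec = "neural_lam:v0.1.0"

    def __init__(self, dataset_path):
        self.ds = xr.open_zarr(dataset_path)
        # Check the opened dataset against the spec before it is used
        mdi.check_dataset(ds=self.ds, spec_identifier=self.mllam_spec)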
To check that a training dataset matches a spec you can use mllam_data_interface from the command line:
python -m mllam_data_interface.check my_dataset.zarr neural_lam:v0.1.0
Which will output something like (if the dataset matches the spec):
python -m mllam_data_interface.check ../mllam-data-prep/example.danra.zarr neural_lam:v0.1.0
2024-04-09 11:04:21.943 | INFO | __main__:<module>:72 - Opening Zarr file at example.danra.zarr
2024-04-09 11:04:22.023 | INFO | __main__:<module>:76 - Dataset matches the spec!
Or (if the dataset doesn't match the spec):
2024-04-09 11:06:11.439 | INFO | __main__:<module>:72 - Opening Zarr file at example.danra.incomplete.zarr
2024-04-09 11:06:11.518 | ERROR | __main__:<module>:79 - Variable static is missing from the dataset.
All specifications are given as YAML files. Currently, the specifications allow you to specify:
- Which variables the dataset must contain, and which dimensions each should have (the order of the dimensions is also checked)
- Which attributes the dataset must have
For example, the spec neural_lam:v0.1.0, which specifies that the dataset must contain the variables static, state and forcing, could be written as:
variables:
  static:
    dims: [grid_index, feature]
  state:
    dims: [time, grid_index, state_feature]
  forcing:
    dims: [time, grid_index, forcing_feature]
attributes: [version]
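To make the checks concrete, here is a minimal sketch of how a spec of this shape could be compared against a dataset. It is not the package's implementation: the function find_spec_violations, the plain-dict representation of the dataset, and the wording of the messages (apart from the missing-variable message, which mirrors the log output shown earlier) are assumptions made for illustration. It uses only the standard library, so the spec is written as the dict a YAML parser would produce.

```python
def find_spec_violations(spec, variable_dims, attrs):
    """Return a list of human-readable problems; empty if everything matches.

    Illustrative only: `variable_dims` maps variable name -> tuple of dims,
    `attrs` is the mapping of dataset-level attributes.
    """
    problems = []
    for name, var_spec in spec.get("variables", {}).items():
        if name not in variable_dims:
            problems.append(f"Variable {name} is missing from the dataset.")
            continue
        expected = tuple(var_spec["dims"])
        actual = tuple(variable_dims[name])
        # Tuple comparison checks both the dimension names and their order.
        if actual != expected:
            problems.append(f"Variable {name} has dims {actual}, expected {expected}.")
    for attr in spec.get("attributes", []):
        if attr not in attrs:
            problems.append(f"Attribute {attr} is missing from the dataset.")
    return problems


# The spec above, written as the dict a YAML parser would produce.
spec = {
    "variables": {
        "static": {"dims": ["grid_index", "feature"]},
        "state": {"dims": ["time", "grid_index", "state_feature"]},
        "forcing": {"dims": ["time", "grid_index", "forcing_feature"]},
    },
    "attributes": ["version"],
}

# Hypothetical incomplete dataset: no "static", and "state" dims in the wrong order.
# For an xarray.Dataset `ds`, these could be built as
# {name: da.dims for name, da in ds.data_vars.items()} and ds.attrs.
variable_dims = {
    "state": ("grid_index", "time", "state_feature"),
    "forcing": ("time", "grid_index", "forcing_feature"),
}
attrs = {"version": "v0.1.0"}

for problem in find_spec_violations(spec, variable_dims, attrs):
    print(problem)
```

Returning a list of problems rather than stopping at the first one is a design choice of this sketch; it lets a dataset author see everything that needs fixing in one pass.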