Guidance for how to define a dataset #4

Open
mortenwh opened this issue Mar 13, 2023 · 0 comments


mortenwh commented Mar 13, 2023

Several discussions now point at the need for defining what we mean by datasets, and how we should define them, e.g., in relation to granularity and time coverage.

We need to agree on a definition and provide it somewhere.

At MET Norway we use the following definition, based on the Unidata Common Data Model:

A dataset is a collection of data. In the context of the data management model, the storage mode of the dataset is irrelevant, since access mechanisms can be decoupled from the storage layer as experienced by a data consumer. Typically, a dataset represents a number of variables in time and space. It is a pre-defined grouping or collection of related data for an intended use, and may be categorised by:

  • source, such as observations (in situ, remotely sensed) and numerical model projections and analyses;
  • processing level, such as “raw data” (values measured by an instrument), calibrated data, quality-controlled data, derived parameters (preferably with error estimates), and temporally and/or spatially aggregated variables;
  • data type, including point data, sections and profiles, lines and polylines, polygons, gridded data, volume data, and time series (of points, grids, etc.).

Data that share the same characteristics in each category, but cover different ranges of the independent variables and/or respond to a specific need, are normally considered part of a single dataset.

In the context of data preservation, a dataset consists of the data records and their associated knowledge (information, tools). In practice, our datasets should conform to the Unidata CDM dataset definition, as much as possible.

In order to best serve the data through the web services developed, the following guidance is given for defining datasets:

  • A dataset can be a collection of variables stored in, for example, a relational database or as flat files (e.g. NetCDF/CF or JSON)
  • A dataset is defined as a number of spatial and/or temporal variables
  • A dataset should be defined by its information content and not by the production method. This implies that the output of, for example, a numerical model may be divided into several related datasets. This is also important in order to serve the data efficiently through web services. For instance, model variables defined on different vertical coordinates should be separated into linked datasets, since some OGC services (e.g. WMS) are unable to handle mixed coordinates in the same dataset
  • A good dataset does not mix feature types, e.g. do not combine trajectories and gridded data in one dataset
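As a minimal sketch of how the guidance above could be checked automatically, the following assumes each candidate variable is described by a plain metadata dict; the attribute names `featureType` and `vertical_coordinate` are illustrative (loosely modelled on CF conventions), not an existing API:

```python
# Hypothetical validator for a proposed dataset grouping, following the
# guidance that a dataset should not mix feature types or vertical coordinates.

def validate_dataset(variables):
    """Return a list of guidance violations for a proposed dataset.

    `variables` maps variable name -> metadata dict with an illustrative
    'featureType' key and an optional 'vertical_coordinate' key.
    """
    problems = []

    # Rule: a good dataset does not mix feature types.
    feature_types = {v["featureType"] for v in variables.values()}
    if len(feature_types) > 1:
        problems.append(f"mixed feature types: {sorted(feature_types)}")

    # Rule: variables on different vertical coordinates belong in
    # separate, linked datasets (some OGC services cannot mix them).
    vertical = {v.get("vertical_coordinate") for v in variables.values()}
    if len(vertical) > 1:
        problems.append(
            f"mixed vertical coordinates: {sorted(c or 'none' for c in vertical)}"
        )
    return problems

# A grouping that combines a trajectory with gridded data should be split:
mixed = {
    "sst": {"featureType": "grid", "vertical_coordinate": "depth"},
    "drifter_temp": {"featureType": "trajectory", "vertical_coordinate": "depth"},
}
print(validate_dataset(mixed))
# → ["mixed feature types: ['grid', 'trajectory']"]
```

A real implementation would of course read these attributes from the files or catalogue records themselves; the point is only that the guidance is concrete enough to enforce mechanically.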

Most importantly, a dataset should be defined to meet a consumer need. This means that the specification of a dataset should not only follow the content guidelines just listed, but also address user needs for delivery, security, and preservation.

Maybe this, or parts of it, can be adopted?
