Guidance for how to define a dataset #4

Open
mortenwh opened this issue Mar 13, 2023 · 0 comments


mortenwh commented Mar 13, 2023

Several discussions now point at the need for defining what we mean by datasets, and how we should define them, e.g., in relation to granularity and time coverage.

We need to agree on a definition and provide it somewhere.

At MET Norway we use the following definition, based on the Unidata Common Data Model:

A dataset is a collection of data. In the context of the data management model, the storage mode of the dataset is irrelevant, since access mechanisms can be decoupled from the storage layer as experienced by a data consumer. Typically, a dataset represents a number of variables in time and space. It is a pre-defined grouping or collection of related data for an intended use, and may be categorised by:

  • source, such as observations (in situ, remotely sensed) and numerical model projections and analyses;
  • processing level, such as “raw data” (values measured by an instrument), calibrated data, quality-controlled data, derived parameters (preferably with error estimates), and temporally and/or spatially aggregated variables;
  • data type, including point data, sections and profiles, lines and polylines, polygons, gridded data, volume data, and time series (of points, grids, etc.).

Data that share the same characteristics in each category, but cover different ranges of the independent variables and/or respond to a specific need, are normally considered part of a single dataset.

In the context of data preservation, a dataset consists of the data records and their associated knowledge (information, tools). In practice, our datasets should conform to the Unidata CDM dataset definition, as much as possible.

In order to best serve the data through the web services developed, the following guidance is given for defining datasets:

  • A dataset can be a collection of variables stored in, for example, a relational database or as flat files (e.g. NetCDF/CF or JSON)
  • A dataset is defined as a number of spatial and/or temporal variables
  • A dataset should be defined by its information content and not by the production method. This implies that the output of, for example, a numerical model may be divided into several related datasets. This is also important in order to serve the data efficiently through web services. For instance, model variables defined on different vertical coordinates should be separated into linked datasets, since some OGC services (e.g. WMS) are unable to handle mixed coordinates in the same dataset
  • A good dataset does not mix feature types, e.g. do not combine trajectories and gridded data in one dataset
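As a minimal sketch of how the guidance above could be checked automatically, the following assumes each candidate variable is described by a plain metadata dict; the attribute names `featureType` and `vertical_coordinate` are illustrative (loosely modelled on CF conventions), not an existing API:

```python
# Hypothetical validator for a proposed dataset grouping, following the
# guidance that a dataset should not mix feature types or vertical coordinates.

def validate_dataset(variables):
    """Return a list of guidance violations for a proposed dataset.

    `variables` maps variable name -> metadata dict with an illustrative
    'featureType' key and an optional 'vertical_coordinate' key.
    """
    problems = []

    # Rule: a good dataset does not mix feature types.
    feature_types = {v["featureType"] for v in variables.values()}
    if len(feature_types) > 1:
        problems.append(f"mixed feature types: {sorted(feature_types)}")

    # Rule: variables on different vertical coordinates belong in
    # separate, linked datasets (some OGC services cannot mix them).
    vertical = {v.get("vertical_coordinate") for v in variables.values()}
    if len(vertical) > 1:
        problems.append(
            f"mixed vertical coordinates: {sorted(c or 'none' for c in vertical)}"
        )
    return problems

# A grouping that combines a trajectory with gridded data should be split:
mixed = {
    "sst": {"featureType": "grid", "vertical_coordinate": "depth"},
    "drifter_temp": {"featureType": "trajectory", "vertical_coordinate": "depth"},
}
print(validate_dataset(mixed))
# → ["mixed feature types: ['grid', 'trajectory']"]
```

A real implementation would of course read these attributes from the files or catalogue records themselves; the point is only that the guidance is concrete enough to enforce mechanically.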

Most importantly, a dataset should be defined to meet a consumer need. This means that the specification of a dataset should not only follow the content guidelines just listed, but also address user needs for delivery, security, and preservation.

Maybe this, or parts of it, can be adopted?
