From 72abd597ac032abed035ff1335a7d14fd4a46719 Mon Sep 17 00:00:00 2001
From: Anna Krystalli
Date: Fri, 2 Aug 2024 18:28:13 +0300
Subject: [PATCH 1/5] Bump version to v3.0.1

---
 docs/source/conf.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conf.py b/docs/source/conf.py
index 14a8e97b..e2e75d4f 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -73,7 +73,7 @@
 # -- Options for EPUB output
 epub_show_urls = 'footnote'
 
-schema_version = "v3.0.0"
+schema_version = "v3.0.1"
 # Use schema_branch variable to specify a branch in the schemas repository from which config schema will be source, especially for docson widgets.
 # Useful if the schema being documented hasn't been released to the `main` branch in the schemas repo yet. If version has been released already, set this to "main".
 schema_branch = "br-"+schema_version

From c613e844ba5b3d98892f153611d3d62566065e59 Mon Sep 17 00:00:00 2001
From: Anna Krystalli
Date: Fri, 2 Aug 2024 18:29:23 +0300
Subject: [PATCH 2/5] Add section on stability of model output schema.
Resolves #143
---
 docs/source/user-guide/model-output.md | 54 ++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/docs/source/user-guide/model-output.md b/docs/source/user-guide/model-output.md
index 18b41424..6116cc7f 100644
--- a/docs/source/user-guide/model-output.md
+++ b/docs/source/user-guide/model-output.md
@@ -122,3 +122,57 @@ Validation of forecast values occurs in two steps:
   * Loads only data that are needed
 * Disadvantages:
   * Harder to work with; teams and people who want to work with files need to install additional libraries
+
+(model-output-schema)=
+## The importance of a stable model output file schema
+
+> Note the difference in the following discussion between [hubverse schema](https://github.com/hubverse-org/schemas) - the schema which hub config files are validated against - and [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) - the mapping of model output columns to data types.
+
+Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it's necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
+
+Many common task IDs are covered by the [hubverse schema](#model-tasks-tasks-json-interactive-schema), are validated during hub config validation and should therefore have consistent and stable data types. However, there are a number of situations where a single consistent data type cannot be guaranteed, e.g.:
+- New rounds introducing changes in custom task ID value data types, which are not covered by the hubverse schema.
+- New rounds introducing changes in task IDs covered by the schema but which accept multiple data types (e.g. `scenario_id` where both `integer` and `character` are accepted or `age_group` where no data type is specified in the hubverse schema).
+- Adding new output types, which might introduce `output_type_id` values of a new data type.
+
+While validation of config files will alert hub administrators to discrepancies in task ID value data types across mideling tasks and rounds, any changes to a hub's config which has the potential to change the overall data type of model output columns after submissions have been collected could cause issues downstream and should be avoided. This is primarily a problem for parquet files, which encapsulate a schema within the file, but has a small chance to cause parsing errors in CSVs too.
+
+(output-type-id-datatype)=
+### The `output_type_id` column data type
+
+Output types are configured and handled differently than task IDs in the hubverse.
+
+On the one hand, **different output types can have output type ID values of varying data type**, and adherence to these data types is required by downstream, output type specific hubverse functionality like ensembling or visualisation.
+For example, hubs expect `double` output type ID values for `quantile` output types but `character` output type IDs for a `pmf` output type.
+
+On the other hand, the **use of a long format for hubverse model output files requires that these multiple data types are accommodated in a single `output_type_id` column.**
+This makes the output type ID column unique within the model output file in terms of how its data type is determined, configured and validated.
+
+During submission validation, two checks are performed on the `output_type_id` column:
+1. **Subsets of `output_type_id` column values** associated with a given output type are **checked for being able to be coerced to the correct data type defined in the config** for that output type.
+This ensures correct output type specific downstream handling of the data is possible.
+2. The **overall data type of the `output_type_id` column** matches the overall hub schema expectation.
+
+#### Determining the overall `output_type_id` column data type automatically
+
+To determine the overall `output_type_id` data type, the default behaviour is to automatically **detect the simplest data type that can encode all output type ID values across all rounds and output types** from the config.
+
+The benefit of this automatic detection is that it provides flexibility for the `output_type_id` column to adapt to the output types a hub is actually collecting. For example, a hub which only collects `mean` and `quantile` output types would, by default, have a `double` `output_type_id` column.
+
+The risk of this automatic detection, however, arises if, in subsequent rounds (after submissions have begun), the hub decides to also start collecting a `pmf` output type. This would change the default `output_type_id` column data type from `double` to `character` and cause a conflict between the `output_type_id` column data type in older and newer files when trying to open the hub as an `arrow` dataset.
+
+### Fixing the `output_type_id` column data type with the `output_type_id_datatype` property
+
+To enable hub administrators to configure and communicate the data type of the `output_type_id` column at a hub level, the hubverse schema allows for the use of an optional `output_type_id_datatype` property.
+This property should be provided at the top level of `tasks.json` (i.e. sibling to `rounds` and `schema_version`). It can take any of the following values: `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"` and can be used to fix the `output_type_id` column data type.
+
+```json
+{
+    "schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
+    "rounds": [...],
+    "output_type_id_datatype": "character"
+}
+```
+If not supplied or if `"auto"` is set, the default behaviour of automatically detecting the data type from `output_type_id` values is used.
+
+This gives hub administrators the ability to future proof the `output_type_id` column in their model output files if they are unsure whether they may start collecting an output type that could affect the schema, by setting the column to `"character"` (the safest data type that all other values can be encoded as) at the start of data collection.

From 10066b6c132e9ed4424ad206d14add30f33250b2 Mon Sep 17 00:00:00 2001
From: Anna Krystalli
Date: Fri, 2 Aug 2024 18:29:58 +0300
Subject: [PATCH 3/5] Introduce output_type_id_datatype property.

Resolves #143
---
 docs/source/quickstart-hub-admin/tasks-config.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/source/quickstart-hub-admin/tasks-config.md b/docs/source/quickstart-hub-admin/tasks-config.md
index 721f75d6..3c1ae1a3 100644
--- a/docs/source/quickstart-hub-admin/tasks-config.md
+++ b/docs/source/quickstart-hub-admin/tasks-config.md
@@ -247,4 +247,18 @@ There are [two ways](https://github.com/hubverse-org/schemas/blob/de580d56b8fc5c
- + ## Step 9: Optional - Set up `"output_type_id_datatype"`: + +Once all modeling tasks and rounds have been configured, you may also choose to fix the `output_type_id` column data type across all model output files of the hub using the optional `"output_type_id_datatype"` property. + +This property should be provided at the top level of `tasks.json` (i.e. sibling to `rounds` and `schema_version`) and can take any of the following values: `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"`. + +```json +{ + "schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json", + "rounds": [...], + "output_type_id_datatype": "character" +} +``` + +For more context and details on when and how to use this setting, please see the [`output_type_id` column data type](output-type-id-datatype) section on the **model output** page. From 48d98c7fefe14fa4caa0cd859d2d764b685f08e9 Mon Sep 17 00:00:00 2001 From: Anna Krystalli Date: Mon, 5 Aug 2024 10:49:46 +0300 Subject: [PATCH 4/5] Clarify potential implications. Correct typos --- docs/source/user-guide/model-output.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/user-guide/model-output.md b/docs/source/user-guide/model-output.md index 6116cc7f..089ae77e 100644 --- a/docs/source/user-guide/model-output.md +++ b/docs/source/user-guide/model-output.md @@ -128,14 +128,14 @@ Validation of forecast values occurs in two steps: > Note the difference in the following discussion between [hubverse schema](https://github.com/hubverse-org/schemas) - the schema which hub config files are validated against - and [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) - the mapping of model output columns to data types. 
-Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it's necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
+Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it is necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
 
 Many common task IDs are covered by the [hubverse schema](#model-tasks-tasks-json-interactive-schema), are validated during hub config validation and should therefore have consistent and stable data types. However, there are a number of situations where a single consistent data type cannot be guaranteed, e.g.:
 - New rounds introducing changes in custom task ID value data types, which are not covered by the hubverse schema.
 - New rounds introducing changes in task IDs covered by the schema but which accept multiple data types (e.g. `scenario_id` where both `integer` and `character` are accepted or `age_group` where no data type is specified in the hubverse schema).
 - Adding new output types, which might introduce `output_type_id` values of a new data type.
-While validation of config files will alert hub administrators to discrepancies in task ID value data types across mideling tasks and rounds, any changes to a hub's config which has the potential to change the overall data type of model output columns after submissions have been collected could cause issues downstream and should be avoided. This is primarily a problem for parquet files, which encapsulate a schema within the file, but has a small chance to cause parsing errors in CSVs too.
+While validation of config files will alert hub administrators to discrepancies in task ID value data types across modeling tasks and rounds, any changes to a hub's config which has the potential to change the overall data type of model output columns after submissions have been collected could cause issues downstream and should be avoided. These issues can range from data type casting being required in downstream analysis code that used to work, to being unable to filter on columns with data type discrepancies between files before collecting, to an inability to open hub model output data as an `arrow` dataset at all. They are primarily a problem for parquet files, which encapsulate a schema within the file, but have a small chance to cause parsing errors in CSVs too.
 
 (output-type-id-datatype)=
 ### The `output_type_id` column data type
@@ -174,5 +174,5 @@ This property should be provided at the top level of `tasks.json` (i.e. sibling
 ```
 If not supplied or if `"auto"` is set, the default behaviour of automatically detecting the data type from `output_type_id` values is used.
 
-This gives hub administrators the ability to future proof the `output_type_id` column in their model output files if they are unsure whether they may start collecting an output type that could affect the schema, by setting the column to `"character"` (the safest data type that all other values can be encoded as) at the start of data collection.
+This gives hub administrators the ability to future-proof the `output_type_id` column in their model output files if they are unsure whether they may start collecting an output type that could affect the schema, by setting the column to `"character"` (the safest data type that all other values can be encoded as) at the start of data collection.

From f330a802da0bb5dfafeaec55ddf51a6387495f94 Mon Sep 17 00:00:00 2001
From: Anna Krystalli
Date: Mon, 5 Aug 2024 17:38:02 +0300
Subject: [PATCH 5/5] Clarify what expectation of successful interaction with hub data is

---
 docs/source/user-guide/model-output.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/user-guide/model-output.md b/docs/source/user-guide/model-output.md
index 089ae77e..da2bf963 100644
--- a/docs/source/user-guide/model-output.md
+++ b/docs/source/user-guide/model-output.md
@@ -128,7 +128,7 @@ Validation of forecast values occurs in two steps:
 
 > Note the difference in the following discussion between [hubverse schema](https://github.com/hubverse-org/schemas) - the schema which hub config files are validated against - and [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) - the mapping of model output columns to data types.
 
-Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it is necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
+Because we store model output data as separate files but open them as a single [`arrow` dataset](https://arrow.apache.org/docs/r/reference/Dataset.html) using the `hubData` package, for a hub to be [successfully accessed and fully queryable across all columns as an `arrow dataset`](https://arrow.apache.org/docs/r/articles/dataset.html), it is necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
 
 Many common task IDs are covered by the [hubverse schema](#model-tasks-tasks-json-interactive-schema), are validated during hub config validation and should therefore have consistent and stable data types. However, there are a number of situations where a single consistent data type cannot be guaranteed, e.g.:
 - New rounds introducing changes in custom task ID value data types, which are not covered by the hubverse schema.