Handling of orderings of variable levels for cdf outputs and pmf outputs for ordinal variables #24

elray1 (Maintainer) · 2024-06-27T17:59:29Z

For the cdf output type as well as the pmf output type for targets with a target_type of ordinal, the order of the output_type_ids matters for validation, plotting and evaluation. This affects functionality in several of our packages:

- For validation of cdf types, we need to ensure that the probabilities are non-decreasing across the category levels or numeric values of the response variable at which cdf values are elicited; this requires knowing the correct order of the category levels for categorical variables. See hubValidations#78, "Check on ascending order of cdf values incorrect when output_type_id data type is character".
- For plotting, we will probably want to arrange the categories in the right order. I don't think we have relevant functionality in hubVis yet, but eventually we might add one or more functions to that package that could plot pmf and cdf predictions, and those functions would need the order.
- For evaluation, some scores depend on getting the categories in the right order (e.g., this is required to calculate RPS for the pmf output type).
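To make the dependence on order concrete, here is a minimal Python sketch (not hubValidations or any hubverse scoring code). The function names, the example levels, and the probabilities are all hypothetical, and the RPS shown is the unnormalized form: the sum of squared differences between the predicted and observed cumulative distributions.

```python
from itertools import accumulate

# Hypothetical ordinal levels, in their intended order.
LEVEL_ORDER = ["low", "moderate", "high", "very high"]


def is_nondecreasing(cdf_by_level, level_order, tol=1e-12):
    """Return True if the cdf values never decrease when read in level_order."""
    values = [cdf_by_level[level] for level in level_order]
    return all(b >= a - tol for a, b in zip(values, values[1:]))


def ranked_probability_score(pmf_by_level, level_order, observed_level):
    """Unnormalized RPS: sum of squared differences between the predicted
    cumulative probabilities and the observed cumulative indicator."""
    probs = [pmf_by_level[level] for level in level_order]
    predicted_cdf = list(accumulate(probs))
    observed_index = level_order.index(observed_level)
    observed_cdf = [1.0 if i >= observed_index else 0.0
                    for i in range(len(level_order))]
    return sum((p - o) ** 2 for p, o in zip(predicted_cdf, observed_cdf))


cdf = {"low": 0.1, "moderate": 0.3, "high": 0.6, "very high": 1.0}
pmf = {"low": 0.1, "moderate": 0.2, "high": 0.3, "very high": 0.4}

# The same cdf passes with the intended order and fails with alphabetical order.
print(is_nondecreasing(cdf, LEVEL_ORDER))          # True
print(is_nondecreasing(cdf, sorted(LEVEL_ORDER)))  # False

# The RPS for the same forecast also changes with the assumed order.
print(ranked_probability_score(pmf, LEVEL_ORDER, "high"))
print(ranked_probability_score(pmf, sorted(LEVEL_ORDER), "high"))
```

Sorting character levels alphabetically puts "high" before "low", so both the validation check and the score change even though the forecast itself is unchanged.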

If the output_type_id values are numeric, the order is implied and so there is no problem. But for settings where the output_type_id values are strings giving levels of a category, these functions will need a way to find the correct order. There are two issues here:

1. how the hub can specify this order
2. how the order can be accessed by hubverse functionality when needed

1. how the hub can specify this order

Our idea is that the hub should list the output_type_id values in the correct order in their tasks.json file. Here are two examples (both drawn from the example-complex-forecast-hub).

Example 1: an ordinal variable with levels "low", "moderate", "high", "very high" might have an entry like the following in their config file:

...
                    "output_type": {
                        "pmf": {
                            "output_type_id": {
                                "required": [
                                    "low",
                                    "moderate",
                                    "high",
                                    "very high"
                                ],
                                "optional": null
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
                            }
                        }
                    },
...

Example 2: a hub collecting cdf values at numeric values of a target variable might have the following in their config file:

                    "output_type": {
                        "cdf": {
                            "output_type_id": {
                                "required": [
                                    0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,
                                    2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75, 4,
                                    4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6,
                                    6.25, 6.5, 6.75, 7, 7.25, 7.5, 7.75, 8,
                                    8.25, 8.5, 8.75, 9, 9.25, 9.5, 9.75, 10,
                                    10.25, 10.5, 10.75, 11, 11.25, 11.5, 11.75,
                                    12, 12.25, 12.5, 12.75, 13, 13.25, 13.5,
                                    13.75, 14, 14.25, 14.5, 14.75, 15, 15.25,
                                    15.5, 15.75, 16, 16.25, 16.5, 16.75, 17,
                                    17.25, 17.5, 17.75, 18, 18.25, 18.5, 18.75,
                                    19, 19.25, 19.5, 19.75, 20, 20.25, 20.5,
                                    20.75, 21, 21.25, 21.5, 21.75, 22, 22.25,
                                    22.5, 22.75, 23, 23.25, 23.5, 23.75, 24,
                                    24.25, 24.5, 24.75, 25
                                ],
                                "optional": null
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
                            }
                        }
                    },

2. how the order can be accessed by hubverse functionality when needed

First, note that the ordering can be extracted from the tasks.json file if needed.
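As a rough illustration of that extraction, the Python sketch below walks a parsed config and collects the "required" output_type_id values in the order they are listed. The enclosing "rounds" and "model_tasks" keys are an assumption about the overall file layout (the fragments in this thread show only the "output_type" portion), and the function name is hypothetical.

```python
import json

# Minimal stand-in for a tasks.json file; only the parts needed here are included.
CONFIG_TEXT = """
{
  "rounds": [
    {
      "model_tasks": [
        {
          "output_type": {
            "pmf": {
              "output_type_id": {
                "required": ["low", "moderate", "high", "very high"],
                "optional": null
              }
            }
          }
        }
      ]
    }
  ]
}
"""


def output_type_id_orders(config, output_type):
    """Collect the listed order of required output_type_id values for one
    output type, one list per model task that defines that output type."""
    orders = []
    for round_ in config.get("rounds", []):
        for task in round_.get("model_tasks", []):
            spec = task.get("output_type", {}).get(output_type)
            if spec is None:
                continue
            # A JSON null becomes None, so fall back to an empty list.
            required = spec["output_type_id"].get("required") or []
            orders.append(list(required))
    return orders


config = json.loads(CONFIG_TEXT)
print(output_type_id_orders(config, "pmf"))
```

JSON arrays preserve element order when parsed, so the order written by the hub administrators is the order returned.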

In R, the most natural representation of ordered categorical variables is with an ordered factor. However, because the factor values are stored in the output_type_id column, which may contain output_type_id values for other targets or output types, it is not feasible to convert the whole column to an ordered factor. Therefore, it seems necessary to store this information separately.

The best option I see is to add arguments, to any functions that need this information, that specify the order and/or allow the function to look up the right order. For example, this could take the form of (1) a hub connection, which contains a reference to the tasks.json file, or (2) some kind of manual specification of the order, e.g., as a character vector or numeric vector as appropriate. It may be necessary to include enough structure here to link the ordering to the target (e.g., if a hub collected two different pmf or cdf targets, we need to know how to look up the right ordering to use).
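One possible shape for such an argument is sketched below in Python purely for illustration; an R function could take analogous arguments. Everything here is hypothetical: the function and argument names are invented, and `orders_by_target` stands in for whatever a hub connection would look up from tasks.json.

```python
def resolve_level_order(target, level_order=None, orders_by_target=None):
    """Pick the level order for one target.

    A manually supplied level_order wins; otherwise the order is looked up
    by target in orders_by_target (a stand-in for a hub-config lookup).
    """
    if level_order is not None:
        return list(level_order)
    if orders_by_target is not None and target in orders_by_target:
        return list(orders_by_target[target])
    raise ValueError(f"no level order available for target {target!r}")


def sort_rows_by_level(rows, target, level_order=None, orders_by_target=None):
    """Return the rows for one target, sorted by the resolved level order."""
    order = resolve_level_order(target, level_order, orders_by_target)
    rank = {level: i for i, level in enumerate(order)}
    target_rows = [row for row in rows if row["target"] == target]
    return sorted(target_rows, key=lambda row: rank[row["output_type_id"]])


# Two hypothetical pmf targets with different level sets in the same column.
orders = {
    "intensity": ["low", "moderate", "high", "very high"],
    "trend": ["decrease", "stable", "increase"],
}
rows = [
    {"target": "intensity", "output_type_id": "high", "value": 0.3},
    {"target": "trend", "output_type_id": "stable", "value": 0.5},
    {"target": "intensity", "output_type_id": "low", "value": 0.7},
]

sorted_rows = sort_rows_by_level(rows, "intensity", orders_by_target=orders)
print([row["output_type_id"] for row in sorted_rows])  # ['low', 'high']
```

Keying the lookup by target is what allows two pmf or cdf targets with different level sets to share one output_type_id column.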

issues filed

I've filed issues and/or PRs in the following places related to this:

elray1 (Maintainer, Author) · 2024-08-28T15:05:07Z

Following up on this to try to sum up discussion from a couple of different places including devteam meeting just now.

High level points:

1. My proposal above doesn't work because of the possibility of "interleaving" required and optional values for output_type_id: the two sets are listed separately, so their relative order cannot be recovered from the config.
2. For the pmf output_type, we think listing any values under "optional" does not make sense. For pmf, hubs should generally require that probabilities are submitted for all output_type_id levels. We should just get rid of the ability to specify "optional" values for pmf output_type_ids. Issues related to this should be filed on schemas and maybe hubValidations. (Exactly what we say in those issues might depend on our resolution to other points below, so I have held off on filing them for now.)
3. For the cdf output_type, there are two questions:
    i. Do we want to (continue to) support a mix of optional and required values for this output type?
    ii. If so, how should we do it? See below for two specific proposals.
4. For the quantile output_type, we currently allow hubs to specify that some output_type_ids (i.e., some quantile levels) are required and others are optional.
    i. Same as for cdf, do we want to (continue to) support a mix of optional and required values for this output type? One note here is that WIS values computed from different sets of quantile levels are not meaningfully comparable.
    ii. Note that we don't have the same ordering problem for the quantile output_type because quantile levels are always numeric (or castable to numeric) and the correct ordering is the numeric ordering.
5. The handling of whether or not submission of a particular output_type is required is different for sample than it is for the other output types. For sample, we have introduced a boolean field, is_required, indicating whether this output type is required (subject to crossing with the required/optional status for other task ids in a particular task group). However, for all of the other output types, whether or not submission of that output type is required is specified implicitly through whether or not there are any required output_type_id levels. This is kind of a mixing of purposes. If a hub wants to say, "the pmf output type is optional, but if it's submitted then all output type id levels are required", the only way to do that is to split the pmf output type into its own task id block and set the values of some other task id variable in that block to optional. So this can be addressed within our current structure, but it's a little awkward. Maybe we should consider introducing this is_required field to those output types as well, especially if we do away with the optional/required fields for specifying output_type_ids for at least some of the other output_types.

For point 3. ii. above (how to allow the specification of output_type_id order, if we want to allow a mix of optional and required values for the cdf output type), two proposals were put forward:

Add a new "order" entry, with all values in the required and optional lists in the correct order. Here's an example of what this might look like:

                    "output_type": {
                        "cdf": {
                            "output_type_id": {
                                "required": [1, 3, 5, 7, 9],
                                "optional": [2, 4, 6, 8, 10],
                                "order": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
                            }
                        }
                    },

This "order" entry would only be required if the hub had non-null entries for both "required" and "optional".
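A check along the following lines could enforce that rule. This is only a sketch of the proposal as described here, not existing schema or hubValidations behavior, and the function name and messages are hypothetical.

```python
def check_order_entry(output_type_id):
    """Return a list of problems with a proposed "order" entry (empty if none)."""
    # A JSON null is parsed as None, so treat it as an empty list.
    required = output_type_id.get("required") or []
    optional = output_type_id.get("optional") or []
    order = output_type_id.get("order")

    if order is None:
        # "order" is only needed when both lists are non-null.
        if required and optional:
            return ['"order" is needed when both "required" and "optional" are given']
        return []

    problems = []
    if len(set(order)) != len(order):
        problems.append('"order" contains duplicate values')
    if set(order) != set(required) | set(optional):
        problems.append('"order" must contain exactly the required and optional values')
    return problems


spec = {
    "required": [1, 3, 5, 7, 9],
    "optional": [2, 4, 6, 8, 10],
    "order": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
print(check_order_entry(spec))  # []
```

A config with both lists but no "order" entry, or with an "order" that omits one of the listed values, would be reported.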

Replace the separate specification of optional and required values with a list of tuples, with the requirement status recorded alongside each entry. Here's a rough sketch of what this might look like for the above example (maybe we'd want objects with keys here or some other formatting change):

                    "output_type": {
                        "cdf": {
                            "output_type_id": [
                                [1, true], [2, false], [3, true], [4, false], [5, true],
                                [6, false], [7, true], [8, false], [9, true], [10, false]
                            ],
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
                            }
                        }
                    },

We didn't make any decisions about which format we might use, but we thought the first might be easier to implement.
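For comparison, reading the second (tuple) format is mostly a matter of unpacking. In the hypothetical Python sketch below, the order comes directly from the list, and the required and optional sets are derived from the flags.

```python
def split_tuple_spec(pairs):
    """Split [value, is_required] pairs into ordered levels plus the
    required and optional subsets, each kept in listed order."""
    levels = [value for value, _ in pairs]
    required = [value for value, is_required in pairs if is_required]
    optional = [value for value, is_required in pairs if not is_required]
    return levels, required, optional


pairs = [
    [1, True], [2, False], [3, True], [4, False], [5, True],
    [6, False], [7, True], [8, False], [9, True], [10, False],
]
levels, required, optional = split_tuple_spec(pairs)
print(levels)    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(required)  # [1, 3, 5, 7, 9]
print(optional)  # [2, 4, 6, 8, 10]
```

With this format no consistency check between separate lists is needed, at the cost of changing the shape of the existing "output_type_id" entry.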

1 reply

nickreich Sep 18, 2024
Maintainer

Tagging @annakrystalli for input on this.

nickreich (Maintainer) · 2024-09-18T13:48:08Z

@zkamvar is going to investigate other hubs and whether anyone has optional output_type_ids specified.

1 reply

zkamvar Sep 18, 2024
Maintainer

I used this code to do the dirty work. It requires GitHub's CLI utility: https://gist.github.com/zkamvar/4ebfd30e8a758b4df6a1a9100853f1d2

For CDF and PMF outputs, only one repository specifies optional output_type_ids: https://github.com/hubverse-org/flusight_hub_archive

For quantiles, it's a different story. For a lot of the Midas-network hubs, the optional quantiles are set to [0,1]

fields with optional quantile	repository
1	https://github.com/midas-network/covid19-scenario-modeling-hub
4	https://github.com/Testing-Forecast-Actions/ScenarioModellingHub
2	https://github.com/micokoch/simple-hub
13	https://github.com/midas-network/example_round-scenariohub
3	https://github.com/european-modelling-hubs/RespiCompass
2	https://github.com/midas-network/rsv-scenario-modeling-hub
6	https://github.com/midas-network/flu-scenario-modeling-hub
1	https://github.com/Testing-Forecast-Actions/TestingValidations
1	https://github.com/hubverse-org/example-complex-scenario-hub
4	https://github.com/midas-network/covid19-smh-research
2	https://github.com/LucieContamin/example_smh

nickreich (Maintainer) · 2024-09-18T13:50:04Z

General consensus at the 9/18 hubverse meeting was to drop "optional" values of output_type_id for all output_types.

- it isn't a good idea to have them for pmf, as described above.
- it isn't in general a good idea to have them for cdf either.
- having optional values for quantile output_types screws up ensemble and scoring computations.

