-
Following up on this to sum up discussion from a couple of different places, including the devteam meeting just now. High-level points:
For point 3. ii. above (how to allow the specification of
This "order" entry would only be required if the hub had non-
We didn't make any decisions about which format we might use, but we thought the first might be easier to implement.
-
@zkamvar is going to investigate other hubs and whether anyone has optional `output_type_id`s specified.
-
General consensus at the 9/18 hubverse meeting was to drop "optional" values of output_type_id for all output_types.
-
For the `cdf` output type, as well as the `pmf` output type for targets with a `target_type` of ordinal, the order of the `output_type_id`s matters for validation, plotting and evaluation. This affects functionality in several of our packages:
`output_type_id` data type is character hubValidations#78.

If the `output_type_id` values are numeric, the order is implied and so there is no problem. But for settings where the `output_type_id` values are strings giving levels of a category, these functions will need a way to find the correct order. There are two issues here:

1. how the hub can specify this order
Our idea is that the hub should list the `output_type_id` values in the correct order in their `tasks.json` file. Here are two examples (both drawn from the example-complex-forecast-hub).

Example 1: an ordinal variable with levels `"low", "moderate", "high", "very high"` might have an entry like the following in their config file:

Example 2: a hub collecting cdf values at numeric values of a target variable might have the following in their config file:
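As a rough illustration of the idea for Example 1 (written in Python only to keep it language-neutral), the sketch below parses a hypothetical config fragment. The key names `output_type_id` and `required` are assumptions about the `tasks.json` layout, not a copy of the hub's actual file. It shows that a JSON array keeps the order in which the hub lists the levels, whereas alphabetical sorting would not recover it.

```python
import json

# Hypothetical fragment of an "output_type" entry in tasks.json.
# The key names are assumptions about the schema, used for illustration only.
pmf_fragment = json.loads("""
{
  "pmf": {
    "output_type_id": {
      "required": ["low", "moderate", "high", "very high"]
    }
  }
}
""")

levels = pmf_fragment["pmf"]["output_type_id"]["required"]

# JSON arrays are ordered, so the listed order survives parsing.
print(levels)          # ['low', 'moderate', 'high', 'very high']

# Alphabetical sorting does not give the intended ordinal order.
print(sorted(levels))  # ['high', 'low', 'moderate', 'very high']
```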
2. how the order can be accessed by hubverse functionality when needed
First, note that the ordering can be extracted from the tasks.json file if needed.
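A minimal sketch of such an extraction follows, in Python for illustration. It assumes the config nests as `rounds` > `model_tasks` > `task_ids` / `output_type`, with `required` and `optional` arrays under each entry. The function name `extract_orders` and the target name `"severity category"` are hypothetical.

```python
def extract_orders(config):
    """Map (target, output_type) -> output_type_id values in listed order.

    Assumes a tasks.json-like dict; the nesting used here is an assumption.
    """
    orders = {}
    for rnd in config.get("rounds", []):
        for task in rnd.get("model_tasks", []):
            target_spec = task.get("task_ids", {}).get("target", {})
            targets = (target_spec.get("required") or []) + (
                target_spec.get("optional") or []
            )
            for output_type in ("pmf", "cdf"):
                spec = task.get("output_type", {}).get(output_type)
                if spec is None:
                    continue
                ids = spec["output_type_id"]
                # If both lists were populated, the combined order would be
                # ambiguous; this sketch simply concatenates them.
                values = (ids.get("required") or []) + (ids.get("optional") or [])
                for target in targets:
                    orders[(target, output_type)] = values
    return orders


# Hypothetical config with one ordinal pmf target.
config = {
    "rounds": [{
        "model_tasks": [{
            "task_ids": {
                "target": {"required": None, "optional": ["severity category"]}
            },
            "output_type": {
                "pmf": {
                    "output_type_id": {
                        "required": ["low", "moderate", "high", "very high"],
                        "optional": None,
                    }
                }
            },
        }]
    }]
}

orders = extract_orders(config)
print(orders[("severity category", "pmf")])
```

Keying the result by both target and output type keeps the orderings apart when a hub collects more than one pmf or cdf target.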
In R, the most natural representation of ordered categorical variables is with an ordered factor. However, because the factor values are stored in the `output_type_id` column, which may contain `output_type_id` values for other targets or output types, it is not feasible to convert the whole column to an ordered factor. Therefore, it seems necessary to store this information separately.

The best option I see for this is to add arguments to any functions that need this information, specifying the order and/or allowing the function to look up the right order. For example, this could take the form of: (1) a hub connection, which contains a reference to the `tasks.json` file, or (2) some kind of manual specification of the order, e.g. as a character vector or numeric vector as appropriate. It may be necessary to include enough structure here to allow for linking the ordering to the target (e.g. if a hub collected two different pmf or cdf targets, we need to know how to look up the right ordering to use).
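To make the two options concrete, here is a small sketch of what such a function interface could look like. It is written in Python purely for illustration; the real packages are in R. All names (`resolve_order`, `sort_by_level`, `orders_by_target`, and the `"severity category"` target) are hypothetical, and the lookup table stands in for whatever a hub connection would provide.

```python
def resolve_order(target, output_type, order=None, orders_by_target=None):
    """Return the level order, preferring a manual `order` over a lookup.

    `orders_by_target` is a hypothetical dict keyed by (target, output_type),
    standing in for information read from the hub's config.
    """
    if order is not None:
        return list(order)
    if orders_by_target is not None:
        return list(orders_by_target[(target, output_type)])
    raise ValueError("supply either `order` or `orders_by_target`")


def sort_by_level(rows, levels):
    """Sort rows (dicts with an 'output_type_id' key) by position in `levels`."""
    rank = {level: i for i, level in enumerate(levels)}
    return sorted(rows, key=lambda row: rank[row["output_type_id"]])


# Made-up pmf rows for one target, deliberately out of order.
rows = [
    {"output_type_id": "high", "value": 0.2},
    {"output_type_id": "low", "value": 0.1},
    {"output_type_id": "very high", "value": 0.3},
    {"output_type_id": "moderate", "value": 0.4},
]

# Option (2): manual specification of the order.
levels = resolve_order(
    "severity category", "pmf", order=["low", "moderate", "high", "very high"]
)
ordered = sort_by_level(rows, levels)
print([row["output_type_id"] for row in ordered])
# ['low', 'moderate', 'high', 'very high']
```

Passing the target alongside the output type is what lets the lookup pick the right ordering when several pmf or cdf targets share one `output_type_id` column.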
issues filed
I've filed issues and/or PRs in the following places related to this:
`output_type_id` levels must be listed in order hubDocs#141
`output_type_id`s for pmf or categorical output types hubUtils#153
`output_type_id` data type is character hubValidations#78