Replies: 18 comments 29 replies
-
Noting that we might want to consider standardizing on a version of the data format supported by epidatr.
-
There was further discussion about this on a hubverse development call. We seemed to coalesce around a proposal whereby we would set standards for "Raw Target Data" and "Processed Target Data". Some additional notes on this:

Raw target data format would be an Epidata API-like format (with possible adjustments for hub task-id variable flexibility in column names) that defines the raw target data in a simple way. E.g., for a timeseries project, this would essentially be a dataframe with a timeseries.

Processed target data format would be a data-duplicative hubverse-like format that has one row for each target (and possibly each target-round) combination. So, e.g., this would have the "true" value for each target/forecast-date combination (or something like this). This has the potential to be duplicative: if you are storing the horizon 1 target for time t and the horizon 2 target for time t-1, they would be the same thing. But the advantage of having the data in this way is that this is what would likely need to be passed to scoring functions, plotting functions, or other secondary analysis functions.

It was discussed that we would allow either or both of these formats to be used to store data (we would need some naming conventions). This allows flexibility for the hubs to either write their own code to create the processed target data or to have some pre-defined functions operate on the raw target data.
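To make the raw/processed distinction concrete, here is a minimal sketch in plain Python of expanding a raw time series into one row per origin-date/horizon combination. Everything in it is an assumption for illustration only: the column names (`location`, `time_value`, `origin_date`, `horizon`, `target_date`), the weekly step, and the helper `to_processed` are hypothetical and are not part of any hubverse package or agreed standard.

```python
from datetime import date, timedelta

# Hypothetical raw target data: one row per (location, date) in a weekly series.
raw_target_data = [
    {"location": "US", "time_value": date(2023, 1, 7), "value": 100},
    {"location": "US", "time_value": date(2023, 1, 14), "value": 120},
    {"location": "US", "time_value": date(2023, 1, 21), "value": 90},
]


def to_processed(raw_rows, origin_dates, horizons, step=timedelta(weeks=1)):
    """Expand a raw series into one row per (location, origin_date, horizon)."""
    observed = {(r["location"], r["time_value"]): r["value"] for r in raw_rows}
    locations = sorted({r["location"] for r in raw_rows})
    processed = []
    for location in locations:
        for origin_date in origin_dates:
            for horizon in horizons:
                target_date = origin_date + horizon * step
                key = (location, target_date)
                if key not in observed:
                    continue  # target date not observed yet, so no row
                processed.append({
                    "location": location,
                    "origin_date": origin_date,
                    "horizon": horizon,
                    "target_date": target_date,
                    "value": observed[key],
                })
    return processed


processed = to_processed(
    raw_target_data,
    origin_dates=[date(2023, 1, 7), date(2023, 1, 14)],
    horizons=[1, 2],
)
for row in processed:
    print(row)
```

In this toy example the observation for 2023-01-21 appears twice in the processed output (horizon 2 from the first origin date and horizon 1 from the second), which is the duplication described above.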
-
Noting that somewhere we may want some documentation about how the Epidata API and hubverse concepts map to each other. Here is a start, but we should add to this and flesh it out.
-
Documenting some conversations and things run into with @elray1 @bsweger and @nikosbosse this week about truth data as it relates to scoring via
-
I have some questions about time indices. Above, we've said these would be called
If we do have to choose a name, I'd vote for choosing
-
Noting that I have implemented some of these proposals in the example complex forecast hub, here: https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub I don't consider this to be finalized, so we can change anything we don't like.
-
I dropped some more detailed discussion of how the proposed target data could be used here. It occurred to me that maybe we could put the target data for both formats 2 and 3 in that writeup in one file, with
-
Thank you very much for the proposal, I have some questions/remarks if that's ok:
-
OK, here is a revised/additional write up doing some investigations into the comments that @LucieContamin, @nickreich, and I had on the last iteration. It's quite long, so I will attempt a slightly briefer "summary" here.
It seems to me that we should go with option (b) or (d). Option (a) doesn't work for sample outputs, and option (c) is too computationally involved. The main tradeoff between (b) and (d) is that (b) specifies the observed values very clearly while (d) offers a potentially substantial savings in storage space. I was more hesitant about (d) before, but I think we could mitigate concerns about the usability of that format by:
With those kinds of functions in place, I think the format is clear enough that users could work with it without issue. Of course, we could also say we'll support both, but that requires a little more coding on our part...
-
Combining points from a couple of different threads above so that we have our leading candidate for a final answer in one place:
-
For posterity and general communication purposes, noting that discussion in today's hubverse dev call led to a decision to explore formalizing functions, perhaps in I think the idea was that a hub would not necessarily store the
However, these functions might create as output a data frame with the format described just above.
-
I filed a related issue on hubData here: hubverse-org/hubData#14
-
After a little more thought, I'm still feeling hesitant about going the route of having hubs store code for converting target time series data into target values rather than the target values data, for these reasons:
So, I am wondering if we can find some solutions to the questions that were raised in yesterday's discussions. Here are a few thoughts on two themes that I remember coming up:
However, if others are less daunted than I am by the challenges of defining standards for functions mapping time series to target values, or feel strongly that it's just the right way to go, I'm open to continued discussion about it.
-
Here’s a summary of where I think we are regarding target value data formats:
-
It's on my list to try to condense all of this into a more unified/coherent description of where we landed. In the interim, I am noting that in discussion in threads over at example-complex-forecast-hub and hubExamples, we landed on
-
I've put together a draft write up on target data standards.
-
@elray1, I didn't read every line, but it looks good. I think it might help to make some comparable comments in the hubExamples docs.
-
I’ve been noodling on our ongoing discussions around target data. I’d like to propose that we take an iterative approach to this problem:
This iterative approach to data format definitions has been successful with the model output standards, and I think it makes sense to do that for target data as well. To that end, I’d like to propose that in our “target data standard v 1.0” we use essentially the format that we’ve been discussing in the last few weeks, with some updates to the names of the objects based on our recent discussions. Specifically, I propose that we will have two formats representing target data directly in the first case, or “Oracle predictions” derived from the target data in the second case:
Additionally, for the oracle outputs format, I propose that the column that would have been named
-
As relevant to discussions in #68, we are trying to zero in on a format for truth data.
After discussion today, @elray1, @annakrystalli, and I decided that the following structure for truth data is necessary for the proposed `plot_step_ahead_forecasts()` function:

* `time_idx` column: typically would be a date that could be left_joined with the `target_date` column in `forecast_data`
* `value` column: the value for the target
`target_date`, `origin_date`, or `horizon`]

We also observed that for the purposes of scoring targets that are not step-ahead (e.g., a target that is a fixed once-a-season value), this structure could possibly be simpler, like just the task_id columns that define unique target values.
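The left join between forecasts and truth data described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `prediction` and `observed` column names and the helper `left_join_truth` are hypothetical, and this is not the implementation of `plot_step_ahead_forecasts()`.

```python
from datetime import date

# Truth data in the proposed structure: a time index plus the observed value.
truth_data = [
    {"time_idx": date(2023, 1, 14), "value": 120},
    {"time_idx": date(2023, 1, 21), "value": 90},
]

# Hypothetical forecast data; `prediction` is an illustrative column name.
forecast_data = [
    {"origin_date": date(2023, 1, 7), "horizon": 1,
     "target_date": date(2023, 1, 14), "prediction": 110},
    {"origin_date": date(2023, 1, 7), "horizon": 2,
     "target_date": date(2023, 1, 21), "prediction": 105},
    {"origin_date": date(2023, 1, 7), "horizon": 3,
     "target_date": date(2023, 1, 28), "prediction": 100},
]


def left_join_truth(forecast_rows, truth_rows):
    """Left join: keep every forecast row, attach the truth value if one exists."""
    truth_by_time = {r["time_idx"]: r["value"] for r in truth_rows}
    return [
        {**f, "observed": truth_by_time.get(f["target_date"])}
        for f in forecast_rows
    ]


joined = left_join_truth(forecast_data, truth_data)
for row in joined:
    print(row["target_date"], row["prediction"], row["observed"])
```

Forecast rows whose `target_date` has no matching `time_idx` are kept with `observed` set to `None`, which mirrors left-join semantics for targets that have not been observed yet.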