Proposal for target data (a.k.a. "truth data") formats #9

nickreich · 2023-06-06T20:43:15Z

nickreich
Jun 6, 2023
Maintainer

As relevant to discussions in #68, we are trying to zero in on a format for truth data.

After discussion today, @elray1, @annakrystalli and I decided that the following structure for truth data is necessary for the proposed plot_step_ahead_forecasts() function:

time_idx column: typically would be a date that could be left_joined with the target_date column in forecast_data
value column: the value for the target
[a set of columns that match all task_id columns that are not target_date, origin_date or horizon]

We also observed that for the purposes of scoring of targets that are not step-ahead (e.g. a target that is a fixed once-a-season value) this structure could possibly be simpler, like just the task_id columns that define unique target values.
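
A minimal sketch of the join this structure is meant to support, assuming a single extra task-id column called location. The data values and the helper name left_join_truth are made up for illustration and are not part of any hubverse package.

```python
from datetime import date

# Hypothetical truth data: time_idx, value, plus the remaining task-id column (location).
truth = [
    {"time_idx": date(2023, 6, 10), "location": "MA", "value": 120},
    {"time_idx": date(2023, 6, 17), "location": "MA", "value": 135},
]

# Hypothetical forecast data with origin_date, horizon and target_date.
forecast_data = [
    {"origin_date": date(2023, 6, 3), "horizon": 1,
     "target_date": date(2023, 6, 10), "location": "MA", "value": 118.0},
    {"origin_date": date(2023, 6, 3), "horizon": 2,
     "target_date": date(2023, 6, 17), "location": "MA", "value": 140.0},
    {"origin_date": date(2023, 6, 3), "horizon": 3,
     "target_date": date(2023, 6, 24), "location": "MA", "value": 150.0},
]


def left_join_truth(forecasts, truth, task_id_cols=("location",)):
    """Left join: keep every forecast row, attach the truth value when one exists."""
    lookup = {
        (row["time_idx"],) + tuple(row[c] for c in task_id_cols): row["value"]
        for row in truth
    }
    joined = []
    for row in forecasts:
        # time_idx in the truth data is matched against target_date in the forecasts.
        key = (row["target_date"],) + tuple(row[c] for c in task_id_cols)
        joined.append({**row, "truth_value": lookup.get(key)})
    return joined


joined = left_join_truth(forecast_data, truth)
```

The third forecast row has no matching truth row yet, so its truth_value stays None, which is what a left join would give for a target date that has not been observed.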

nickreich · 2023-10-13T17:29:44Z

nickreich
Oct 13, 2023
Maintainer Author

Noting that we might want to consider standardizing on a version of the data format supported by epidatr.

nickreich · 2023-10-25T19:01:51Z

nickreich
Oct 25, 2023
Maintainer Author

There was further discussion about this on a hubverse development call. We seemed to coalesce around a proposal whereby we would set standards for "Raw Target Data" and "Processed Target Data". Some additional notes on this:

Raw target data format would be an Epidata API-like format (with possible adjustments for hub task-id variable flexibility in column names) that defines the raw target data in a simple way. E.g. for a timeseries project, this would essentially be a dataframe with a timeseries.

Processed target data format would be a data-duplicative hubverse-like format that has one row for each target (and possibly each target-round) combination. So, e.g. this would be something that would have the "true" value for each target/forecast-date combination (or something like this). This has the potential to be duplicative: if you are storing the horizon 1 target for time t and the horizon 2 target for time t-1, they would be the same thing. But the advantage of having the data in this way is that this is what would likely be needed to be passed to scoring functions or plotting functions or other secondary analysis functions.

It was discussed that we would allow for either or both of these formats to be used to store data (we would need some naming conventions). This allows hubs the flexibility either to write their own code to create the processed target data or to rely on pre-defined functions that operate on the raw target data.
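
To make the raw versus processed distinction concrete, here is a rough sketch of deriving processed target data from a raw time series. It assumes weekly data, task-id variables location, origin_date and horizon, and a hypothetical helper named process_target_data. None of these names are fixed by the proposal.

```python
from datetime import date, timedelta

# Hypothetical raw target data: one observed value per (location, date).
raw = {
    ("MA", date(2023, 10, 7)): 50,
    ("MA", date(2023, 10, 14)): 62,
    ("MA", date(2023, 10, 21)): 71,
}


def process_target_data(raw, origin_dates, horizons, step=timedelta(weeks=1)):
    """Build one row per (location, origin_date, horizon) with an observed value."""
    locations = sorted({location for location, _ in raw})
    rows = []
    for location in locations:
        for origin in origin_dates:
            for horizon in horizons:
                # The date being predicted is implied by origin_date and horizon.
                target_date = origin + horizon * step
                if (location, target_date) in raw:
                    rows.append({
                        "location": location,
                        "origin_date": origin,
                        "horizon": horizon,
                        "value": raw[(location, target_date)],
                    })
    return rows


processed = process_target_data(
    raw, origin_dates=[date(2023, 9, 30), date(2023, 10, 7)], horizons=[1, 2]
)
```

In this toy example the observation for 2023-10-14 appears twice in the processed data, once as horizon 2 from the first origin date and once as horizon 1 from the second, which is the duplication described above.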

nickreich · 2023-10-25T19:04:56Z

nickreich
Oct 25, 2023
Maintainer Author

Noting that somewhere we may want some documentation about how the Epidata API and hubverse concepts map to each other. Here is a start, but we should add to this and flesh it out.

Epidata API	hubverse	concept
time_value	time_idx	Time associated with the observation
issue		Time at which data were reported
geo_value	[task id cols]
value(?)	value	The observed value column

1 reply

elray1 Oct 25, 2023
Maintainer

A couple more quick notes re: connections to tools developed by cmu/delphi:

The epiprocess package defines epi_df and epi_archive objects.
epi_dfs can have other_keys
epi_archives use a column named version to track versions of data releases.
More documentation in this vignette and this book

nickreich · 2024-01-18T17:36:17Z

nickreich
Jan 18, 2024
Maintainer Author

Documenting some conversations and things run into with @elray1 @bsweger and @nikosbosse this week about truth data as it relates to scoring via hubEvals and scoringutils.

For hubEvals, we opted to create a "processed target data" dataset that basically defines truth data for unique combinations of task-id variables. It's inefficient, but it works, and is kind of necessary for scoring. Noting that in this example, which has horizon and origin_date as task-id variables (but not target_date), the merging in of truth data required creating something like a target_date field from raw truth data that could then be joined on, for which the time_idx variable matches conceptually target_date.
For quantile, mean, median, and sample output-types, truth data can simply have columns for task-id variables and a value column.
For pmf output-types, an additional column that contains the level name is needed. @elray1 and I converged on a suggestion that would involve having the following columns in a processed target data file for pmf data
- [all task-id variables]
- output_type_id (which would house the observed category name)
- value (which would have a 1 in it for the observed data, and implied zeroes in other missing rows)
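
As an illustration of the "implied zeroes" idea for pmf data, the sketch below expands a file that stores only the observed category into one row per category. The category names, the example row, and the helper name expand_pmf_targets are hypothetical. It also assumes each stored row belongs to a distinct combination of task-id values.

```python
def expand_pmf_targets(observed_rows, categories, task_id_cols):
    """Turn 'observed category only' rows into one row per category (1 or 0)."""
    expanded = []
    for row in observed_rows:
        for category in categories:
            new_row = {col: row[col] for col in task_id_cols}
            new_row["output_type_id"] = category
            # 1 for the observed category, 0 for every category not stored in the file.
            new_row["value"] = 1 if category == row["output_type_id"] else 0
            expanded.append(new_row)
    return expanded


observed = [
    {"location": "MA", "target_date": "2024-01-20",
     "output_type_id": "increase", "value": 1},
]
full = expand_pmf_targets(
    observed,
    categories=["decrease", "stable", "increase"],
    task_id_cols=["location", "target_date"],
)
```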

elray1 · 2024-01-30T19:56:45Z

elray1
Jan 30, 2024
Maintainer

I have some questions about time indices. Above, we've said these would be called time_idx.

How critical is it that we choose a specific name for this? What are the pieces of functionality where we need this name to be standardized?
- note: I don't think it's necessary for hubEvals.
If we do want/need to choose a specific name, do we want it to be time_idx? Alternatives that are out there include:
- epidata API (CMU/Delphi): time_value
- tsibble: the name of the column(s) can be anything you like, but the function to get this info is called index.
- pandas: the datetime info is recorded as a data frame's index, where the index has classes like DatetimeIndex or PeriodIndex.

If we do have to choose a name, I'd vote for choosing time_value so that we are a little more consistent with the epidatr tooling. But could we get away with just not specifying this name?

1 reply

LucieContamin Jan 31, 2024
Maintainer

I don't know about all the functionality, but for hubVis the column name is a parameter of the function and can easily be changed to any name.
I don't think we need to force a specific name for this, but we might want to pick one and use it as the default? I am okay with time_value

elray1 · 2024-02-02T15:37:51Z

elray1
Feb 2, 2024
Maintainer

Noting that I have implemented some of these proposals in the example complex forecast hub, here: https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub

I don't consider this to be finalized, so we can change anything we don't like.

elray1 · 2024-02-10T02:46:23Z

elray1
Feb 10, 2024
Maintainer

I dropped some more detailed discussion of how the proposed target data could be used here.

It occurred to me that maybe we could put the target data for both formats 2 and 3 in that writeup in one file, with output_type_id being NA for the format 2 rows, and just "know" that we need to join by the appropriate set of columns depending on the output type...?

1 reply

nickreich Feb 12, 2024
Maintainer Author

Yeah, I had the same thought about formats 2 & 3 - they are basically the same format. And that would align (+ some more important details) with where we were with thinking about "raw" (format 1) and "processed" (formats 2&3) target data from a few posts above.

LucieContamin · 2024-02-20T19:44:45Z

LucieContamin
Feb 20, 2024
Maintainer

Thank you very much for the proposal, I have some questions/remarks if that's ok:

For the format 1 (raw): I am wondering about the "target" information. If we have one target, I am ok with having the target (for example "hospitalization") information in the name of the file, but it might become complex if we have multiple targets. Would it be acceptable to add an (optional) target column?
For the format 2 & 3 (processed):
- As I understand, we recommend to use enough task id columns to uniquely identify an observed target value for each row of model output with the output_type_id being in the join or not depending on the need.
- Is it the same logic for the output_type column? Meaning that we include one if necessary in the target data and use it on join or not depending on the need of the Hub.
- The "drop 0 rows" convention is a bit dangerous, I agree. I would prefer to keep the 0s in the file.

2 replies

elray1 Mar 12, 2024
Maintainer

Realizing that I didn't directly respond to these points in my comment below. Here are a few thoughts:

I wasn't really thinking about file names carefully in my earlier take. My updated proposal is to just call the file target-values.csv (or .parquet) and yes, to include the target as a column in the file contents if relevant to the hub.
2 & 3
- what you describe is where i started, but this is updated in my comment below to just include all the task ids. I looked at ideas for effectively including fewer task id variables as necessary in different rows within a single target-values file, and the resulting computational burden was large... and I like the idea of having a single unified file for the target values. However, we could reconsider:
  - going back to multiple files with different sets of task id variables to see if the computational burden is any better than what i had tried (my guess is that it would not be though...). I'm somewhat unenthusiastic about this idea.
  - allowing a hub to provide one target values data set with a reasonable subset of the task id and output metadata columns. e.g., if a hub collects reference_date and horizon but those are never relevant to a join statement, they could be omitted. And similarly, if a hub only uses mean and sample output types, for example, then join statements would not need any output metadata from the output_type or output_type_id columns. So those could be eliminated. I'd be on board with this idea.

LucieContamin Mar 12, 2024
Maintainer

Thanks for the detailed answers and for the update. I will reply to the other response, but I am following up here on the idea of having fewer task id variables as necessary.
I like your "allowing a hub to provide one target values data set with a reasonable subset of the task id and output metadata columns" idea for some cases, like sample output types. I would also like to add that some hubs might have a scenario_id column or other columns that might not be necessary, on top of the reference_date, horizon and/or output_type and output_type_id columns, so they might like the idea of being able to "eliminate" these variables from the target data.

elray1 · 2024-03-11T23:23:18Z

elray1
Mar 11, 2024
Maintainer

OK, here is a revised/additional write up doing some investigations into the comments that @LucieContamin, @nickreich, and I had on the last iteration. It's quite long, so I will attempt a slightly briefer "summary" here.

Overall, we think it makes sense for hubs to provide two versions of target data: (1) time series format; and (2) observed target values, sufficient to join into a model_out_tbl to end up with an observation for each row of model output. Format (1) is useful for handing off to modelers and for some plots, while format (2) is useful for scoring and for some plots.
I considered four possible formats for the second of these, with target values:
- (a) "Complete": Basically, a submission file that would have been generated by an "Oracle" that knew the eventually-observed outcomes ahead of time.
  - This file has every combination of task id variables, output types, and output type ids. This means that many observed values are repeated, e.g. for different combinations of reference_date and horizon that yield the same target_end_date, and for different quantile levels in quantile forecasts.
  - When joining/merging these target values into the model outputs, we can therefore merge on all task id variables as well as "output_type" and "output_type_id"
  - This data format is verbose, but the join operation is fast: about 0.2 to 0.3 seconds in my example.
  - However, note that this format and join strategy does not work for output type "sample"! For that output type, we have said that we will (eventually) allow modelers to use any string that they want for the sample index recorded in output_type_id. That means that a data file with target values can't know in advance what it should list in the output_type_id column in order for a join on output_type_id to be successful.
- (b) Reduces duplication where the same value is repeated for all quantile levels by introducing a convention where "*" is used in the output_type_id column of the target values data set, indicating that it should match all quantile levels.
  - Now, we do joins separately in two groups of output_types and then combine the results. The first group has output_types including mean, median, quantile, and eventually sample. For this group, we join on the task id variables and "output_type", but not "output_type_id". The second group has the pmf and cdf output types, where the value depends on the output_type_id. For this group, we join on the task id variables, "output_type", and "output_type_id".
  - This join operation is still fairly fast: i noticed run times of about 0.4 to 0.6 sec across different runs of my example
  - It also works for all output types, hurray!
- (c) Reduces duplication where the same value is repeated across different combinations of values for the task id variables, like different reference_dates and horizons with the same target_end_date. I introduced a data format here and tried a couple of different strategies for implementing the required join operation. But the join operation is complex, and the fastest option I tried took ~3 seconds, an order of magnitude longer than the joins with the other data formats I tried. Although this seemed pleasing in terms of having a parsimonious data representation, the computational burden is too high, so I reject this option.
- (d) Going back to (b) as a starting point, additionally reduces duplication for pmf and cdf output types by keeping track of target values only for the observed category (for pmf) or the smallest cdf evaluation point where the cdf was equal to 1 (for cdf).
  - As with option (b), we have to do joins separately for different groups of output types. But now we handle pmf and cdf types separately, as there is different postprocessing required after the join. An initial join results in a value of 1 for the observed category or next cdf evaluation point above the observation, and NA for all other categories or cdf evaluation points. For the pmf output type, we then check within each group defined by values of task id variables to see if any values are non-missing within that group. If so, we set the missing values to 0. For the cdf output type, we do those operations and then sort by the output_type_id and take a cumulative sum to get to cdf values.
  - Even with those extra steps, this operation is still pretty quick, comparable in run time to the merge operation with format (b).
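
The postprocessing described for option (d) can be sketched roughly as follows. This is not the implementation from the write-up: the helper name join_option_d, the column layout, and the example values are assumptions, and the cdf evaluation points are assumed to be numeric so that they sort correctly.

```python
from itertools import accumulate


def join_option_d(model_rows, target_rows, task_id_cols, output_type):
    """Join for pmf or cdf when only the observed category/point is stored."""
    lookup = {
        tuple(t[c] for c in task_id_cols) + (t["output_type_id"],): t["value"]
        for t in target_rows
        if t["output_type"] == output_type
    }
    # Initial join: 1 for the stored row, None (missing) everywhere else.
    groups = {}
    for m in model_rows:
        if m["output_type"] != output_type:
            continue
        group_key = tuple(m[c] for c in task_id_cols)
        observation = lookup.get(group_key + (m["output_type_id"],))
        groups.setdefault(group_key, []).append({**m, "observation": observation})

    result = []
    for rows in groups.values():
        # Only fill in zeroes when the group has at least one observed value.
        if any(r["observation"] is not None for r in rows):
            for r in rows:
                if r["observation"] is None:
                    r["observation"] = 0
            if output_type == "cdf":
                # Sort by evaluation point, then cumulative sum gives cdf values.
                rows.sort(key=lambda r: r["output_type_id"])
                cumulative = list(accumulate(r["observation"] for r in rows))
                for r, total in zip(rows, cumulative):
                    r["observation"] = total
        result.extend(rows)
    return result


# Hypothetical example: observed category "moderate"; observed value 23, so the
# smallest cdf evaluation point at or above the observation is 30.
targets = [
    {"location": "MA", "output_type": "pmf", "output_type_id": "moderate", "value": 1},
    {"location": "MA", "output_type": "cdf", "output_type_id": 30, "value": 1},
]
model = (
    [{"location": "MA", "output_type": "pmf", "output_type_id": c, "value": 1 / 3}
     for c in ("low", "moderate", "high")]
    + [{"location": "MA", "output_type": "cdf", "output_type_id": x, "value": x / 40}
       for x in (10, 20, 30, 40)]
    + [{"location": "NY", "output_type": "pmf", "output_type_id": "low", "value": 1.0}]
)
pmf_joined = join_option_d(model, targets, ("location",), "pmf")
cdf_joined = join_option_d(model, targets, ("location",), "cdf")
```

Groups with no stored target row (location NY here) keep a missing observation instead of being filled with zeroes, which matches the "if any values are non-missing within that group" check.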

It seems to me that we should go with option (b) or (d). Option (a) doesn't work for sample outputs, and option (c) is too computationally involved. The main tradeoff between (b) and (d) is that (b) specifies the observed values very clearly while (d) offers a potentially substantial savings in storage space.

I was more hesitant about (d) before, but I think we could mitigate concerns about the usability of that format by:

introducing something like a merge_outputs_and_targets (or something like that -- maybe merge_mdl_outputs_target_values?) function in hubData so that users of hub data don't have to worry about whether they've done this right. We should write this function no matter which format we decide on.
introducing something like a validate_target_data function in hubAdmin that checks that the target data are set up correctly (for pmf/cdf types, only one row per group defined by task id variables, entries in the value column for those rows are all 1)
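
The checks listed for validate_target_data could look something like the sketch below. This is only an illustration of the two checks named above, not an existing hubAdmin function. The row layout and the error messages are assumptions.

```python
from collections import defaultdict


def validate_target_data(target_rows, task_id_cols):
    """Return a list of error messages; an empty list means the checks passed."""
    errors = []
    rows_per_group = defaultdict(int)
    for row in target_rows:
        # Only pmf and cdf rows are subject to these checks.
        if row["output_type"] not in ("pmf", "cdf"):
            continue
        group = (row["output_type"],) + tuple(row[c] for c in task_id_cols)
        rows_per_group[group] += 1
        if row["value"] != 1:
            errors.append(f"value must be 1 for {group}, found {row['value']!r}")
    for group, count in rows_per_group.items():
        if count > 1:
            errors.append(f"expected one row for {group}, found {count}")
    return errors


good = [
    {"location": "MA", "output_type": "pmf", "output_type_id": "high", "value": 1},
    {"location": "MA", "output_type": "quantile", "output_type_id": None, "value": 42},
]
bad = good + [
    {"location": "MA", "output_type": "pmf", "output_type_id": "low", "value": 0},
]
good_errors = validate_target_data(good, ["location"])
bad_errors = validate_target_data(bad, ["location"])
```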

With those kinds of functions in place, I think the format is clear enough that users could work with it without issue.

Of course, we could also say we'll support both, but that requires a little more coding on our part...

3 replies

nikosbosse Mar 11, 2024

Have you ever looked into using e.g. parquet for storing? It might be the case that file sizes for b) and d) don't differ much when stored as parquet.

elray1 Mar 12, 2024
Maintainer

Good thought. I haven't looked into that specifically for the pmf/cdf output types here, but I was looking into parquet file sizes earlier today in the context of samples. My general take away in those investigations was that even in parquet format, cutting down the number of rows stored could have a very meaningful impact on storage size.

That said, the place where this will matter in terms of "can a hub go forward" will probably be in model output files, not target value files: there are more model output files than target files, and their contents basically match.

But I guess the same considerations re: representations of outputs also apply to model outputs, so we could try to do some investigations into how large model output files would get using (b) and (d) kinds of ideas for their bin probability representations, and use that as a basis for decision making about what formats we support for those output types, rather than sizes of target value data.

elray1 Mar 12, 2024
Maintainer

As a quick exploration of this, I looked at one forecast/model output file using the (fairly coarse) set of 100 cdf evaluation points, filtering out the cdf rows where the value in that row minus the value in the previous row was less than 1e-10. For the particular example I looked at, this dropped the cdf forecasts from 21200 rows (for different combinations of location and horizon) to 13122. The resulting parquet files are 145kb and 134kb. This seems like an indication that all of this thinking about reduced representations may not be worth the effort?
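
For reference, the filter described here can be sketched as below for a single location/horizon group. The helper name drop_flat_cdf_rows and the example values are made up, and treating the value before the first row as 0 is an assumption about how the first evaluation point was handled.

```python
def drop_flat_cdf_rows(cdf_rows, tol=1e-10):
    """Keep only rows whose cdf value rises by at least tol over the previous row.

    cdf_rows: list of (evaluation_point, cdf_value) pairs for one task-id group,
    sorted by evaluation point.
    """
    kept = []
    previous = 0.0  # assumed cdf value before the first evaluation point
    for point, value in cdf_rows:
        if value - previous >= tol:
            kept.append((point, value))
        # Compare against the previous row of the original data, kept or not.
        previous = value
    return kept


cdf_rows = [(0, 0.0), (1, 0.0), (2, 0.25), (3, 0.25), (4, 0.9), (5, 1.0), (6, 1.0)]
reduced = drop_flat_cdf_rows(cdf_rows)
```

For the numbers quoted above, the row count falls by about 38% (21200 to 13122) while the parquet file shrinks by only about 8% (145kb to 134kb).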

elray1 · 2024-03-13T02:13:36Z

elray1
Mar 13, 2024
Maintainer

Combining points from a couple of different threads above so that we have our leading candidate for a final answer in one place:

hubs will provide target data in time series format, stored in target-data/time-series.csv or target-data/time-series.parquet.
- so far, we are not being very prescriptive about the format of this, but perhaps something similar to the layout of a tsibble. "fairly long", but if multiple indicators are observed then they might be stored in columns.
hubs will either provide target values stored in target-data/target-values.csv or target-data/target-values.parquet, or functions to compute target values from the time series data. (We have not fully settled on whether a standard is needed here, but maybe not.) In either case, the format of a target values data frame (as stored in the hub or as returned by a function) will be rectangular and "long".
- This should include enough of the task id variables and columns with metadata about the outputs (output_type, output_type_id) to uniquely identify observations; if all prediction targets are observed, a call like left_join(model_outputs, target_values) should place an observed value in each row of the model_outputs and it should not be the case that multiple rows of target_values match a single row of model_outputs.
- Any unnecessary task id variables can be removed, e.g. if target_date is included then reference_date and horizon can be omitted, and scenario_id can be omitted.
- The output_type and output_type_id columns only need to be included if the hub collects pmf or cdf outputs. If the hub collects quantile or sample outputs alongside pmf or cdf outputs, the target values data set will include an output_type_id column, but the value of that column will be ignored when merging target values with sample or quantile model outputs. Therefore, the target values should only include one row with the observed value for the quantile or sample forecasts in each task id group, rather than one row for each quantile level or for each sample index. The output_type_id column in those rows may be set to NA as a representation of the fact that that value does not contain information about the quantile level or sample index specified as the output_type_id in the model outputs.
- The observed value of the target will be stored in a column called "observation"
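
A rough sketch of the merge this format is designed for, under some assumptions: target values carry an output_type column, rows for quantile or sample outputs have a missing output_type_id (None here), and the helper name merge_observations is hypothetical. The join uses output_type_id only for pmf and cdf rows, and it fails loudly if more than one target row would match.

```python
ID_DEPENDENT_TYPES = {"pmf", "cdf"}


def _join_key(row, join_cols):
    """Join on the shared task-id columns and output_type; add output_type_id for pmf/cdf."""
    key = tuple(row[c] for c in join_cols) + (row["output_type"],)
    if row["output_type"] in ID_DEPENDENT_TYPES:
        key += (row["output_type_id"],)
    return key


def merge_observations(model_outputs, target_values, join_cols):
    """Left join target values into model outputs as an 'observation' column."""
    lookup = {}
    for target in target_values:
        key = _join_key(target, join_cols)
        if key in lookup:
            # Multiple target rows must never match a single model output row.
            raise ValueError(f"multiple target rows match {key}")
        lookup[key] = target["observation"]
    return [
        {**row, "observation": lookup.get(_join_key(row, join_cols))}
        for row in model_outputs
    ]


# Hypothetical data: target_date is kept, reference_date and horizon are omitted
# from the target values because they are not needed for the join.
target_values = [
    {"location": "MA", "target_date": "2024-03-16", "output_type": "quantile",
     "output_type_id": None, "observation": 42},
    {"location": "MA", "target_date": "2024-03-16", "output_type": "pmf",
     "output_type_id": "low", "observation": 0},
    {"location": "MA", "target_date": "2024-03-16", "output_type": "pmf",
     "output_type_id": "high", "observation": 1},
]
model_outputs = [
    {"location": "MA", "reference_date": "2024-03-09", "horizon": 1,
     "target_date": "2024-03-16", "output_type": output_type,
     "output_type_id": output_type_id, "value": value}
    for output_type, output_type_id, value in [
        ("quantile", 0.25, 35.0), ("quantile", 0.75, 50.0),
        ("pmf", "low", 0.3), ("pmf", "high", 0.7),
    ]
]
merged = merge_observations(model_outputs, target_values, ["location", "target_date"])
```

Because the observed value lands in its own observation column, the model outputs' value column is left untouched.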

10 replies

LucieContamin Mar 27, 2024
Maintainer

NA is complex; I like * because it's easier to implement. I will need to think more about how to use NA. For Python, here is more documentation.

elray1 Mar 28, 2024
Maintainer

Noting that I have made 2 edits to the comment at top of this thread:

for now, sidestepping the question of whether hubs will save these in the hub and/or provide functions to compute these values; we should be able to agree on the data format without saying how a hub will make the data available.
decided to call the column "observation" rather than value, to avoid collision with the "value" column in model outputs.

elray1 Mar 29, 2024
Maintainer

I'm now leaning more toward using NA rather than "*" as the convention for what to put in output_type_id for sample and quantile target values. Three points:

The advantage to using NA is that it is more flexible in terms of data type, i.e., it can be used regardless of whether output_type_id should be treated as character or numeric. In contrast, "*" would force the output_type_id column to be treated as character. This could be inconvenient. For example, if a hub collects quantile and cdf forecasts and specifies numeric evaluation points for the cdf, then in the model outputs the output_type_id column would be numeric and it would be better if the data type was the same for that column in the target values data set.
I agree with Lucie's comments that NA is more complex -- however, I think this complexity would not have any impact on what we do with these target values. Currently, the main/only plan for what we would do with target values is to join them into model outputs. However, for that join operation the proposal is to split by output_type and join on the output_type_id only when the output_type is pmf or cdf. In other words, when the output type is quantile or sample, we will ignore the contents of the output_type_id in the target values data set (e.g., see the draft implementation here, where "output_type_id" doesn't show up in the join by specification when the output_type is "quantile"). That means that the complexities of NA will not have a practical impact on our use of the target values data set.
I thought Lucie's examples about weird behavior with NULL in R were pretty compelling.

LucieContamin Mar 29, 2024
Maintainer

It makes sense, I am now leaning more toward NA too

annakrystalli Apr 2, 2024
Maintainer

Quick and belated two cents from me here.

I think there are two elements to this question: how these effectively missing values are represented within R and how they are represented within hub target data files:

In R I think it should definitely be NA.
In files is the more interesting question.
- In general when we teach data management to first year PhDs we always discourage them from using custom missing value codes like "*", so I would be against that.
- Also NULL is different to NA and does not really work in tabular data.
- In parquet, missing values are tracked through the file's metadata and will always end up as NAs when read into R.
- Having said all that NA is R centric and @LucieContamin has raised some really interesting questions about using it for encoding missing values in an ecosystem which we want to eventually work with python too.
- I believe this is only an issue in csv files (or json files) as parquet will also provide a sensible value when read into python.
- As such, I think the default values for encoding NAs when reading csv files in readr::read_csv() and arrow::read_csv_arrow() are very instructive. They both will by default convert "NA" or "" to NAs BEFORE doing any column data type determination. Hence, these string encodings of missing values do not affect actual column data types. As the default behaviour of our csv reading functions will convert any of "NA" or "" to missing values, I suggest we also accept both in target data csvs. When read in, any such values will automatically be parsed to NAs in R, but also accepting "" provides some leeway for folks that might want to make working in python a bit easier.
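
On the python side, the same "missing markers first, types second" order can be sketched with the standard library alone. This is an analogue of the behaviour described above, not a description of what readr or arrow do internally. The helper names and the example csv content are hypothetical.

```python
import csv
import io

MISSING_MARKERS = {"", "NA"}


def read_target_csv(text):
    """Read csv text, turning "NA" or "" into None before any type handling."""
    rows = []
    for raw_row in csv.DictReader(io.StringIO(text)):
        rows.append({
            column: (None if cell in MISSING_MARKERS else cell)
            for column, cell in raw_row.items()
        })
    return rows


def column_is_numeric(rows, column):
    """Decide the column type from non-missing cells only."""
    try:
        for row in rows:
            if row[column] is not None:
                float(row[column])
    except ValueError:
        return False
    return True


csv_text = (
    "location,output_type,output_type_id,observation\n"
    "MA,quantile,NA,42\n"
    "NY,quantile,,37\n"
    "MA,cdf,30,1\n"
)
rows = read_target_csv(csv_text)
numeric_ids = column_is_numeric(rows, "output_type_id")
```

Both spellings of a missing output_type_id end up as None, and the column can still be treated as numeric because the missing markers are removed before types are inferred.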

elray1 · 2024-03-13T19:25:30Z

elray1
Mar 13, 2024
Maintainer

For posterity and general communication purposes, noting that discussion in today's hubverse dev call led to a decision to explore formalizing functions, perhaps in hubData, for standardizing transformations of time series data into common target values.

I think the idea was that a hub would not necessarily store the target_values data set, but would rather provide functionality for computing it from the time series data. Hubverse tools might be able to provide functions for those calculations in many common use cases, including:

discretized/binned values, on either an absolute scale or relative to a recent observation
things about season peak timing and intensity
handling alignment between dates like reference_date/origin_date, horizon, and target_date

However, these functions might create as output a data frame with the format described just above.

elray1 · 2024-03-13T20:22:46Z

elray1
Mar 13, 2024
Maintainer

I filed a related issue on hubData here: hubverse-org/hubData#14

elray1 · 2024-03-14T13:33:42Z

elray1
Mar 14, 2024
Maintainer

After a little more thought, I'm still feeling hesitant about going the route of having hubs store code for converting target time series data into target values rather than the target values data, for these reasons:

Defining target value data standards is a subset of the problem of defining standards for conversion functions. To define standards for conversion functions, we have to settle on the format of the target data they will output, but additionally what inputs they will accept and in what context they will run. For example, conversion functions might want access to hub config files (tasks.json in particular), auxiliary data (e.g. to access binning thresholds per location), and time series data. Do we pass those as arguments, or do we pass something like a hub connection object with functionality to access those things in a standardized way across locally-cloned and cloud-based hubs? It seems like we would also want some arguments like which subsets of target variable values and/or output types we want the function to produce target values for. Defining and implementing this feels possible, but like a decent amount of work that lengthens the path to usable functionality. If we put the responsibility for this setup on hub administrators, it lets them determine what data they need access to and how they will set that up.
Managing code feels like a harder problem than managing data. What are the code's dependencies? Where do unit tests live? Where does the code itself live -- in the main hub repository? In a separate package specified by the hub administrators? Again, these feel like questions that could be answered, but getting to good answers feels like a process. Imagining that someday we might want to have functionality like a template dashboard for forecast hubs that includes some out-of-the-box evaluations, we would need to have a generalized way of specifying the answers to questions like code dependencies in order to be able to run those evaluations -- but if downstream tools can rely on the ability to pull data from files, generating evaluations automatically becomes an easier task.
Providing data rather than code to create the data is more language-agnostic. For example, if a hub provides an R script to map time series into target values, users need to be able to run that R script as a part of their workflow. A hub that wants to better support people working in other languages will need to implement their conversion functions in multiple languages.

So, I am wondering if we can find some solutions to questions that were raised in yesterday's discussions. Here are a few thoughts on two themes that I remember coming up:

The variables that need to be used may differ by target or output type, or by modeling round.
- Example 1: A hub starts off with task id variables like (location, reference_date, and target_date), but in later rounds they add in age_group.
- Example 2: A hub collects some predictions in sample format and some in cdf format.
- My new thoughts since yesterday: Maybe we could say that the target values data live under the target-data directory, either as a single csv/parquet file or using a hive partitioning strategy where target values are potentially split up by any combination of task id variables or output_type as needed/desired by hub administrators. This could include creating separate target value data sets per modeling round or per output type if necessary.
Storing both time series data and target values data may be duplicative and bulky.
- Example 1: A simple hub collects quantile forecasts by location, reference date, and target date. In this case, their time series data and their target values data would be duplicates, perhaps up to column names (e.g., the time series file might have a column named date while the target values file might have a column named target_date, but all values would be the same).
- Example 2: A hub collects sample forecasts by location, reference date, and horizon. In this case, their target values data set would have the same value repeated for different combinations of reference date and horizon that correspond to the same target date. This could get bulky.
- My new thoughts since yesterday:
  - One note is that the size of these target value files is capped at the size of the collected model output files from one contributing model. In the example where a hub collects samples, if the hub collects 100 samples from each model, the target values data set will have 1% of the number of rows in the model output files, because we do not need to record the observed values for each sample index. Because model outputs are split up by round, individual model output files may be smaller than collected target values -- but, e.g., if using a target value partitioning strategy based on round id, the target value files for each round would be no larger than a single model's submissions.
  - It may be worth paying a cost in duplicative data storage in exchange for the relative simplicity of managing data rather than code. I acknowledge that there's certainly a trade off here.
  - Maybe we should be encouraging hubs to include target date as a task id variable, since this facilitates both plotting step-ahead forecasts and aligning with target values for scoring.

However, if others are less daunted than I am by the challenges of defining standards for functions mapping time series to target values, or feel strongly that it's just the right way to go, I'm open to continued discussion about it.

4 replies

nickreich Mar 14, 2024
Maintainer Author

I think your concerns are valid, and important. I agree that maintaining code feels very daunting.

None of the target data are "required" for a hub, correct? As in, we will lay out specifications for how target data would be created (and perhaps validated) if it were present, but a hub will not cease to work if the data are not there. I guess I'm getting at a situation where maintaining both files is cumbersome (or not needed), and then a hub could just choose not to do it, right?

Maybe a related point is that if/when we implement these kinds of systems and validations, we might want some switch that a hub admin could flip to turn validations for target data on or off. E.g. if a hub already had a system for target data that they didn't want to update, or if they had a reason to do it a different way.

elray1 Mar 14, 2024
Maintainer

this is a great point. Maybe a good option is to say, "here is the format of target values data that is required by hubData::merge_outputs_and_target_values (shorter name needed). People who want to use that functionality as a first step toward scoring will need to get data in that format one way or another, good options include storing as part of the hub and/or creating open functions to create these data." And leave it at that.

...with maybe a later addendum saying "if you want to have an out-of-the-box eval dashboard you'll have to tell it how to get the target values data [through one or two standard ways we might settle on later, specific to that tool]"

...or we could just skip all of this and say, 'for scoring you'll need to add a column called "observation" to your model outputs'.

nickreich Mar 14, 2024
Maintainer Author

Another thought is that as a kind of compromise here, we could choose to create and maintain a function (as part of hubAdmin? or hubData?) that could translate data from the timeseries format to the target-values format for basic, common, straight-forward cases?
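As a rough illustration of what such a translation could look like in the simplest step-ahead case, here is a sketch in plain Python. The function name, the column names, and the rule `target_date = origin_date + horizon * 7 days` are assumptions for this example only; this is not an existing hubAdmin or hubData function.

```python
from datetime import date, timedelta

def step_ahead_targets(time_series, origin_dates, horizons, step_days=7):
    """Map a time series to one row per (location, origin_date, horizon).

    Hypothetical helper. Assumes the target date is
    origin_date + horizon * step_days, and skips combinations whose
    target date has not been observed yet.
    """
    observed = {(r["location"], r["date"]): r["observation"] for r in time_series}
    locations = sorted({r["location"] for r in time_series})
    rows = []
    for loc in locations:
        for origin in origin_dates:
            for h in horizons:
                target_date = origin + timedelta(days=h * step_days)
                if (loc, target_date) in observed:
                    rows.append({
                        "location": loc,
                        "origin_date": origin,
                        "horizon": h,
                        "target_date": target_date,
                        "observation": observed[(loc, target_date)],
                    })
    return rows

time_series = [
    {"location": "US", "date": date(2024, 1, 6), "observation": 10},
    {"location": "US", "date": date(2024, 1, 13), "observation": 12},
    {"location": "US", "date": date(2024, 1, 20), "observation": 15},
]
targets = step_ahead_targets(
    time_series,
    origin_dates=[date(2023, 12, 30), date(2024, 1, 6)],
    horizons=[1, 2],
)
for row in targets:
    print(row)
```

Note that the observation for 2024-01-13 appears twice (horizon 2 from the first origin date and horizon 1 from the second), which is the kind of duplication discussed earlier in this thread.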

nickreich Mar 14, 2024
Maintainer Author

Noting that this is suggested already by an issue here: hubverse-org/hubData#14

elray1 · 2024-03-28T19:08:29Z

elray1
Mar 28, 2024
Maintainer

Here’s a summary of where I think we are regarding target value data formats:

- I believe that we have essentially agreed on an expected format for target values, described in this comment. Last call for additional thoughts on this format! Or requests to review it. Or requests for things that would help in reviewing it.
- We haven’t settled on whether we expect a hub to drop these target values into a hub (e.g. in csv or parquet format), or provide functions to compute the target values -- and maybe at least for now, we don’t need to be too prescriptive about this as long as a hub provides some way to get target values in this format.
- The example-complex-forecast-hub has an example of the target values that is aaaalmost in the specified format here (the only thing that needs updating is naming the column of observed values "observation" instead of "value"; see this issue).
- If a data frame of target values has been provided in that format, it can be merged into model output data, adding a column to the model outputs called "observation" that can be used for purposes of computing scores, etc. I have filed an issue here proposing to add a function that does this merge/join operation to hubData.
- For purposes of functions in hubEvals, we could assume/require that the user has already merged their target values into their model outputs. So functions in that package would expect as input something like a model_output_tbl that has been augmented with the one additional "observation" column. Development for that package could proceed on that assumption.
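The merge described above can be sketched in plain Python as a left join on the columns the two tables share. This is only an illustration under assumed column names; `merge_observations` is a hypothetical helper, not the hubData function proposed in the issue.

```python
def merge_observations(model_output, target_values):
    """Left-join target values onto model output rows.

    Hypothetical sketch. Joins on every column of target_values other
    than "observation"; model output rows with no matching target
    value get observation = None.
    """
    if not target_values:
        return [dict(row, observation=None) for row in model_output]
    keys = [k for k in target_values[0] if k != "observation"]
    lookup = {tuple(t[k] for k in keys): t["observation"] for t in target_values}
    return [
        dict(row, observation=lookup.get(tuple(row[k] for k in keys)))
        for row in model_output
    ]

model_output = [
    {"model_id": "a", "location": "US", "target_date": "2024-01-13",
     "output_type": "quantile", "output_type_id": 0.5, "value": 11.0},
    {"model_id": "a", "location": "US", "target_date": "2024-01-20",
     "output_type": "quantile", "output_type_id": 0.5, "value": 14.0},
    {"model_id": "a", "location": "MA", "target_date": "2024-01-13",
     "output_type": "quantile", "output_type_id": 0.5, "value": 3.0},
]
target_values = [
    {"location": "US", "target_date": "2024-01-13", "observation": 12},
    {"location": "US", "target_date": "2024-01-20", "observation": 15},
]

merged = merge_observations(model_output, target_values)
for row in merged:
    print(row["location"], row["target_date"], row["value"], row["observation"])
```

Each merged row keeps its original "value" column and gains an "observation" column, so a scoring function can read both from the same row.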

2 replies

nickreich Mar 28, 2024
Maintainer Author

The last bullet point is related to hubverse-org/hubEvals#13

annakrystalli Apr 3, 2024
Maintainer

Just a quick note that I belatedly added a minor comment for your consideration: https://github.com/orgs/Infectious-Disease-Modeling-Hubs/discussions/9#discussioncomment-8982822

elray1 · 2024-08-16T18:34:16Z

elray1
Aug 16, 2024
Maintainer

It's on my list to try to condense all of this into a more unified/coherent description of where we landed. In the interim, I am noting that in discussion in threads over at example-complex-forecast-hub and hubExamples, we landed on target_observations as the name of the data set with observed values for each prediction target, rather than target_values. The main impetus in using observations in data sets of observed values is to avoid collision with the word value in model output files. This avoids problems with two columns named value when merging the data frames.

0 replies

elray1 · 2024-08-29T22:26:07Z

elray1
Aug 29, 2024
Maintainer

I've put together a draft write up on target data standards.

- the document is on github here
  - I wasn't sure where this should actually live.
- you can see an html preview here (though the table renderings are garbled)

2 replies

elray1 Oct 7, 2024
Maintainer

Here are some comments from @nikosbosse on the above document:

I also made some notes when I read https://htmlpreview.github.io/?https://github.com/reichlab/hub-infrastructure-experiments/blob/main/eval-examples/target_data_formats.html. These are mostly about "how easy is it to understand what's happening from this doc". I started making these notes without having looked at the discussion on output formats for 6 months. So, in some sense, they represent the initial reactions of a mostly-uninformed outsider reading the documentation for the first time - not sure it is actually meant as an introduction to an outside reader.

Notes:

it's not immediately clear to me what a "target" data format is. My understanding is that "target" data means data with observed values. Maybe it makes sense to say that explicitly, e.g. "we call the data with observations 'target data', as the observations represent the prediction target".
The terms "Time series data in “long” format:" and "Target observations" don't seem very descriptive to me without further explanation (especially since the corresponding files are called forecast_target_ts and forecast_target_observations). Also, aren't both formats in a long format?
From the vignette, it's not quite clear to me what exactly the requirements for the two file formats are (i.e. what is required, what is just a column present in the examples).
- If I understand correctly, the only requirement for the "time series data" is a column called "observation". Maybe it makes more sense to call this format "raw" or "simple"?
- For the "target observations" format, you explain the requirements later on in the vignette. Maybe it would be helpful to have a one-sentence summary further up, at the beginning, explaining the relationship between the two target formats. Something like "the target observations format also requires a column called 'observation' and requires different columns according to the type of the forecast target (see below)".
Is there anything I can do with "Time series" format data that I can't do with "Target observations" format data? From the vignette alone my understanding was that the "Target observations" data is essentially a version of it with more/stricter requirements. The context in the discussion on why the two formats exist ("you mostly use the target one for evaluations, but then again it's helpful to have a simpler version for reason xyz") would be helpful to me as a reader
The term "model output file" is mentioned - maybe a link would be helpful?
Under "Target data uses" in the table - what exactly is "model estimation"?

elray1 Oct 15, 2024
Maintainer

I've just pushed some updates to the above-linked documents attempting to address some of Nikos's questions as well as questions that @harryhoch raised in the hubverse devteam meeting about the use of <NA> for the output_type_id for quantile predictions. In case it's helpful, the diff is available here.

harryhoch · 2024-10-16T01:41:56Z

harryhoch
Oct 16, 2024
Collaborator

@elray1, I didn't read every line, but it looks good. I think it might help to make some comparable comments in the hubExamples docs.

0 replies

elray1 · 2024-10-25T15:49:50Z

elray1
Oct 25, 2024
Maintainer

I’ve been noodling on our ongoing discussions around target data. I’d like to propose that we take an iterative approach to this problem:

- Let’s make some kind of formal decision about format(s) for target data that we will support and build some functionality around it. (In fact, we’ve already been building functionality around it in both hubVis and hubEvals without having ratified the data standards.)
- After that, we can continue to iterate on the documentation for it to try to get to a place where the format is clear/intelligible to our users.
- If we realize or get feedback that the formats we select are not working or we need to make refinements, we can make changes to the format definitions.

This iterative approach to data format definitions has been successful with the model output standards, and I think it makes sense to do that for target data as well.

To that end, I’d like to propose that in our “target data standard v 1.0” we use essentially the format that we’ve been discussing in the last few weeks, with some updates to the names of the objects based on our recent discussions. Specifically, I propose that we will have two formats representing target data directly in the first case, or “Oracle predictions” derived from the target data in the second case:

- I propose that we use the name “time series” for format 1 with observed target data in time series form. (This is not a change from what we've been discussing.)
  - function arguments could be named target_time_series, or for simplicity, we could just stick with target_data, which is what’s used in hubVis now.
- I propose that we use the name “oracle output” for format 2, which is a processed/reduced version of model output that would have been generated if you knew the outcome in advance. (The reductions I refer to here involve keeping only the row/column combinations that are needed to enable a join of oracle output with model output.)
  - function arguments would be named oracle_output

Additionally, for the oracle outputs format, I propose that the column that would have been named "value" in an ordinary model output submission file would be called "oracle_value" instead.

Our previous proposal, and existing work in example-complex-forecast-hub, hubExamples, hubEvals, and the hubEnsembles documentation, used the name "observation" for this column, but discussion in recent weeks made me realize that this is confusing because the column does not contain observed data; it contains predicted values from a model that knows the observed data. I think it's probably worth the effort to update those packages in order to have the clearer name.
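To illustrate the "model that knows the observed data" reading, here is a small sketch in plain Python of what oracle output rows might look like for one task id combination. The helper name, the column layout, and the treatment of the pmf output type (probability 1 on the observed category, 0 elsewhere) are assumptions made for this example, not a ratified format.

```python
def oracle_rows(task_ids, observed_value, observed_category, categories):
    """Build illustrative oracle output rows for one task id combination.

    Hypothetical helper. A mean prediction from an oracle is the observed
    value itself; a pmf prediction from an oracle puts all probability
    on the category that was actually observed.
    """
    rows = [dict(task_ids, output_type="mean", output_type_id=None,
                 oracle_value=observed_value)]
    for cat in categories:
        rows.append(dict(
            task_ids,
            output_type="pmf",
            output_type_id=cat,
            oracle_value=1.0 if cat == observed_category else 0.0,
        ))
    return rows

rows = oracle_rows(
    {"location": "US", "target_date": "2024-01-13"},
    observed_value=12.0,
    observed_category="moderate",
    categories=["low", "moderate", "high"],
)
for row in rows:
    print(row["output_type"], row["output_type_id"], row["oracle_value"])
```

The pmf rows show why "observation" is a misleading column name: the values 0 and 1 are predicted probabilities from the oracle, not observed data.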

3 replies

zkamvar Oct 25, 2024
Maintainer

I agree with this proposal! Calling it "oracle outputs" really hits the point home for me.

So, as I understand it, the modelers can use the time series target data from the previous round to produce forecasts and nowcasts. During evaluations of the models, the oracle data is generated from updated time series target data. The oracle data is used as a benchmark for comparing the model outputs and providing evaluations. Is that correct?

elray1 Oct 25, 2024
Maintainer

that's exactly right

elray1 Oct 30, 2024
Maintainer

I edited the above to the singular oracle_output (rather than oracle_outputs), matching model-output.
