sample-output-type.md: Clarifying questions and editing suggestions #169

Open
zkamvar opened this issue Aug 6, 2024 · 1 comment · Fixed by #223
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments


zkamvar commented Aug 6, 2024

I am having a hard time understanding the sample output type chapter. One disconnect is that some concepts are used before they are ever introduced (e.g. compound tasks and the implicit compound_idx column). Another is that some sections are only pertinent to compound modelling tasks but are not subsections of that section.

Introduction

The introduction for the sample output type needs reworking. From what I've found in the historical documents, it seems that the text in the introduction was written before the sample schema was fleshed out:

## Introduction
The sample `output_type` can be used to represent a probabilistic distribution through a collection of possible future observed values (“samples”) that come out of a predictive model. Depending on the setup of the model and the configuration settings of the hub, different information may be requested or required to identify each sample.
In the hubverse, a “modeling task” is the element that is being predicted and that can be represented by a univariate (e.g., scalar, or single) value. We could also tie this to a tabular representation of data more concretely as a combination of values from a set of task id columns that uniquely define a single prediction. We note that this concept is similar to that of a [“forecast unit” in the scoringutils R package](https://epiforecasts.io/scoringutils/reference/set_forecast_unit.html).
Take the following model_output data for the mean output_type as an example:
| origin_date | horizon | location | output_type| output_type_id | value |
|:----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| 2024-03-15 | -1 | MA | mean | NA| - |
| 2024-03-15 | 0 | MA | mean | NA| - |
| 2024-03-15 | 1 | MA | mean | NA| - |
In the above table, the three task-id columns origin_date, horizon, and location uniquely define a modeling task. Here, there are three modeling tasks, represented by the tuples <br>
```
{origin_date: “2024-03-15”, horizon: “-1”, location: “MA”}
{origin_date: “2024-03-15”, horizon: “0”, location: “MA”}
{origin_date: “2024-03-15”, horizon: “1”, location: “MA”}
```
In words, the first of these tuples represents a forecast for one day (assume here the horizon is on the timescale of day) prior to the origin date of 2024-03-15 in Massachusetts.

When I read it, I wonder, "Why are we talking about the mean output type? This is the sample output type."

Individual modeling tasks

Why is the compound_idx column here? It appears to be reiterating the grouping of the target column. Is this a column I should be worried about? The text says that it is implicit, but why does it have a name that indicates that it is a column that actually exists?

Why is compound_idx not defined in the schema?

Compound modeling tasks

This description could be more specific to the example data presented, highlighting the columns in the text.

we will look at a hub reporting on variant proportions observed at a given location and at a given time. In the table below, a single modeling task is a unique combination of values from the task-id variables origin_date, horizon, variant, and location. In the table below, one set of four rows with the same values in the origin_date, horizon, and location columns but different variant values below represent four predicted variant proportions.

maybe:

we will look at a hub reporting on variant proportions observed in Massachusetts (location) on 2024-03-15 (origin_date) for 7 and 14 day forecasts (horizon). This is represented in the table below showing four variants (AA, BB, CC, and DD) represented over two horizons, giving us eight unique modelling tasks.
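The count in the suggested wording can be checked by enumerating the task-id combinations. The sketch below is plain Python; the values are taken from the suggested text, and the variable names are only illustrative.

```python
from itertools import product

# Task-id values taken from the suggested wording above.
origin_dates = ["2024-03-15"]
locations = ["MA"]
horizons = [7, 14]
variants = ["AA", "BB", "CC", "DD"]

# Each unique combination of task-id values is one modelling task.
modeling_tasks = [
    {"origin_date": d, "horizon": h, "variant": v, "location": loc}
    for d, h, v, loc in product(origin_dates, horizons, variants, locations)
]

print(len(modeling_tasks))  # 8
```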

What does "Base data: mean output_type" mean?

Base data: mean output_type. In the table below, an entry of “-” stands in for specific values to be provided by the submitter.

Four submissions

NOTE: For each submission, use level 4 headers, not bold text.

Pain points:

  1. the schema says minimum of 90 samples, but we present two samples each.
  2. the sample numbers continue to increase across (as opposed to strictly within) tasks. Why?

Submission A

I am confused as to why the sample numbers keep increasing across the stratification and why there are only two samples per stratum. Shouldn't each stratum contain at least 90 samples, according to the schema?
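One way to make the mismatch concrete is to count the distinct sample indices per modelling task and compare that count with the minimum. The rows below are made up to mimic the layout described here (two samples per task, indices increasing across tasks); they are not copied from the chapter, and `min_samples` is a hypothetical stand-in for the minimum of 90 mentioned above.

```python
from collections import defaultdict

# Hypothetical rows: (horizon, variant, sample index).
rows = [
    (7, "AA", 0), (7, "AA", 1),
    (7, "BB", 2), (7, "BB", 3),
    (14, "AA", 4), (14, "AA", 5),
    (14, "BB", 6), (14, "BB", 7),
]
min_samples = 90  # assumed minimum, as cited from the schema in this issue

# Collect the distinct sample indices seen for each task.
samples_per_task = defaultdict(set)
for horizon, variant, sample_idx in rows:
    samples_per_task[(horizon, variant)].add(sample_idx)

# Tasks whose sample count falls below the minimum.
too_few = {
    task: len(ids)
    for task, ids in samples_per_task.items()
    if len(ids) < min_samples
}
print(too_few)
```

With these made-up rows, every task is flagged, which is the inconsistency a reader runs into.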

Submission B

I think I understand now that this is showing two samples per stratum, but I'm still confused as to why the sample numbers continue to increase after a change in strata.

Are the values for each sample identical?

Submission C

In this example, there is a single compound modeling task which we can describe as “Massachusetts with the origin_date of 2024-03-15”.

"single compound modeling task" is confusing because there are two columns selected here. They do not vary, so it makes some sense in retrospect, but it is a roadblock on the initial read.

Submission D

The phrase "plain language" is a bit demotivating for a sentence with six prepositions.

Configuration of output_type_id

This description is clear (but it could use some trimming to reduce complexity) and the table illustrates the validity question better, but I feel that two things need to happen:

  1. The section should be renamed to "Configuration of compound_taskid_set"
  2. The section should be a subsection of "Compound modelling tasks"

I feel like this sentence is in conflict with the meaning of the schema (emphasis mine).

A hub can specify a "compound_taskid_set" field in the metadata for the sample output_type to specify the task-id columns that **must** be used to define separate sample index values (as present in the output_type_id column). The following table shows how different specifications of this field would impact the validity of each of the example submissions A, B, C, and D.

Given that submission C is valid for all of the schema configurations, we should use "may" and not "must".

"columns that may be used to define"
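The reading behind "may" can be sketched as a subset check. This assumes (it is not confirmed in this thread) that a submission is acceptable when the task ids held constant within each sample index form a subset of the configured compound_taskid_set. The function name and the four configurations are hypothetical; only the description of submission C ("Massachusetts with the origin_date of 2024-03-15") comes from the chapter.

```python
def is_compatible(observed, configured):
    """Assumed rule: the observed set must be a subset of the configured set."""
    return set(observed) <= set(configured)

# Submission C holds only origin_date and location constant within a sample.
submission_c = {"origin_date", "location"}

# Hypothetical hub configurations of compound_taskid_set.
configs = [
    ["origin_date", "horizon", "variant", "location"],
    ["origin_date", "horizon", "location"],
    ["origin_date", "variant", "location"],
    ["origin_date", "location"],
]

results = [is_compatible(submission_c, cfg) for cfg in configs]
print(results)  # [True, True, True, True]
```

Under this assumed rule, the configured columns act as an upper bound on what a submission holds constant, which is why "may" fits better than "must".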

Number of samples

As I indicated above, it is unclear why each task gets two samples. Also, I believe this belongs in the "Compound Modelling Tasks" section.

Relationship to output_types

I think this needs to be a subsection of "Compound Modelling Tasks".

@zkamvar zkamvar added the documentation Improvements or additions to documentation label Aug 6, 2024
@zkamvar zkamvar added the question Further information is requested label Aug 6, 2024

zkamvar commented Aug 8, 2024

I spoke with @elray1 yesterday and he was able to clarify a couple of things:

  1. when I asked about the number of samples vs the schema: "Was there anything to indicate that these were valid submissions?" I had suspected that was the reason. If nothing else changes, this needs to change. When building understanding in a tutorial, everything must be explicit. The easiest way to do this is to change the schema to allow for a minimum of 2 samples.

  2. The task ids in the compound_taskid_set are not "strata" (as I had incorrectly interpreted), and the concept behind them is not exactly straightforward. In reality, the compound_taskid_set is important for what it does not include: the variables that are used for the joint distribution (I understand the concept of drawing samples from multivariate distributions, but I still need to wrap my head around what exactly this means). One major problem: joint distributions are only ever mentioned in one section of the documentation, and that section is not in this chapter (it's in the Formats of Model Output section in the Model Output chapter). It needs to be clearly stated in this chapter.
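As a rough illustration of "what the set does not include": if rows that share a sample index are read as one draw from a joint distribution (an assumption here, based on the conversation summarized above), then the task ids that vary within a sample index are the jointly modelled ones, and the task ids that stay constant play the role of the compound_taskid_set. The rows and the helper `split_task_ids` below are hypothetical.

```python
from collections import defaultdict

TASK_IDS = ("origin_date", "horizon", "variant", "location")

# Hypothetical rows: task-id values plus a sample index (output_type_id).
rows = [
    {"origin_date": "2024-03-15", "horizon": 7, "variant": "AA", "location": "MA", "output_type_id": 0},
    {"origin_date": "2024-03-15", "horizon": 7, "variant": "BB", "location": "MA", "output_type_id": 0},
    {"origin_date": "2024-03-15", "horizon": 14, "variant": "AA", "location": "MA", "output_type_id": 1},
    {"origin_date": "2024-03-15", "horizon": 14, "variant": "BB", "location": "MA", "output_type_id": 1},
]

def split_task_ids(rows):
    """Return (task ids constant within every sample, task ids that vary)."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["output_type_id"]].append(row)

    varying = set()
    for group in by_sample.values():
        for task_id in TASK_IDS:
            # More than one distinct value under a single sample index
            # means that sample spans this task id.
            if len({r[task_id] for r in group}) > 1:
                varying.add(task_id)

    constant = [t for t in TASK_IDS if t not in varying]
    return constant, sorted(varying)

constant, joint = split_task_ids(rows)
print(constant)  # ['origin_date', 'horizon', 'location']
print(joint)     # ['variant']
```

In this made-up data each sample index covers both variants, so `variant` is the jointly modelled variable and the remaining task ids identify the compound task.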
