`sample-output-type.md`: Clarifying questions and editing suggestions #169

zkamvar · 2024-08-06T19:48:23Z

I am having a hard time understanding the sample output type chapter. One disconnect is that some concepts are used before they are introduced (e.g. compound tasks and the implicit compound_idx column), and some sections that pertain only to compound modelling tasks are not subsections of that section.

Introduction

The introduction for the sample output type needs reworking. From what I've found in the historical documents, it seems that the text in the introduction was written before the sample schema was fleshed out:

hubDocs/docs/source/user-guide/sample-output-type.md

Lines 3 to 23 in a273815

    
    ## Introduction

    The sample `output_type` can be used to represent a probabilistic distribution through a collection of possible future observed values (“samples”) that come out of a predictive model. Depending on the setup of the model and the configuration settings of the hub, different information may be requested or required to identify each sample.

    In the hubverse, a “modeling task” is the element that is being predicted and that can be represented by a univariate (e.g., scalar, or single) value. We could also tie this to a tabular representation of data more concretely as a combination of values from a set of task id columns that uniquely define a single prediction. We note that this concept is similar to that of a [“forecast unit” in the scoringutils R package](https://epiforecasts.io/scoringutils/reference/set_forecast_unit.html).

    Take the following model_output data for the mean output_type as an example:

    | origin_date | horizon | location | output_type| output_type_id | value |
    |:----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
    | 2024-03-15 | -1 | MA | mean | NA| - |
    | 2024-03-15 |  0 | MA | mean | NA| - |
    | 2024-03-15 |  1 | MA | mean | NA| - |

    In the above table, the three task-id columns origin_date, horizon, and location uniquely define a modeling task. Here, there are three modeling tasks, represented by the tuples <br>

    ```
    {origin_date: “2024-03-15”, horizon: “-1”, location: “MA”}
    {origin_date: “2024-03-15”, horizon: “0”, location: “MA”}
    {origin_date: “2024-03-15”, horizon: “1”, location: “MA”}
    ```

    In words, the first of these tuples represents a forecast for one day (assume here the horizon is on the timescale of day) prior to the origin date of 2024-03-15 in Massachusetts.

When I read it, I wonder, "Why are we talking about the mean output type? This is the sample output type."

Individual modeling tasks

Why is the compound_idx column here? It appears to be reiterating the grouping of the target column. Is this a column I should be worried about? The text says that it is implicit, but why does it have a name that indicates that it is a column that actually exists?

Why is column_idx not defined in the schema?

Compound modeling tasks

This description could be more specific to the example data presented, highlighting the columns in the text.

we will look at a hub reporting on variant proportions observed at a given location and at a given time. In the table below, a single modeling task is a unique combination of values from the task-id variables origin_date, horizon, variant, and location. In the table below, one set of four rows with the same values in the origin_date, horizon, and location columns but different variant values below represent four predicted variant proportions.

maybe:

we will look at a hub reporting on variant proportions observed in Massachusetts (location) on 2024-03-15 (origin_date) for 7 and 14 day forecasts (horizon). This is represented in the table below showing four variants (AA, BB, CC, and DD) represented over two horizons, giving us eight unique modelling tasks.
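As a quick check of the count in the suggested wording, here is a minimal Python sketch that enumerates the combinations. The values (MA, 2024-03-15, horizons 7 and 14, variants AA to DD) come from the sentence above; the variable names are only illustrative and are not part of any hubverse tooling.

```python
from itertools import product

# Values taken from the suggested wording above.
origin_dates = ["2024-03-15"]
locations = ["MA"]
horizons = [7, 14]
variants = ["AA", "BB", "CC", "DD"]

# One modelling task per unique combination of task-id values.
modelling_tasks = [
    {"origin_date": d, "horizon": h, "variant": v, "location": loc}
    for d, loc, h, v in product(origin_dates, locations, horizons, variants)
]

print(len(modelling_tasks))  # 8
```

One origin date, one location, two horizons, and four variants give 1 x 1 x 2 x 4 = 8 modelling tasks, which matches the "eight unique modelling tasks" in the proposed text.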

What does "Base data: mean output_type" mean?

hubDocs/docs/source/user-guide/sample-output-type.md

Line 70 in a273815

    
           Base data: mean output_type. In the table below, an entry of “-” stands in for specific values to be provided by the submitter.

Four submissions

NOTE: For each submission, use level 4 headers, not bold text.

Pain points:

The schema says a minimum of 90 samples, but we present two samples each.
The sample numbers continue to increase across (as opposed to strictly within) tasks. Why?

Submission A

I am confused as to why the sample numbers keep increasing across the stratification and why there are only two samples per stratum. Should each stratum contain at least 90 samples (according to the schema)?

Submission B

I think I understand now that this is showing two samples per stratum, but I'm still confused as to why the sample numbers continue to increase after a change in strata.

Are the values for each sample identical?

Submission C

In this example, there is a single compound modeling task which we can describe as “Massachusetts with the origin_date of 2024-03-15”.

"single compound modeling task" is confusing because there are two columns selected here. They do not vary, so it makes some sense in retrospect, but the initial read of this gives some roadblocks.
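A rough way to see why two columns describe one compound modeling task is to ask which task-id columns stay constant within every sample index. The sketch below does that in plain Python. The rows are a hypothetical stand-in for Submission C (two sample indices, each spanning every horizon and variant, with horizons 7 and 14 as in the suggested wording earlier), not a copy of the chapter's table, and `constant_task_ids` is an illustrative helper, not a hubverse function.

```python
from itertools import product

TASK_ID_COLS = ["origin_date", "horizon", "variant", "location"]

def constant_task_ids(rows, task_id_cols=TASK_ID_COLS):
    """Return the task-id columns that take a single value within every sample index."""
    values_seen = {}  # sample index -> column -> set of values seen
    for row in rows:
        per_sample = values_seen.setdefault(row["output_type_id"], {})
        for col in task_id_cols:
            per_sample.setdefault(col, set()).add(row[col])
    return [
        col for col in task_id_cols
        if all(len(cols[col]) == 1 for cols in values_seen.values())
    ]

# Hypothetical stand-in for Submission C: two sample indices, each covering
# every horizon/variant combination for one origin_date and one location.
submission_c = [
    {"origin_date": "2024-03-15", "horizon": h, "variant": v,
     "location": "MA", "output_type_id": s}
    for s, h, v in product([0, 1], [7, 14], ["AA", "BB", "CC", "DD"])
]

print(constant_task_ids(submission_c))  # ['origin_date', 'location']
```

Under these assumptions, only origin_date and location are fixed within a sample, so the two columns together pick out one compound modeling task ("Massachusetts with the origin_date of 2024-03-15"), while horizon and variant vary inside each sample.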

Submission D

The phrase "plain language" is a bit demotivating for a sentence with six prepositions.

Configuration of `output_type_id`

This description is clear (but it could use some trimming to reduce complexity) and the table illustrates the validity question better, but I feel that two things need to happen:

The section should be renamed to "Configuration of compound_taskid_set".
The section should be a subsection of "Compound modelling tasks".

I feel like this sentence is in conflict with the meaning of the schema (emphasis mine):

A hub can specify a "compound_taskid_set" field in the metadata for the sample output_type to specify the task-id columns that must be used to define separate sample index values (as present in the output_type_id column). The following table shows how different specifications of this field would impact the validity of each of the example submissions A, B, C, and D.

Given that submission C is valid for all of the schema configurations, we should use "may" and not "must".

"columns that may be used to define"

Number of samples

As I indicated above, it is not clear why each task gets two samples. Also, I believe this belongs in the "Compound Modelling Tasks" section.

Relationship to output_types

I think this needs to be a subsection of "Compound Modelling Tasks".


zkamvar · 2024-08-08T13:48:12Z

I spoke with @elray1 yesterday and he was able to clarify a couple of things:

When I asked about the number of samples vs. the schema, his response was: "Was there anything to indicate that these were valid submissions?" I had suspected that was the reason. If nothing else changes, this needs to change. When building understanding in a tutorial, everything must be explicit. The easiest way to do this is to change the schema to allow for a minimum of 2 samples.
The compound_taskid_set columns are not "strata" (as I had incorrectly interpreted), and the concept behind them is not exactly straightforward. In reality, the compound_taskid_set is important for what it does not include: the variables that are used for the joint distribution (I understand the concept of drawing samples from multivariate distributions, but I still need to wrap my head around what exactly this means). One major problem: joint distributions are only ever mentioned in one section of the documentation, and that section is not in this chapter (it's in the Formats of Model Output section in the Model Output chapter). This needs to be clearly stated in this chapter.
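To make the joint-distribution point concrete for myself, here is a small Python sketch contrasting one sample drawn jointly across the four variants with values drawn separately per variant. The Dirichlet and Beta distributions are purely my illustrative choice (nothing in the chapter prescribes them), and the function names are hypothetical.

```python
import random

VARIANTS = ["AA", "BB", "CC", "DD"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

def joint_sample():
    """One draw of all four proportions together (symmetric Dirichlet via normalised gammas)."""
    gammas = [rng.gammavariate(1.0, 1.0) for _ in VARIANTS]
    total = sum(gammas)
    return {v: g / total for v, g in zip(VARIANTS, gammas)}

def independent_sample():
    """Each variant drawn from its own marginal, Beta(1, 3), ignoring the others."""
    return {v: rng.betavariate(1.0, 3.0) for v in VARIANTS}

joint = joint_sample()
separate = independent_sample()

print(sum(joint.values()))     # 1.0 up to floating-point rounding
print(sum(separate.values()))  # generally not 1
```

In the joint draw, a single sample index would cover all four variant rows and the proportions are coherent with each other (they sum to 1). Values drawn separately per variant have the same marginal distribution but carry no such relationship, which is roughly why it matters which task-id columns a sample index spans.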

zkamvar added the documentation Improvements or additions to documentation label Aug 6, 2024

github-project-automation bot added this to hubverse Development overview Aug 6, 2024

github-project-automation bot moved this to Todo in hubverse Development overview Aug 6, 2024

zkamvar added the question Further information is requested label Aug 6, 2024

zkamvar mentioned this issue Dec 12, 2024

Update samples for v4 #223

Merged

zkamvar closed this as completed in #223 Dec 12, 2024

github-project-automation bot moved this from Todo to Done in hubverse Development overview Dec 12, 2024

zkamvar reopened this Dec 18, 2024
