Submission format

Paolo Milano edited this page Aug 26, 2024 · 33 revisions

Projections should be stored as a parquet file in your model-output/team-model folder.

The parquet file must use a standardised file name, and contain specific variable names and values which identify the projections you are submitting. The automatic check validates both the filename and file contents to ensure the file is correct.

File name

Each projection file within the subdirectory should have the following name format:

<round_id>-<team>-<model>.parquet

The <round_id> is defined uniquely for each submission round and disease. It is composed of the season_cycle, which identifies the season and the submission cycle, and the disease indicator. The team and model in the file name must match the name of the model-output directory that contains the file (and correspond to the team_abbr and model_abbr parameters in the metadata file).
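The naming rule above can be sketched in Python as follows. This is only an illustration, not the hub's automatic check: the helper names `expected_file_name` and `name_matches_folder` are hypothetical, and the example path reuses the round id '2024_2025_1_FLU' with placeholder team and model names.

```python
from pathlib import PurePosixPath


def expected_file_name(round_id, team, model):
    """Build '<round_id>-<team>-<model>.parquet' from its three parts."""
    return f"{round_id}-{team}-{model}.parquet"


def name_matches_folder(path):
    """Check that the file name ends with '-<folder>.parquet', where
    <folder> is the 'team-model' directory containing the file, and that
    something (the round_id) precedes that suffix."""
    p = PurePosixPath(path)
    suffix = f"-{p.parent.name}.parquet"
    return p.name.endswith(suffix) and len(p.name) > len(suffix)


example = "model-output/team-model/" + expected_file_name(
    "2024_2025_1_FLU", "team", "model"
)
print(example, name_matches_folder(example))
```

Comparing against the folder name avoids having to split the team from the model, which would be ambiguous if either abbreviation contained a hyphen.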

File format

Required variables

The parquet file must contain only the following columns (in any order). No additional columns are allowed.

| column | data type | description |
| --- | --- | --- |
| round_id | string | The id of the submission round, e.g. '2024_2025_1_FLU', composed of the season cycle ('2024_2025_1') plus the disease ('FLU'). Will be defined for each round. |
| scenario_id | string | Id of the scenario as described in the round specifications (e.g. 'A', 'B', ...). |
| target | string | One of the targets defined/allowed for the round. |
| location | string | The ISO 3166-1 alpha-2 (ISO-2) geocode of the European country. We provide a geocode file to convert between country names and ISO-2 codes or, if using R, you can use the countrycode package. |
| pop_group | string | The age bin, or another population breakdown identifier, as defined in the round specs. |
| horizon | integer | An integer indicating the number of weeks ahead of the origin date (*) to which the predicted value corresponds. Each week starts on Monday and ends on Sunday. For more details, check the template CSV file for converting between dates and ISO weeks. |
| target_end_date | date | Target date corresponding to the projected value. Values must be a date in the format YYYY-MM-DD. |
| output_type | string | One of "quantile" or "sample". |
| output_type_id | string | When output_type = "sample", a value from 1 to 300 identifying the stochastic run. When output_type = "quantile", one of the 23 accepted quantiles, given as a string: 0.010, 0.025, 0.050, 0.100, 0.150, 0.200, 0.250, 0.300, 0.350, 0.400, 0.450, 0.500, 0.550, 0.600, 0.650, 0.700, 0.750, 0.800, 0.850, 0.900, 0.950, 0.975, 0.990. |
| value | double | The value of the prediction for the given target. |

(*): The origin date of the scenario simulations will be defined for each round and season_cycle and stated explicitly in the GitHub wiki documentation.
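Before submitting, the column and output_type_id rules from the table can be checked locally with a short Python sketch like the one below. It is not the hub's automatic check. The helper names are hypothetical, and two points are assumptions: quantiles are matched exactly as the three-decimal strings listed above, and sample ids are written as plain integer strings ("1" to "300").

```python
import re

# Column names copied from the "Required variables" table.
REQUIRED_COLUMNS = frozenset({
    "round_id", "scenario_id", "target", "location", "pop_group",
    "horizon", "target_end_date", "output_type", "output_type_id", "value",
})

# The 23 accepted quantiles, as strings (assumed to be matched exactly).
ACCEPTED_QUANTILES = (
    "0.010", "0.025", "0.050", "0.100", "0.150", "0.200", "0.250", "0.300",
    "0.350", "0.400", "0.450", "0.500", "0.550", "0.600", "0.650", "0.700",
    "0.750", "0.800", "0.850", "0.900", "0.950", "0.975", "0.990",
)


def check_columns(columns):
    """Return (missing, unexpected) column names, each as a sorted list."""
    found = set(columns)
    return sorted(REQUIRED_COLUMNS - found), sorted(found - REQUIRED_COLUMNS)


def is_valid_output_type_id(output_type, output_type_id):
    """Check one output_type_id value against the rule for its output_type."""
    if not isinstance(output_type_id, str):
        return False  # the column must hold strings, not numbers
    if output_type == "quantile":
        return output_type_id in ACCEPTED_QUANTILES
    if output_type == "sample":
        # Assumed form: an integer from 1 to 300 without leading zeros.
        is_int = re.fullmatch(r"[1-9][0-9]{0,2}", output_type_id) is not None
        return is_int and int(output_type_id) <= 300
    return False


missing, unexpected = check_columns(["round_id", "value", "notes"])
print("missing:", missing)
print("unexpected:", unexpected)
```

With a pandas data frame, `check_columns(df.columns)` would report any column that needs to be added or dropped.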

Parquet file format

The "arrow" library can be used to read and write parquet files in both R and Python (where it is available as "pyarrow"). In Python, the "pandas" library can be used as well.

For example, in R you can load "arrow" and then:

library("arrow")

file_name <- "model-output/team-model/round_id-team-model.parquet"

# To read "parquet" file format
df <- arrow::read_parquet(file_name)

# To write "parquet" file format
arrow::write_parquet(df, file_name)

The following code does the same but using Python and "pandas":

import pandas as pd

file_name = 'model-output/team-model/round_id-team-model.parquet'

# To read "parquet" file format:
df = pd.read_parquet(file_name)

# Write "parquet" file format
df.to_parquet(file_name)