update data README
sbfnk committed Dec 8, 2023
1 parent bb167fd commit 02236ac
Showing 4 changed files with 1,531 additions and 1,555 deletions.
49 changes: 23 additions & 26 deletions data-truth/README.Rmd
@@ -118,32 +118,30 @@ ggplot2::ggsave("plots/Hospitalisations.svg", p, width = 10.5, height = 7.5)
![Plot of hospitalisations](plots/Hospitalisations.svg)


### Cases and deaths
### Deaths

```{r cases-deaths, echo=FALSE, message=FALSE}
case_death_data <- list()
indicators <- c("Cases", "Deaths")
```{r deaths, echo=FALSE, message=FALSE}
death_data <- list()
indicators <- "Deaths"
for (indicator in indicators) {
case_death_data[[tolower(indicator)]] <-
death_data[[tolower(indicator)]] <-
readr::read_csv(here::here(
"data-truth", "ECDC", paste0("truth_ECDC-Incident ", indicator, ".csv")
),
show_col_types = FALSE
) |>
dplyr::filter(date >= max(snapshot_date) - lubridate::weeks(8))
}
case_locations <- sort(unique(case_death_data[["cases"]]$location_name))
death_locations <- sort(unique(case_death_data[["deaths"]]$location_name))
death_locations <- sort(unique(death_data[["deaths"]]$location_name))
```


- Cases: `r case_locations`
- Deaths: `r death_locations`

```{r case-death-data-warnings, results="asis"}
for (indicator in indicators) {
if (nrow(case_death_data[[tolower(indicator)]] > 0)) {
gaps <- case_death_data[[tolower(indicator)]] |>
if (nrow(death_data[[tolower(indicator)]]) > 0) {
gaps <- death_data[[tolower(indicator)]] |>
dplyr::arrange(date) |>
dplyr::group_by(location_name) |>
dplyr::slice(
@@ -165,52 +163,52 @@ for (indicator in indicators) {
}
```

We further evaluate forecasts of cases and deaths against data provided [ECDC](https://www.ecdc.europa.eu/), which we recommend using as the basis for corresponding forecasts.
We further evaluate forecasts of deaths against data provided by [ECDC](https://www.ecdc.europa.eu/), which we recommend using as the basis for corresponding forecasts.
These data are provided as reported by national health authorities and therefore are not consistent in definition, and care needs to be taken in interpreting them.

One particular issue that affects several of the case/death data streams it the one of right truncation.
One particular issue that affects several of the death data streams is that of right truncation.
This occurs when these are reported with a delay, and therefore recent data need to be treated as incomplete, posing additional challenges to forecasting such data streams and validating forecasts.

For our visualisations and assesments of forecast performance we treat cases and deaths as *final* 28 days after the reported date.
For our visualisations and assessments of forecast performance we treat deaths as *final* 28 days after the reported date.
Any further revisions will be ignored for the purposes of the Hub.

We provide multiple views of the data in order to facilitate modelling of COVID-19 cases and deaths with a 28 day cutoff.
In the [ECDC/snapshot](ECDC/snapshot) directory we provide weekly snapshots of the COVID-19 case and death data as collated by ECDC, before any further processing is applied.
The data in there are given either as weekly sums of cases/deaths.
We provide multiple views of the data in order to facilitate modelling of COVID-19 deaths with a 28 day cutoff.
In the [ECDC/snapshot](ECDC/snapshot) directory we provide weekly snapshots of the COVID-19 death data as collated by ECDC, before any further processing is applied.
The data in there are given as weekly sums of deaths.
In the [ECDC/final](OWID/final) directory we provide data that are considered "final", i.e. they stop 28 days before the latest date.
The files in this directory are the ones used for scoring the forecasts for their performance against observed data.

The single datasets in [ECDC/truth_ECDC-Incident Cases](ECDC/truth_ECDC-Incident Cases) and [ECDC/truth_ECDC-Incident Deaths](ECDC/truth_ECDC-Incident Deaths) contain the latest data, where the final versions of the data are included for dates more than 28 days before the latest snapshot date, and the most recent version for any subsequent data.
The single dataset in [ECDC/truth_ECDC-Incident Deaths](ECDC/truth_ECDC-Incident Deaths) contains the latest data, where the final versions of the data are included for dates more than 28 days before the latest snapshot date, and the most recent version for any subsequent data.
This is the dataset recommended for use in models that can take into account the truncation of the data. Please note that the `date` field in this file corresponds to the final day of the week reported, and the data has been shifted back one day to Saturday (instead of Sunday) in that file to comply with the Hub definition of an epidemiological week (Sunday-Saturday).
Past versions of this data set are in the [ECDC/truth](ECDC/truth) directory.
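
As a rough illustration of the 28 day rule described above, the following sketch labels weekly values as final or still expecting revisions. The toy data, the `snapshot_date` value and the `"final"` label are assumptions made for this example only; the actual processing used by the Hub may differ.

```r
# Hypothetical weekly death counts; the values are made up for illustration.
# Dates are Saturdays, matching the Sunday-Saturday epidemiological week.
truth <- data.frame(
  location = "XX",
  date = as.Date("2023-10-07") + 7 * 0:7,
  value = c(12, 15, 11, 9, 14, 10, 8, 5)
)

# Assumed date of the latest snapshot.
snapshot_date <- as.Date("2023-12-02")

# Weeks ending more than 28 days before the snapshot are treated as final;
# anything more recent may still be revised.
truth$status <- ifelse(
  truth$date < snapshot_date - 28,
  "final",
  "expecting revisions"
)

print(truth)
```

Models that can account for truncation would use all rows, while down-weighting or explicitly modelling those still expecting revisions.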

We further provide a set of [recommended cutoffs](ECDC/recommended-cutoffs.csv) for use with these data.
These are estimates of the truncation in the number of weeks that should be cut off the data set if the aim is to have a data set that is not further revised by more than 5%.
The corresponding datasets in [ECDC/truncated_ECDC-Incident Cases.csv](ECDC/truth_ECDC-Incident Cases.csv) and [ECDC/truncated_ECDC-Incident Deaths.csv](ECDC/truth_ECDC-Incident Deaths.csv) have these recent weeks removed and is recommended for use in models that cannot take into account the truncation of the data.
The corresponding dataset in [ECDC/truncated_ECDC-Incident Deaths.csv](ECDC/truth_ECDC-Incident Deaths.csv) has these recent weeks removed and is recommended for use in models that cannot take into account the truncation of the data.
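
To illustrate how per-location cutoffs could be applied, here is a minimal sketch in base R. The column names `location` and `cutoff_weeks`, as well as all values, are hypothetical; the actual layout of the recommended cutoffs file may differ.

```r
# Hypothetical weekly data for two made-up locations.
truth <- data.frame(
  location = rep(c("AA", "BB"), each = 4),
  date = rep(as.Date("2023-11-04") + 7 * 0:3, times = 2),
  value = c(10, 9, 7, 3, 20, 21, 19, 18)
)

# Hypothetical cutoffs: number of most recent weeks to drop per location.
cutoffs <- data.frame(
  location = c("AA", "BB"),
  cutoff_weeks = c(2, 0)
)

# Attach the cutoff to each row, then drop the most recent weeks.
merged <- merge(truth, cutoffs, by = "location")
latest <- max(merged$date)
keep <- merged$date <= latest - 7 * merged$cutoff_weeks
truncated <- merged[keep, c("location", "date", "value")]

print(truncated)
```

In this toy example location `AA` loses its two most recent weeks, while `BB` is left unchanged.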

The latest case/death data is plotted below, with the dashed line indicating data expecting to be substanially revised.
The latest death data is plotted below, with the dashed line indicating data expected to be substantially revised.

```{r weekly_case_death_data, echo = FALSE}
```{r weekly_death_data, echo = FALSE}
for (indicator in indicators) {
duplicate_final <- case_death_data[[tolower(indicator)]] |>
duplicate_final <- death_data[[tolower(indicator)]] |>
dplyr::group_by(location, location_name, source) |>
dplyr::filter(any(status == "expecting revisions"),
status != "expecting revisions") |>
dplyr::filter(date == max(date)) |>
dplyr::ungroup() |>
dplyr::mutate(status = "expecting revisions")
case_death_data[[tolower(indicator)]] <-
case_death_data[[tolower(indicator)]] |>
death_data[[tolower(indicator)]] <-
death_data[[tolower(indicator)]] |>
bind_rows(duplicate_final)
p <- ggplot2::ggplot(
case_death_data[[tolower(indicator)]] |>
death_data[[tolower(indicator)]] |>
filter(status != "expecting revisions"),
ggplot2::aes(x = date, y = value)
) +
ggplot2::geom_line() +
ggplot2::geom_line(
data = case_death_data[[tolower(indicator)]] |>
data = death_data[[tolower(indicator)]] |>
filter(status == "expecting revisions"),
linetype = "dashed"
) +
@@ -226,7 +224,6 @@ for (indicator in indicators) {
}
```

![Plot of cases](plots/Cases.svg)
![Plot of deaths](plots/Deaths.svg)


64 changes: 29 additions & 35 deletions data-truth/README.md
@@ -65,59 +65,52 @@ cannot take into account the truncation of the data.
The latest hospitalisation data is plotted below, with the dashed line
indicating data expected to be substantially revised.

![Plot of hospitalisations](plots/Hospitalisations.svg)
<figure>
<img src="plots/Hospitalisations.svg" alt="Plot of hospitalisations" />
<figcaption aria-hidden="true">Plot of hospitalisations</figcaption>
</figure>

### Cases and deaths

- Cases: Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Denmark,
Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland,
Italy, Latvia, Liechtenstein, Lithuania, Luxembourg, Malta,
Netherlands, Norway, Poland, Portugal, Romania, Slovakia, Slovenia,
Spain, Sweden
### Deaths

- Deaths: Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Denmark,
Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland,
Italy, Latvia, Liechtenstein, Lithuania, Luxembourg, Malta,
Netherlands, Norway, Poland, Portugal, Romania, Slovakia, Slovenia,
Spain, Sweden

- **Data warning!** Recent missing data for cases in: Austria, Croatia,
Cyprus, Denmark, Finland, France, Germany, Liechtenstein, Luxembourg,
Netherlands, and Spain

- **Data warning!** Recent missing data for deaths in: Austria, Belgium,
Croatia, Cyprus, Denmark, Finland, France, Germany, Iceland,
Liechtenstein, Luxembourg, Netherlands, Norway, and Spain
Cyprus, Denmark, Finland, France, Germany, Greece, Hungary, Iceland,
Ireland, Italy, Liechtenstein, Luxembourg, Netherlands, Norway,
Slovakia, and Spain

We further evaluate forecasts of cases and deaths against data provided
We further evaluate forecasts of deaths against data provided by
[ECDC](https://www.ecdc.europa.eu/), which we recommend using as the
basis for corresponding forecasts. These data are provided as reported
by national health authorities and therefore are not consistent in
definition, and care needs to be taken in interpreting them.

One particular issue that affects several of the case/death data streams
it the one of right truncation. This occurs when these are reported with
a delay, and therefore recent data need to be treated as incomplete,
One particular issue that affects several of the death data streams is
that of right truncation. This occurs when these are reported with a
delay, and therefore recent data need to be treated as incomplete,
posing additional challenges to forecasting such data streams and
validating forecasts.

For our visualisations and assessments of forecast performance we treat
cases and deaths as *final* 28 days after the reported date. Any further
revisions will be ignored for the purposes of the Hub.
deaths as *final* 28 days after the reported date. Any further revisions
will be ignored for the purposes of the Hub.

We provide multiple views of the data in order to facilitate modelling
of COVID-19 cases and deaths with a 28 day cutoff. In the
of COVID-19 deaths with a 28 day cutoff. In the
[ECDC/snapshot](ECDC/snapshot) directory we provide weekly snapshots of
the COVID-19 case and death data as collated by ECDC, before any further
the COVID-19 death data as collated by ECDC, before any further
processing is applied. The data in there are given as weekly sums
of cases/deaths. In the [ECDC/final](OWID/final) directory we provide
data that are considered “final”, i.e. they stop 28 days before the
latest date. The files in this directory are the ones used for scoring
the forecasts for their performance against observed data.

The single datasets in [ECDC/truth_ECDC-Incident
Cases](ECDC/truth_ECDC-Incident%20Cases) and [ECDC/truth_ECDC-Incident
Deaths](ECDC/truth_ECDC-Incident%20Deaths) contain the latest data,
of deaths. In the [ECDC/final](OWID/final) directory we provide data
that are considered “final”, i.e. they stop 28 days before the latest
date. The files in this directory are the ones used for scoring the
forecasts for their performance against observed data.

The single dataset in [ECDC/truth_ECDC-Incident
Deaths](ECDC/truth_ECDC-Incident%20Deaths) contains the latest data,
where the final versions of the data are included for dates more than 28
days before the latest snapshot date, and the most recent version for
any subsequent data. This is the dataset recommended for use in models
@@ -134,16 +127,17 @@ are estimates of the truncation in the number of weeks that should be
cut off the data set if the aim is to have a data set that is not
further revised by more than 5%. The corresponding datasets in
[ECDC/truncated_ECDC-Incident
Cases.csv](ECDC/truth_ECDC-Incident%20Cases.csv) and
[ECDC/truncated_ECDC-Incident
Deaths.csv](ECDC/truth_ECDC-Incident%20Deaths.csv) have these recent
weeks removed and are recommended for use in models that cannot take into
account the truncation of the data.

The latest case/death data is plotted below, with the dashed line
indicating data expecting to be substanially revised.
The latest death data is plotted below, with the dashed line indicating
data expected to be substantially revised.

![Plot of cases](plots/Cases.svg) ![Plot of deaths](plots/Deaths.svg)
<figure>
<img src="plots/Deaths.svg" alt="Plot of deaths" />
<figcaption aria-hidden="true">Plot of deaths</figcaption>
</figure>

## Additional data sources
