diff --git a/NEWS.md b/NEWS.md index dc3097359..0fb6b386d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,5 +1,9 @@ # EpiNow2 1.4.9000 +## Documentation + +* Two new vignettes have been added to cover the workflow and example uses. By @sbfnk in #458 and reviewed by @jamesmbaazam. + ## Package * Reduced the number of long-running examples. diff --git a/_pkgdown.yml b/_pkgdown.yml index e83c873fb..94156903a 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -40,6 +40,14 @@ navbar: href: articles/estimate_truncation.html - text: Gaussian Process implementation details href: articles/gaussian_process_implementation_details.html + usage: + text: Usage + - text: Workflow for Rt estimation and forecasting + href: articles/estimate_infections_workflow.html + - text: Examples: estimate_infections() + href: articles/estimate_infections_options.html + - text: Using epinow() for running in production mode + href: articles/epinow.html casestudies: text: Case studies menu: diff --git a/vignettes/.gitignore b/vignettes/.gitignore index 097b24163..2d19fc766 100644 --- a/vignettes/.gitignore +++ b/vignettes/.gitignore @@ -1,2 +1 @@ *.html -*.R diff --git a/vignettes/epinow.Rmd b/vignettes/epinow.Rmd new file mode 100644 index 000000000..bbafcde1f --- /dev/null +++ b/vignettes/epinow.Rmd @@ -0,0 +1,177 @@ +--- +title: "Using epinow() for running in production mode" +output: + rmarkdown::html_vignette: + toc: false + number_sections: false +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Using epinow() for running in production mode} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +The _EpiNow2_ package contains functionality to run `estimate_infections()` in production mode, i.e. with full logging and saving all relevant outputs and plots to dedicated folders in the hard drive. +This is done with the `epinow()` function, that takes the same options as `estimate_infections()` with some additional options that determine, for example, where output gets stored and what output exactly. +The function can be a useful option when, e.g., running the model daily with updated data on a high-performance computing server to feed into a dashboard. +For more detail on the various model options available, see the [Examples](estimate_infections_options.html) vignette, for more on the general modelling approach the [Workflow](estimate_infections_workflow.html), and for theoretical background the [Model definitions](estimate_infections.html) vignette + +# Running the model on a single region + +To run the model in production mode for a single region, set the parameters up in the same way as for `estimate_infections()` (see the [Workflow](estimate_infections_workflow.html) vignette). +Here we use the example delay and generation time distributions that come with the package. +This should be replaced with parameters relevant to the system that is being studied. + + +```r +library("EpiNow2") +options(mc.cores = 4) +reported_cases <- example_confirmed[1:60] +incubation_period <- get_incubation_period( + disease = "SARS-CoV-2", source = "lauer" +) +reporting_delay <- dist_spec( + mean = convert_to_logmean(2, 1), mean_sd = 0, + sd = convert_to_logsd(2, 1), sd_sd = 0, max = 10 +) +delay <- incubation_period + reporting_delay +generation_time <- get_generation_time( + disease = "SARS-CoV-2", source = "ganyani" +) +rt_prior <- list(mean = 2, sd = 0.1) +``` + +We can then run the `epinow()` function with the same arguments as `estimate_infections()`. + + +```r +res <- epinow(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + target_folder = "results" +) +#> Logging threshold set at INFO for the EpiNow2 logger +#> Writing EpiNow2 logs to the console and: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/regional-epinow/2020-04-21.log +#> Logging threshold set at INFO for the EpiNow2.epinow logger +#> Writing EpiNow2.epinow logs to the console and: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/epinow/2020-04-21.log +#> WARN [2023-10-02 20:01:50] epinow: There were 17 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. - +#> WARN [2023-10-02 20:01:50] epinow: Examine the pairs() plot to diagnose sampling problems +#> - +res$plots$R +#> NULL +``` + +The initial messages here indicate where log files can be found, and summarised results and plots are in the folder given by `target_folder` (here: `results/`). + +# Running the model simultaneously on multiple regions + +The package also contains functionality to conduct inference contemporaneously (if separately) in production mode on multiple time series, e.g. to run the model on multiple regions. +This is done with the `regional_epinow()` function. + +Say, for example, we construct a dataset containing two regions, `testland` and `realland` (in this simple example both containing the same case data). + + +```r +cases <- example_confirmed[1:60] +cases <- data.table::rbindlist(list( + data.table::copy(cases)[, region := "testland"], + cases[, region := "realland"] + )) +``` + +To then run this on multiple regions using the default options above, we could use + + +```r +region_rt <- regional_epinow( + reported_cases = cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), +) +#> INFO [2023-10-02 20:01:56] Producing following optional outputs: regions, summary, samples, plots, latest +#> Logging threshold set at INFO for the EpiNow2 logger +#> Writing EpiNow2 logs to the console and: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/regional-epinow/2020-04-21.log +#> Logging threshold set at INFO for the EpiNow2.epinow logger +#> Writing EpiNow2.epinow logs to: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/epinow/2020-04-21.log +#> INFO [2023-10-02 20:01:56] Reporting estimates using data up to: 2020-04-21 +#> INFO [2023-10-02 20:01:56] No target directory specified so returning output +#> INFO [2023-10-02 20:01:56] Producing estimates for: testland, realland +#> INFO [2023-10-02 20:01:56] Regions excluded: none +#> INFO [2023-10-02 20:03:16] Completed estimates for: testland +#> INFO [2023-10-02 20:04:44] Completed estimates for: realland +#> INFO [2023-10-02 20:04:44] Completed regional estimates +#> INFO [2023-10-02 20:04:44] Regions with estimates: 2 +#> INFO [2023-10-02 20:04:44] Regions with runtime errors: 0 +#> INFO [2023-10-02 20:04:44] Producing summary +#> INFO [2023-10-02 20:04:44] No summary directory specified so returning summary output +#> INFO [2023-10-02 20:04:45] No target directory specified so returning timings +## summary +region_rt$summary$summarised_results$table +#> Region New confirmed cases by infection date +#> 1: realland 2243 (1148 -- 4405) +#> 2: testland 2327 (1127 -- 4494) +#> Expected change in daily cases Effective reproduction no. +#> 1: Likely decreasing 0.88 (0.62 -- 1.2) +#> 2: Likely decreasing 0.89 (0.61 -- 1.2) +#> Rate of growth Doubling/halving time (days) +#> 1: -0.028 (-0.095 -- 0.037) -25 (19 -- -7.3) +#> 2: -0.025 (-0.098 -- 0.042) -27 (17 -- -7.1) +## plot +region_rt$summary$plots$R +``` + +![plot of chunk regional_epinow](figure/regional_epinow-1.png) + +If instead, we wanted to use the Gaussian Process for `testland` and a weekly random walk for `realland` we could specify these separately using the `opts_list()` and `update_list()` functions + + +```r +gp <- opts_list(gp_opts(), cases) +gp <- update_list(gp, list(realland = NULL)) +rt <- opts_list(rt_opts(), cases, realland = rt_opts(rw = 7)) +region_separate_rt <- regional_epinow( + reported_cases = cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt, gp = gp, +) +#> INFO [2023-10-02 20:04:45] Producing following optional outputs: regions, summary, samples, plots, latest +#> Logging threshold set at INFO for the EpiNow2 logger +#> Writing EpiNow2 logs to the console and: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/regional-epinow/2020-04-21.log +#> Logging threshold set at INFO for the EpiNow2.epinow logger +#> Writing EpiNow2.epinow logs to: /var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T//Rtmp2g83K0/epinow/2020-04-21.log +#> INFO [2023-10-02 20:04:45] Reporting estimates using data up to: 2020-04-21 +#> INFO [2023-10-02 20:04:45] No target directory specified so returning output +#> INFO [2023-10-02 20:04:45] Producing estimates for: testland, realland +#> INFO [2023-10-02 20:04:45] Regions excluded: none +#> INFO [2023-10-02 20:06:46] Completed estimates for: testland +#> INFO [2023-10-02 20:07:21] Completed estimates for: realland +#> INFO [2023-10-02 20:07:21] Completed regional estimates +#> INFO [2023-10-02 20:07:21] Regions with estimates: 2 +#> INFO [2023-10-02 20:07:21] Regions with runtime errors: 0 +#> INFO [2023-10-02 20:07:21] Producing summary +#> INFO [2023-10-02 20:07:21] No summary directory specified so returning summary output +#> INFO [2023-10-02 20:07:21] No target directory specified so returning timings +## summary +region_separate_rt$summary$summarised_results$table +#> Region New confirmed cases by infection date +#> 1: realland 2168 (1094 -- 4167) +#> 2: testland 2233 (1014 -- 4477) +#> Expected change in daily cases Effective reproduction no. +#> 1: Likely decreasing 0.86 (0.61 -- 1.2) +#> 2: Likely decreasing 0.88 (0.56 -- 1.2) +#> Rate of growth Doubling/halving time (days) +#> 1: -0.032 (-0.096 -- 0.034) -22 (20 -- -7.2) +#> 2: -0.028 (-0.11 -- 0.044) -25 (16 -- -6.1) +## plot +region_separate_rt$summary$plots$R +``` + +![plot of chunk regional_epinow_multiple](figure/regional_epinow_multiple-1.png) diff --git a/vignettes/epinow.Rmd.orig b/vignettes/epinow.Rmd.orig new file mode 100644 index 000000000..943966841 --- /dev/null +++ b/vignettes/epinow.Rmd.orig @@ -0,0 +1,113 @@ +--- +title: "Using epinow() for running in production mode" +output: + rmarkdown::html_vignette: + toc: false + number_sections: false +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Using epinow() for running in production mode} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.width = 6.5, + fig.height = 6.5 +) +``` + +The _EpiNow2_ package contains functionality to run `estimate_infections()` in production mode, i.e. with full logging and saving all relevant outputs and plots to dedicated folders in the hard drive. +This is done with the `epinow()` function, that takes the same options as `estimate_infections()` with some additional options that determine, for example, where output gets stored and what output exactly. +The function can be a useful option when, e.g., running the model daily with updated data on a high-performance computing server to feed into a dashboard. +For more detail on the various model options available, see the [Examples](estimate_infections_options.html) vignette, for more on the general modelling approach the [Workflow](estimate_infections_workflow.html), and for theoretical background the [Model definitions](estimate_infections.html) vignette + +# Running the model on a single region + +To run the model in production mode for a single region, set the parameters up in the same way as for `estimate_infections()` (see the [Workflow](estimate_infections_workflow.html) vignette). +Here we use the example delay and generation time distributions that come with the package. +This should be replaced with parameters relevant to the system that is being studied. + +```{r setup } +library("EpiNow2") +options(mc.cores = 4) +reported_cases <- example_confirmed[1:60] +incubation_period <- get_incubation_period( + disease = "SARS-CoV-2", source = "lauer" +) +reporting_delay <- dist_spec( + mean = convert_to_logmean(2, 1), mean_sd = 0, + sd = convert_to_logsd(2, 1), sd_sd = 0, max = 10 +) +delay <- incubation_period + reporting_delay +generation_time <- get_generation_time( + disease = "SARS-CoV-2", source = "ganyani" +) +rt_prior <- list(mean = 2, sd = 0.1) +``` + +We can then run the `epinow()` function with the same arguments as `estimate_infections()`. + +```{r epinow} +res <- epinow(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + target_folder = "results" +) +res$plots$R +``` + +The initial messages here indicate where log files can be found, and summarised results and plots are in the folder given by `target_folder` (here: `results/`). + +# Running the model simultaneously on multiple regions + +The package also contains functionality to conduct inference contemporaneously (if separately) in production mode on multiple time series, e.g. to run the model on multiple regions. +This is done with the `regional_epinow()` function. + +Say, for example, we construct a dataset containing two regions, `testland` and `realland` (in this simple example both containing the same case data). + +```{r construct_regional_cases} +cases <- example_confirmed[1:60] +cases <- data.table::rbindlist(list( + data.table::copy(cases)[, region := "testland"], + cases[, region := "realland"] + )) +``` + +To then run this on multiple regions using the default options above, we could use + +```{r regional_epinow, fig.width = 8} +region_rt <- regional_epinow( + reported_cases = cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), +) +## summary +region_rt$summary$summarised_results$table +## plot +region_rt$summary$plots$R +``` + +If instead, we wanted to use the Gaussian Process for `testland` and a weekly random walk for `realland` we could specify these separately using the `opts_list()` and `update_list()` functions + +```{r regional_epinow_multiple, fig.width = 8} +gp <- opts_list(gp_opts(), cases) +gp <- update_list(gp, list(realland = NULL)) +rt <- opts_list(rt_opts(), cases, realland = rt_opts(rw = 7)) +region_separate_rt <- regional_epinow( + reported_cases = cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt, gp = gp, +) +## summary +region_separate_rt$summary$summarised_results$table +## plot +region_separate_rt$summary$plots$R +``` diff --git a/vignettes/estimate_infections_options.Rmd b/vignettes/estimate_infections_options.Rmd new file mode 100644 index 000000000..3e1c664b1 --- /dev/null +++ b/vignettes/estimate_infections_options.Rmd @@ -0,0 +1,522 @@ +--- +title: "Examples: estimate_infections()" +output: + rmarkdown::html_vignette: + toc: true + number_sections: true +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Examples: estimate_infections()} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +The `estimate_infections()` function encodes a range of different model options. +In this vignette we apply some of these to the example data provided with the _EpiNow2_ package, highlighting differences in inference results and run times. +It is not meant as a comprehensive exploration of all the functionality in the package, but intended to give users a flavour of the kind of model options that exist for reproduction number estimation and forecasting within the package, and the differences in computational speed between them. +For mathematical detail on the model please consult the [model definition](estimate_infections.html) vignette, and for a more general description of the use of the function, the [estimate_infections workflow](estimate_infections_workflow.html) vignette. + +# Set up + +We first load the _EpiNow2_ package and also the _rstan_ package that we will use later to show the differences in run times between different model options. + + +```r +library("EpiNow2") +#> Loading required namespace: V8 +library("rstan") +#> Loading required package: StanHeaders +#> +#> rstan version 2.26.24 (Stan version 2.26.1) +#> For execution on a local, multicore CPU with excess RAM we recommend calling +#> options(mc.cores = parallel::detectCores()). +#> To avoid recompilation of unchanged Stan programs, we recommend calling +#> rstan_options(auto_write = TRUE) +#> For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions, +#> change `threads_per_chain` option: +#> rstan_options(threads_per_chain = 1) +``` + +In this examples we set the number of cores to use to 4 but the optimal value here will depend on the computing resources available. + + +```r +options(mc.cores = 4) +``` + +# Data + +We will use an example data set that is included in the package, representing an outbreak of COVID-19 with an initial rapid increase followed by decreasing incidence. + + +```r +library("ggplot2") +reported_cases <- example_confirmed[1:60] +ggplot(reported_cases, aes(x = date, y = confirm)) + + geom_col() + + theme_minimal() + + xlab("Date") + + ylab("Cases") +``` + +![plot of chunk data](figure/data-1.png) + +# Parameters + +Before running the model we need to decide on some parameter values, in particular any delays, the generation time, and a prior on the initial reproduction number. + +## Delays: incubation period and reporting delay + +Delays reflect the time that passes between infection and reporting, if these exist. In this example, we assume two delays, an _incubation period_ (i.e. delay from infection to symptom onset) and a _reporting delay_ (i.e. the delay from symptom onset to being recorded as a symptomatic case). These delays are usually not the same for everyone and are instead characterised by a distribution. For the incubation period we use an example from the literature that is included in the package. + + +```r +incubation_period <- get_incubation_period( + disease = "SARS-CoV-2", source = "lauer" +) +incubation_period +#> +#> Uncertain lognormal distribution with (untruncated) logmean 1.6 (SD 0.064) and logSD 0.42 (SD 0.069) +``` + +For the reporting delay, we use a lognormal distribution with mean of 2 days and standard deviation of 1 day. +Note that the mean and standard deviation must be converted to the log scale, which can be done using the `convert_log_logmean()` function. + + +```r +reporting_delay <- dist_spec( + mean = convert_to_logmean(2, 1), mean_sd = 0, + sd = convert_to_logsd(2, 1), sd_sd = 0, max = 10 +) +reporting_delay +#> +#> Fixed distribution with PMF [0.11 0.48 0.27 0.093 0.029 0.0096 0.0033 0.0012 0.00045 0.00018] +``` + +_EpiNow2_ provides a feature that allows us to combine these delays into one by summing them up + + +```r +delay <- incubation_period + reporting_delay +delay +#> +#> Combination of delay distributions: +#> Uncertain lognormal distribution with (untruncated) logmean 1.6 (SD 0.064) and logSD 0.42 (SD 0.069) +#> Fixed distribution with PMF [0.11 0.48 0.27 0.093 0.029 0.0096 0.0033 0.0012 0.00045 0.00018] +``` + +## Generation time + +If we want to estimate the reproduction number we need to provide a distribution of generation times. Here again we use an example from the literature that is included with the package. + + +```r +generation_time <- get_generation_time( + disease = "SARS-CoV-2", source = "ganyani" +) +generation_time +#> +#> Uncertain gamma distribution with (untruncated) mean 3.6 (SD 0.71) and SD 3.1 (SD 0.77) +``` + +## Initial reproduction number + +Lastly we need to choose a prior for the initial value of the reproduction number. This is assumed by the model to be normally distributed and we can set the mean and the standard deviation. We decide to set the mean to 2 and the standard deviation to 1. + + +```r +rt_prior <- list(mean = 2, sd = 0.1) +``` + +# Running the model + +We are now ready to run the model and will in the following show a number of possible options for doing so. + +## Default options + +By default the model uses a renewal equation for infections and a Gaussian Process prior for the reproduction number. +Putting all the data and parameters together and tweaking the Gaussian Process to have a shorter length scale prior than the default we run. + + +```r +def <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior) +) +#> Warning: There were 18 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(def) +#> measure estimate +#> 1: New confirmed cases by infection date 2248 (1143 -- 4365) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.88 (0.62 -- 1.2) +#> 4: Rate of growth -0.028 (-0.095 -- 0.037) +#> 5: Doubling/halving time (days) -25 (19 -- -7.3) +# elapsed time (in seconds) +get_elapsed_time(def$fit) +#> warmup sample +#> chain:1 47.203 35.025 +#> chain:2 44.973 36.144 +#> chain:3 55.958 31.756 +#> chain:4 42.884 35.091 +# summary plot +plot(def) +``` + +![plot of chunk default](figure/default-1.png) + +## Reducing the accuracy of the approximate Gaussian Process + +To speed up the calculation of the Gaussian Process we could decrease its accuracy, e.g. decrease the proportion of time points to use as basis functions from the default of 0.2 to 0.1. + + +```r +agp <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + gp = gp_opts(basis_prop = 0.1) +) +#> Warning: There were 9 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(agp) +#> measure estimate +#> 1: New confirmed cases by infection date 2321 (1229 -- 4588) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.89 (0.65 -- 1.2) +#> 4: Rate of growth -0.024 (-0.087 -- 0.041) +#> 5: Doubling/halving time (days) -28 (17 -- -7.9) +# elapsed time (in seconds) +get_elapsed_time(agp$fit) +#> warmup sample +#> chain:1 28.912 48.824 +#> chain:2 25.219 30.999 +#> chain:3 32.968 32.434 +#> chain:4 37.553 30.607 +# summary plot +plot(agp) +``` + +![plot of chunk lower_accuracy](figure/lower_accuracy-1.png) + +## Adjusting for future susceptible depletion + +We might want to adjust for future susceptible depletion. +Here, we do so by setting the population to 1000000 and projecting the reproduction number from the latest estimate (rather than the default, which fixes the reproduction number to an earlier time point based on the given reporting delays). +Note that this only affects the forecasts and is done using a crude adjustment (see the [model definition](estimate_infections.html)). + + +```r +dep <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts( + prior = rt_prior, + pop = 1000000, future = "latest" + ) +) +#> Warning: There were 3 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(dep) +#> measure estimate +#> 1: New confirmed cases by infection date 2269 (1117 -- 4377) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.88 (0.6 -- 1.2) +#> 4: Rate of growth -0.027 (-0.1 -- 0.037) +#> 5: Doubling/halving time (days) -26 (19 -- -6.9) +# elapsed time (in seconds) +get_elapsed_time(dep$fit) +#> warmup sample +#> chain:1 44.156 39.473 +#> chain:2 48.509 61.754 +#> chain:3 49.279 47.867 +#> chain:4 53.546 38.343 +# summary plot +plot(dep) +``` + +![plot of chunk susceptible_depletion](figure/susceptible_depletion-1.png) + +## Adjusting for truncation of the most recent data + +We might further want to adjust for right-truncation of recent data estimated using the [estimate_truncation](estimate_truncation.html) model. +Here, instead of doing so we assume that we know about truncation with mean of 1/2 day, sd 1/2 day, following a lognormal distribution and with a maximum of three days. + + +```r +trunc_dist <- dist_spec( + mean = convert_to_logmean(0.5, 0.5), mean_sd = 0.1, + sd = convert_to_logsd(0.5, 0.5), sd_sd = 0.1, + max = 3 +) +``` + +We can then use this in the `esimtate_infections()` function using the `truncation` option. + + +```r +trunc <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + truncation = trunc_opts(trunc_dist), + rt = rt_opts(prior = rt_prior) +) +#> Warning: There were 2 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(trunc) +#> measure estimate +#> 1: New confirmed cases by infection date 2522 (1296 -- 4745) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.92 (0.66 -- 1.2) +#> 4: Rate of growth -0.018 (-0.085 -- 0.045) +#> 5: Doubling/halving time (days) -38 (15 -- -8.2) +# elapsed time (in seconds) +get_elapsed_time(trunc$fit) +#> warmup sample +#> chain:1 50.283 45.799 +#> chain:2 41.206 45.834 +#> chain:3 37.676 73.545 +#> chain:4 52.158 47.402 +# summary plot +plot(trunc) +``` + +![plot of chunk truncation](figure/truncation-1.png) + +## Projecting the reproduction number with the Gaussian Process + +Instead of keeping the reproduction number fixed from a certain time point we might want to extrapolate the Gaussian Process into the future. +This will lead to wider uncertainty, and the researcher should check whether this or fixing the reproduction number from an earlier is desirable. + + +```r +project_rt <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts( + prior = rt_prior, future = "project" + ) +) +#> Warning: There were 8 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(project_rt) +#> measure estimate +#> 1: New confirmed cases by infection date 2236 (1068 -- 4349) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.87 (0.59 -- 1.2) +#> 4: Rate of growth -0.028 (-0.1 -- 0.04) +#> 5: Doubling/halving time (days) -25 (17 -- -6.7) +# elapsed time (in seconds) +get_elapsed_time(project_rt$fit) +#> warmup sample +#> chain:1 31.483 37.108 +#> chain:2 35.041 36.827 +#> chain:3 35.983 41.154 +#> chain:4 37.730 56.884 +# summary plot +plot(project_rt) +``` + +![plot of chunk gp_projection](figure/gp_projection-1.png) + +## Fixed reproduction number + +We might want to estimate a fixed reproduction number, i.e. assume that it does not change. + + +```r +fixed <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + gp = NULL +) +# summarise results +summary(fixed) +#> measure estimate +#> 1: New confirmed cases by infection date 15823 (8907 -- 29156) +#> 2: Expected change in daily cases Increasing +#> 3: Effective reproduction no. 1.2 (1.1 -- 1.3) +#> 4: Rate of growth 0.038 (0.026 -- 0.05) +#> 5: Doubling/halving time (days) 18 (14 -- 26) +# elapsed time (in seconds) +get_elapsed_time(fixed$fit) +#> warmup sample +#> chain:1 2.195 0.946 +#> chain:2 1.797 1.289 +#> chain:3 2.017 0.947 +#> chain:4 1.555 1.498 +# summary plot +plot(fixed) +``` + +![plot of chunk fixed](figure/fixed-1.png) + +## Breakpoints + +Instead of assuming the reproduction number varies freely or is fixed, we can assume that it is fixed but with breakpoints. +This can be done by adding a `breakpoint` column to the reported case data set. +e.g. if we think that the reproduction number was constant but would like to allow it to change on the 16th of March 2020 we would define a new case data set using + + +```r +bp_cases <- data.table::copy(reported_cases) +bp_cases <- bp_cases[, + breakpoint := ifelse(date == as.Date("2020-03-16"), 1, 0) +] +``` + +We then use this instead of `reported_cases` in the `estimate_infections()` function: + + +```r +bkp <- estimate_infections(bp_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + gp = NULL +) +# summarise results +summary(bkp) +#> measure estimate +#> 1: New confirmed cases by infection date 2465 (2051 -- 3002) +#> 2: Expected change in daily cases Decreasing +#> 3: Effective reproduction no. 0.91 (0.88 -- 0.93) +#> 4: Rate of growth -0.02 (-0.026 -- -0.015) +#> 5: Doubling/halving time (days) -34 (-46 -- -26) +# elapsed time (in seconds) +get_elapsed_time(bkp$fit) +#> warmup sample +#> chain:1 3.299 4.363 +#> chain:2 3.731 4.280 +#> chain:3 3.246 3.890 +#> chain:4 3.468 4.295 +# summary plot +plot(bkp) +``` + +![plot of chunk bp](figure/bp-1.png) + +## Weekly random walk + +Instead of a smooth Gaussian Process we might want the reproduction number to change step-wise, e.g. every week. +This can be achieved using the `rw` option which defines the length of the time step in a random walk that the reproduction number is assumed to follow. + + +```r +rw <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior, rw = 7), + gp = NULL +) +# summarise results +summary(rw) +#> measure estimate +#> 1: New confirmed cases by infection date 2125 (1162 -- 3967) +#> 2: Expected change in daily cases Likely decreasing +#> 3: Effective reproduction no. 0.86 (0.64 -- 1.1) +#> 4: Rate of growth -0.032 (-0.088 -- 0.028) +#> 5: Doubling/halving time (days) -22 (24 -- -7.8) +# elapsed time (in seconds) +get_elapsed_time(rw$fit) +#> warmup sample +#> chain:1 10.352 12.955 +#> chain:2 9.689 14.484 +#> chain:3 10.946 14.684 +#> chain:4 8.797 10.092 +# summary plot +plot(rw) +``` + +![plot of chunk weekly_rw](figure/weekly_rw-1.png) + +## No delays + +Whilst _EpiNow2_ allows the user to specify delays, it can also run directly on the data as does e.g. the [EpiEstim](https://CRAN.R-project.org/package=EpiEstim) package. + + +```r +no_delay <- estimate_infections( + reported_cases, + generation_time = generation_time_opts(generation_time) +) +#> Warning: There were 9 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +# summarise results +summary(no_delay) +#> measure estimate +#> 1: New confirmed cases by infection date 2770 (2323 -- 3317) +#> 2: Expected change in daily cases Decreasing +#> 3: Effective reproduction no. 0.88 (0.77 -- 0.98) +#> 4: Rate of growth -0.028 (-0.053 -- -0.0035) +#> 5: Doubling/halving time (days) -25 (-200 -- -13) +# elapsed time (in seconds) +get_elapsed_time(no_delay$fit) +#> warmup sample +#> chain:1 42.824 38.712 +#> chain:2 49.368 42.698 +#> chain:3 43.148 36.002 +#> chain:4 37.123 36.357 +# summary plot +plot(no_delay) +``` + +![plot of chunk no_delays](figure/no_delays-1.png) + +## Non-parametric infection model + +The package also includes a non-parametric infection model. +This runs much faster but does not use the renewal equation to generate infections. +Because of this none of the options defining the behaviour of the reproduction number are available in this case, limiting user choice and model generality. +It also means that the model is questionable for forecasting, which is why were here set the predictive horizon to 0. + + +```r +non_parametric <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = NULL, + backcalc = backcalc_opts(), + horizon = 0 +) +# summarise results +summary(non_parametric) +#> measure estimate +#> 1: New confirmed cases by infection date 2328 (1740 -- 3059) +#> 2: Expected change in daily cases Decreasing +#> 3: Effective reproduction no. 0.89 (0.78 -- 0.99) +#> 4: Rate of growth -0.024 (-0.049 -- -0.0012) +#> 5: Doubling/halving time (days) -29 (-590 -- -14) +# elapsed time (in seconds) +get_elapsed_time(non_parametric$fit) +#> warmup sample +#> chain:1 4.597 1.014 +#> chain:2 3.146 1.074 +#> chain:3 3.497 1.129 +#> chain:4 3.992 0.992 +# summary plot +plot(non_parametric) +``` + +![plot of chunk nonparametric](figure/nonparametric-1.png) diff --git a/vignettes/estimate_infections_options.Rmd.orig b/vignettes/estimate_infections_options.Rmd.orig new file mode 100644 index 000000000..30a4475c3 --- /dev/null +++ b/vignettes/estimate_infections_options.Rmd.orig @@ -0,0 +1,332 @@ +--- +title: "Examples: estimate_infections()" +output: + rmarkdown::html_vignette: + toc: true + number_sections: true +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Examples: estimate_infections()} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.width = 6.5, + fig.height = 6.5 +) +``` + +The `estimate_infections()` function encodes a range of different model options. +In this vignette we apply some of these to the example data provided with the _EpiNow2_ package, highlighting differences in inference results and run times. +It is not meant as a comprehensive exploration of all the functionality in the package, but intended to give users a flavour of the kind of model options that exist for reproduction number estimation and forecasting within the package, and the differences in computational speed between them. +For mathematical detail on the model please consult the [model definition](estimate_infections.html) vignette, and for a more general description of the use of the function, the [estimate_infections workflow](estimate_infections_workflow.html) vignette. + +# Set up + +We first load the _EpiNow2_ package and also the _rstan_ package that we will use later to show the differences in run times between different model options. + +```{r packages} +library("EpiNow2") +library("rstan") +``` + +In this examples we set the number of cores to use to 4 but the optimal value here will depend on the computing resources available. + +```{r} +options(mc.cores = 4) +``` + +# Data + +We will use an example data set that is included in the package, representing an outbreak of COVID-19 with an initial rapid increase followed by decreasing incidence. + +```{r data, fig.height = 4} +library("ggplot2") +reported_cases <- example_confirmed[1:60] +ggplot(reported_cases, aes(x = date, y = confirm)) + + geom_col() + + theme_minimal() + + xlab("Date") + + ylab("Cases") +``` + +# Parameters + +Before running the model we need to decide on some parameter values, in particular any delays, the generation time, and a prior on the initial reproduction number. + +## Delays: incubation period and reporting delay + +Delays reflect the time that passes between infection and reporting, if these exist. In this example, we assume two delays, an _incubation period_ (i.e. delay from infection to symptom onset) and a _reporting delay_ (i.e. the delay from symptom onset to being recorded as a symptomatic case). These delays are usually not the same for everyone and are instead characterised by a distribution. For the incubation period we use an example from the literature that is included in the package. + +```{r incubation_period} +incubation_period <- get_incubation_period( + disease = "SARS-CoV-2", source = "lauer" +) +incubation_period +``` + +For the reporting delay, we use a lognormal distribution with mean of 2 days and standard deviation of 1 day. +Note that the mean and standard deviation must be converted to the log scale, which can be done using the `convert_log_logmean()` function. + +```{r reporting_delay} +reporting_delay <- dist_spec( + mean = convert_to_logmean(2, 1), mean_sd = 0, + sd = convert_to_logsd(2, 1), sd_sd = 0, max = 10 +) +reporting_delay +``` + +_EpiNow2_ provides a feature that allows us to combine these delays into one by summing them up + +```{r delay} +delay <- incubation_period + reporting_delay +delay +``` + +## Generation time + +If we want to estimate the reproduction number we need to provide a distribution of generation times. Here again we use an example from the literature that is included with the package. + +```{r generation_time} +generation_time <- get_generation_time( + disease = "SARS-CoV-2", source = "ganyani" +) +generation_time +``` + +## Initial reproduction number + +Lastly we need to choose a prior for the initial value of the reproduction number. This is assumed by the model to be normally distributed and we can set the mean and the standard deviation. We decide to set the mean to 2 and the standard deviation to 1. + +```{r initial_r} +rt_prior <- list(mean = 2, sd = 0.1) +``` + +# Running the model + +We are now ready to run the model and will in the following show a number of possible options for doing so. + +## Default options + +By default the model uses a renewal equation for infections and a Gaussian Process prior for the reproduction number. +Putting all the data and parameters together and tweaking the Gaussian Process to have a shorter length scale prior than the default we run. + +```{r default} +def <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior) +) +# summarise results +summary(def) +# elapsed time (in seconds) +get_elapsed_time(def$fit) +# summary plot +plot(def) +``` + +## Reducing the accuracy of the approximate Gaussian Process + +To speed up the calculation of the Gaussian Process we could decrease its accuracy, e.g. decrease the proportion of time points to use as basis functions from the default of 0.2 to 0.1. + +```{r lower_accuracy} +agp <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + gp = gp_opts(basis_prop = 0.1) +) +# summarise results +summary(agp) +# elapsed time (in seconds) +get_elapsed_time(agp$fit) +# summary plot +plot(agp) +``` + +## Adjusting for future susceptible depletion + +We might want to adjust for future susceptible depletion. +Here, we do so by setting the population to 1000000 and projecting the reproduction number from the latest estimate (rather than the default, which fixes the reproduction number to an earlier time point based on the given reporting delays). +Note that this only affects the forecasts and is done using a crude adjustment (see the [model definition](estimate_infections.html)). + +```{r susceptible_depletion} +dep <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts( + prior = rt_prior, + pop = 1000000, future = "latest" + ) +) +# summarise results +summary(dep) +# elapsed time (in seconds) +get_elapsed_time(dep$fit) +# summary plot +plot(dep) +``` + +## Adjusting for truncation of the most recent data + +We might further want to adjust for right-truncation of recent data estimated using the [estimate_truncation](estimate_truncation.html) model. +Here, instead of doing so we assume that we know about truncation with mean of 1/2 day, sd 1/2 day, following a lognormal distribution and with a maximum of three days. + +```{r define_truncation} +trunc_dist <- dist_spec( + mean = convert_to_logmean(0.5, 0.5), mean_sd = 0.1, + sd = convert_to_logsd(0.5, 0.5), sd_sd = 0.1, + max = 3 +) +``` + +We can then use this in the `esimtate_infections()` function using the `truncation` option. + +```{r truncation} +trunc <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + truncation = trunc_opts(trunc_dist), + rt = rt_opts(prior = rt_prior) +) +# summarise results +summary(trunc) +# elapsed time (in seconds) +get_elapsed_time(trunc$fit) +# summary plot +plot(trunc) +``` + +## Projecting the reproduction number with the Gaussian Process + +Instead of keeping the reproduction number fixed from a certain time point we might want to extrapolate the Gaussian Process into the future. +This will lead to wider uncertainty, and the researcher should check whether this or fixing the reproduction number from an earlier is desirable. + +```{r gp_projection} +project_rt <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts( + prior = rt_prior, future = "project" + ) +) +# summarise results +summary(project_rt) +# elapsed time (in seconds) +get_elapsed_time(project_rt$fit) +# summary plot +plot(project_rt) +``` + +## Fixed reproduction number + +We might want to estimate a fixed reproduction number, i.e. assume that it does not change. + +```{r fixed} +fixed <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + gp = NULL +) +# summarise results +summary(fixed) +# elapsed time (in seconds) +get_elapsed_time(fixed$fit) +# summary plot +plot(fixed) +``` + +## Breakpoints + +Instead of assuming the reproduction number varies freely or is fixed, we can assume that it is fixed but with breakpoints. +This can be done by adding a `breakpoint` column to the reported case data set. +e.g. if we think that the reproduction number was constant but would like to allow it to change on the 16th of March 2020 we would define a new case data set using + +```{r bp_define} +bp_cases <- data.table::copy(reported_cases) +bp_cases <- bp_cases[, + breakpoint := ifelse(date == as.Date("2020-03-16"), 1, 0) +] +``` + +We then use this instead of `reported_cases` in the `estimate_infections()` function: + +```{r bp} +bkp <- estimate_infections(bp_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior), + gp = NULL +) +# summarise results +summary(bkp) +# elapsed time (in seconds) +get_elapsed_time(bkp$fit) +# summary plot +plot(bkp) +``` + +## Weekly random walk + +Instead of a smooth Gaussian Process we might want the reproduction number to change step-wise, e.g. every week. +This can be achieved using the `rw` option which defines the length of the time step in a random walk that the reproduction number is assumed to follow. + +```{r weekly_rw} +rw <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior, rw = 7), + gp = NULL +) +# summarise results +summary(rw) +# elapsed time (in seconds) +get_elapsed_time(rw$fit) +# summary plot +plot(rw) +``` + +## No delays + +Whilst _EpiNow2_ allows the user to specify delays, it can also run directly on the data as does e.g. the [EpiEstim](https://CRAN.R-project.org/package=EpiEstim) package. + +```{r no_delays} +no_delay <- estimate_infections( + reported_cases, + generation_time = generation_time_opts(generation_time) +) +# summarise results +summary(no_delay) +# elapsed time (in seconds) +get_elapsed_time(no_delay$fit) +# summary plot +plot(no_delay) +``` + +## Non-parametric infection model + +The package also includes a non-parametric infection model. +This runs much faster but does not use the renewal equation to generate infections. +Because of this none of the options defining the behaviour of the reproduction number are available in this case, limiting user choice and model generality. +It also means that the model is questionable for forecasting, which is why were here set the predictive horizon to 0. + +```{r nonparametric} +non_parametric <- estimate_infections(reported_cases, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = NULL, + backcalc = backcalc_opts(), + horizon = 0 +) +# summarise results +summary(non_parametric) +# elapsed time (in seconds) +get_elapsed_time(non_parametric$fit) +# summary plot +plot(non_parametric) +``` diff --git a/vignettes/estimate_infections_workflow.Rmd b/vignettes/estimate_infections_workflow.Rmd new file mode 100644 index 000000000..996699d97 --- /dev/null +++ b/vignettes/estimate_infections_workflow.Rmd @@ -0,0 +1,260 @@ +--- +title: "Workflow for Rt estimation and forecasting" +output: + rmarkdown::html_vignette: + toc: true + number_sections: true +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Workflow for Rt estimation and forecasting} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +This vignette describes the typical workflow for estimating reproduction numbers and performing short-term forecasts for a disease spreading in a given setting using _EpiNow2_. +The vignette uses the default non-stationary Gaussian process model included in the package. +See other vignettes for a more thorough exploration of [alternative model variants](estimate_infections_options.html) and [theoretical background](estimate_infections.html). + +# Data + +Obtaining a good and full understanding of the data being used is an important first step in any inference procedure such as the one applied here. +_EpiNow2_ expects data in the format of a data frame with two columns, `date` and `confirm`, where `confirm` stands for the number of confirmed counts - although in reality this can be applied to any data including suspected cases and lab-confirmed outcomes. +The user might already have the data as such a time series provided, for example, on public dashboards or directly from public health authorities. +Alternatively, they can be constructed from individual-level data, for example using the [incidence2](https://cran.r-project.org/web/packages/incidence2/index.html) R package. +An example data set called `example_confirm` is included in the package: + + +```r +head(example_confirmed) +#> date confirm +#> 1: 2020-02-22 14 +#> 2: 2020-02-23 62 +#> 3: 2020-02-24 53 +#> 4: 2020-02-25 97 +#> 5: 2020-02-26 93 +#> 6: 2020-02-27 78 +``` + +Any estimation procedure is only as good as the data that feeds into it. +A thorough understanding of the data that is used for _EpiNow2_ and its limitations is a prerequisite for its use. +This includes but is not limited to biases in the population groups that are represented (_EpiNow2_ assumes a closed population with all infections being caused by other infections in the same population), reporting artefacts and delays, and completeness of reporting. +Some of these can be mitigated using the routines available in _EpiNow2_ as described below, but others will cause biases in the results and need to be carefully considered when interpreting the results. + +# Set up + +We first load the _EpiNow2_ package. + + +```r +library("EpiNow2") +``` + +We then set the number of cores to use. We will want to run 4 MCMC chains in parallel so we set this to 4. + + +```r +options(mc.cores = 4) +``` + +If we had fewer than 4 available or wanted to run fewer than 4 chains (at the expense of some robustness), or had fewer than 4 computing cores available we could set it to that. To find out the number of cores available one can use the [detectCores](https://rdrr.io/r/parallel/detectCores.html) function from the `parallel` package. + +# Parameters + +Once a data set has been identified, a number of relevant parameters need to be considered before using _EpiNow2_. +As these will affect any results, it is worth spending some time investigating what their values should be. + +## Delay distributions + +_EpiNow2_ works with different delays that apply to different parts of the infection and observation process. +They are defined using a common interface with the `dist_spec()` function. +For help with this function, see its manual page + + +```r +?EpiNow2::dist_spec +``` + +In all cases, the distributions given can be *fixed* (i.e. have no uncertainty) or *variable* (i.e. have associated uncertainty). +For example, to define a fixed gamma distribution with mean 3, standard deviation (sd) 1 and maximum value 10, you would write + + +```r +dist_spec(mean = 3, sd = 1, distribution = "gamma", max = 10) +#> +#> Fixed distribution with PMF [0.0038 0.15 0.39 0.3 0.12 0.03 0.006 0.00096 0.00013 1.6e-05] +``` + +If distributions are variable, the values with uncertainty are treated as [prior probability densities](https://en.wikipedia.org/wiki/Prior_probability) in the Bayesian inference framework used by _EpiNow2_, i.e. they are estimated as part of the inference. +For example, to define a variable gamma distribution where uncertainty in the mean is given by a normal distribution with mean 3 and sd 2, and uncertainty in the standard deviation is given by a normal distribution with mean 1 and sd 0.1, with a maximum value 10, you would write + + +```r +dist_spec( + mean = 3, mean_sd = 2, sd = 1, sd_sd = 0.1, distribution = "gamma", max = 10 +) +#> +#> Uncertain gamma distribution with (untruncated) mean 3 (SD 2) and SD 1 (SD 0.1) +``` + +There are various ways the specific delay distributions mentioned below might be obtained. +Often, they will come directly from the existing literature reviewed by the user and studies conducted elsewhere. +Sometimes it might be possible to obtain them from existing databases, e.g. using the [epiparameter](https://github.com/epiverse-trace/epiparameter) R package. +Alternatively they might be obtainable from raw data, e.g. linelists. +The _EpiNow2_ package contains functionality for estimating delay distributions from observed delays in the `estimate_delays()` function. +For a more comprehensive treatment of delays and their estimation avoiding common biases one can consider, for example, the [dynamicaltruncation](https://github.com/parksw3/epidist-paper) R package and associated paper. + +### Generation intervals + +The generation interval is a delay distribution that describes the amount of time that passes between an individual becoming infected and infecting someone else. +In _EpiNow2_, the generation time distribution is defined by a call to `generation_time_opts()`, a function that takes a single argument defined as a `dist_spec` object (returned by `dist_spec()`). +For example, to define the generation time as gamma distributed with uncertain mean centered on 3 (sd: 2) and sd centered on 1 (sd: 0.1), a maximum value of 10 and weighted by the number of case data points we would use + + +```r +generation_time <- dist_spec( + mean = 3, mean_sd = 2, sd = 1, sd_sd = 0.1, distribution = "gamma", max = 10 +) +generation_time_opts(generation_time) +``` + +### Reporting delays + +_EpiNow2_ calculates reproduction numbers based on the trajectory of infection incidence. +Usually this is not observed directly. +Instead, we calculate case counts based on, for example, onset of symptoms, lab confirmations, hospitalisations, etc. +In order to estimate the trajectory of infection incidence from this we need to either know or estimate the distribution of delays from infection to count. +Often, such counts are composed of multiple delays for which we only have separate information, for example the incubation period (time from infection to symptom onset) and reporting delay (time from symptom onset to being a case in the data, e.g. via lab confirmation, if counts are not by the date of symptom onset). +In this case, we can combine multiple delays defined using `dist_spec()` with the plus (`+`) operator, e.g. + + +```r +incubation_period <- dist_spec( + mean = 1.6, mean_sd = 0.05, sd = 0.5, sd_sd = 0.05, + distribution = "lognormal", max = 14 +) +reporting_delay <- dist_spec( + mean = 0.5, sd = 0.5, distribution = "lognormal", max = 10 +) +incubation_period + reporting_delay +#> +#> Combination of delay distributions: +#> Uncertain lognormal distribution with (untruncated) logmean 1.6 (SD 0.05) and logSD 0.5 (SD 0.05) +#> Fixed distribution with PMF [0.16 0.49 0.23 0.077 0.025 0.0084 0.003 0.0011 0.00045 0.00019] +``` + +In _EpiNow2_, the reporting delay distribution is defined by a call to `delay_opts()`, a function that takes a single argument defined as a `dist_spec` object (returned by `dist_spec()`). +For example, if our observations were by symptom onset we would use + + +```r +delay_opts(incubation_period) +``` + +If they were by date of lab confirmation that happens with a delay given by `reporting_delay`, we would use + + +```r +delay <- incubation_period + reporting_delay +delay_opts(delay) +``` + +### Truncation + +Besides the delay from infection to the event that is recorded in the data, there can also be a delay from that event to being recorded in the data. +For example, data reported by symptom onset may only become part of the dataset once lab confirmation has occurred, or even a day or two after that confirmation. +Statistically, this means our data is right-truncated. +In practice, it means that recent data will be unlikely to be complete. + +The amount of such truncation that exists in the data can be estimated from multiple snapshots of the data, i.e. what the data looked like at multiple past dates. +One can then use methods that use the amount of backfilling that occurred 1, 2, ... days after data for a date are first reported. +In _EpiNow2_, this can be done using the `estimate_truncation()` method which returns, amongst others, posterior estimates of the truncation distribution. +For more details on the model used for this, see the [estimate_truncation](estimate_truncation.html) vignette. + + +```r +?estimate_truncation +``` + +In the `estimate_infections()` function, the truncation distribution is defined by a call to `trunc_opts()`, a function that takes a single argument defined as a `dist_spec` (either defined by the user or obtained from a call to `estimate_truncation()` or any other method for estimating right truncation). +This will then be used to correct for right truncation in the data. + +The separation of estimation of right truncation on the one hand and estimation of the reproduction number on the other may be attractive for practical purposes but is questionable statistically as it separates two processes that are not strictly separable, potentially introducing a bias. +An alternative approach where these are estimated jointly is being implemented in the [epinowcast](https://package.epinowcast.org/) package, which is being developed by the _EpiNow2_ developers with collaborators. + +## Completeness of reporting + +Another issue affecting the progression from infections to reported outcomes is underreporting, i.e. the fact that not all infections are reported as cases. +This varies both by pathogen and population (and e.g. the proportion of infections that are asymptomatic) as well as the specific outcome used as data and where it is located on the severity pyramid (e.g. hospitalisations vs. community cases). +In _EpiNow2_ we can specify the proportion of infections that we expect to be observed (with uncertainty assumed represented by a truncated normal distribution with bounds at 0 and 1) using the `scale` argument to the `obs_opts()` function. +For example, if we think that 40% (with standard deviation 1%) of infections end up in the data as observations we could specify. + + +```r +obs_scale <- list(mean = 0.4, sd = 0.01) +obs_opts(scale = obs_scale) +``` + +## Initial reproduction number + +The default model that `estimate_infections()` uses to estimate reproduction numbers requires specification of a prior probability distribution for the initial reproduction number. +This represents the user's initial belief of the value of the reproduction number, where there is no data yet to inform its value. +By default this is assumed to be represented by a lognormal distribution with mean and standard deviation of 1. +It can be changed using the `rt_opts()` function. +For example, if the user believes that at the very start of the data the reproduction number was 2, with uncertainty in this belief represented by a standard deviation of 1, they would use + + +```r +rt_prior <- list(mean = 2, sd = 1) +rt_opts(prior = rt_prior) +``` + +## Weighing delay priors + +When providing uncertain delay distributions one can end up in a situation where the estimated means are shifted a long way from the given distribution means, and possibly further than is deemed realistic by the user. +In that case, one could specify narrower prior distributions (e.g., smaller `mean_sd`) in order to keep the estimated means closer to the given mean, but this can be difficult to do in a principled manner in practice. +As a more straightforward alternative, one can choose to weigh the generation time priors by the number of data points in the case data set by setting `weigh_delay_priors = TRUE` (the default). + +# Estimation and forecasting + +All the options are combined in a call to the `estimate_infections()` function. +For example, using some of the options described above one could call + + +```r +def <- estimate_infections( + example_confirmed, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior) +) +#> Warning: There were 3 divergent transitions after warmup. See +#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup +#> to find out why this is a problem and how to eliminate them. +#> Warning: Examine the pairs() plot to diagnose sampling problems +``` + +Alternatively, for production environments, we recommend using the `epinow()` function. It uses `estimate_infections()` internally and provides functionality for logging and saving results and plots in dedicated directories in the user's file system. + +## Forecasting secondary outcomes + +The `estimate_infections()` function works with a single time series of outcomes such as cases by symptom onset or hospitalisations. +Sometimes one wants to further create forecasts of other secondary outcomes such as deaths. +The package contains functionality to estimate the delay and scaling between multiple time series with the `estimate_secondary()` function, as well as for using this to make forecasts with the `forecast_secondary()` function. + +# Interpretation + +To visualise the results one can use the `plot()` function that comes with the package + + +```r +plot(def) +``` + +![plot of chunk results](figure/results-1.png) + +The results returned by the `estimate_infections` model depend on the values assigned to all to parameters discussed in this vignette, i.e. delays, scaling, and reproduction numbers, as well as the model variant used and its parameters. +Any interpretation of the results will therefore need to bear these in mind, as well as any properties of the data and/or the subpopulations that it represents. +See the [Model options](estimate_infections_options.html) vignette for an illustration of the impact of model choice. diff --git a/vignettes/estimate_infections_workflow.Rmd.orig b/vignettes/estimate_infections_workflow.Rmd.orig new file mode 100644 index 000000000..93b137a82 --- /dev/null +++ b/vignettes/estimate_infections_workflow.Rmd.orig @@ -0,0 +1,230 @@ +--- +title: "Workflow for Rt estimation and forecasting" +output: + rmarkdown::html_vignette: + toc: true + number_sections: true +bibliography: library.bib +csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +vignette: > + %\VignetteIndexEntry{Workflow for Rt estimation and forecasting} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +library("EpiNow2") +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette describes the typical workflow for estimating reproduction numbers and performing short-term forecasts for a disease spreading in a given setting using _EpiNow2_. +The vignette uses the default non-stationary Gaussian process model included in the package. +See other vignettes for a more thorough exploration of [alternative model variants](estimate_infections_options.html) and [theoretical background](estimate_infections.html). + +# Data + +Obtaining a good and full understanding of the data being used is an important first step in any inference procedure such as the one applied here. +_EpiNow2_ expects data in the format of a data frame with two columns, `date` and `confirm`, where `confirm` stands for the number of confirmed counts - although in reality this can be applied to any data including suspected cases and lab-confirmed outcomes. +The user might already have the data as such a time series provided, for example, on public dashboards or directly from public health authorities. +Alternatively, they can be constructed from individual-level data, for example using the [incidence2](https://cran.r-project.org/web/packages/incidence2/index.html) R package. +An example data set called `example_confirm` is included in the package: + +```{r} +head(example_confirmed) +``` + +Any estimation procedure is only as good as the data that feeds into it. +A thorough understanding of the data that is used for _EpiNow2_ and its limitations is a prerequisite for its use. +This includes but is not limited to biases in the population groups that are represented (_EpiNow2_ assumes a closed population with all infections being caused by other infections in the same population), reporting artefacts and delays, and completeness of reporting. +Some of these can be mitigated using the routines available in _EpiNow2_ as described below, but others will cause biases in the results and need to be carefully considered when interpreting the results. + +# Set up + +We first load the _EpiNow2_ package. + +```{r packages} +library("EpiNow2") +``` + +We then set the number of cores to use. We will want to run 4 MCMC chains in parallel so we set this to 4. + +```{r} +options(mc.cores = 4) +``` + +If we had fewer than 4 available or wanted to run fewer than 4 chains (at the expense of some robustness), or had fewer than 4 computing cores available we could set it to that. To find out the number of cores available one can use the [detectCores](https://rdrr.io/r/parallel/detectCores.html) function from the `parallel` package. + +# Parameters + +Once a data set has been identified, a number of relevant parameters need to be considered before using _EpiNow2_. +As these will affect any results, it is worth spending some time investigating what their values should be. + +## Delay distributions + +_EpiNow2_ works with different delays that apply to different parts of the infection and observation process. +They are defined using a common interface with the `dist_spec()` function. +For help with this function, see its manual page + +```{r eval = FALSE} +?EpiNow2::dist_spec +``` + +In all cases, the distributions given can be *fixed* (i.e. have no uncertainty) or *variable* (i.e. have associated uncertainty). +For example, to define a fixed gamma distribution with mean 3, standard deviation (sd) 1 and maximum value 10, you would write + +```{r} +dist_spec(mean = 3, sd = 1, distribution = "gamma", max = 10) +``` + +If distributions are variable, the values with uncertainty are treated as [prior probability densities](https://en.wikipedia.org/wiki/Prior_probability) in the Bayesian inference framework used by _EpiNow2_, i.e. they are estimated as part of the inference. +For example, to define a variable gamma distribution where uncertainty in the mean is given by a normal distribution with mean 3 and sd 2, and uncertainty in the standard deviation is given by a normal distribution with mean 1 and sd 0.1, with a maximum value 10, you would write + +```{r} +dist_spec( + mean = 3, mean_sd = 2, sd = 1, sd_sd = 0.1, distribution = "gamma", max = 10 +) +``` + +There are various ways the specific delay distributions mentioned below might be obtained. +Often, they will come directly from the existing literature reviewed by the user and studies conducted elsewhere. +Sometimes it might be possible to obtain them from existing databases, e.g. using the [epiparameter](https://github.com/epiverse-trace/epiparameter) R package. +Alternatively they might be obtainable from raw data, e.g. linelists. +The _EpiNow2_ package contains functionality for estimating delay distributions from observed delays in the `estimate_delays()` function. +For a more comprehensive treatment of delays and their estimation avoiding common biases one can consider, for example, the [dynamicaltruncation](https://github.com/parksw3/epidist-paper) R package and associated paper. + +### Generation intervals + +The generation interval is a delay distribution that describes the amount of time that passes between an individual becoming infected and infecting someone else. +In _EpiNow2_, the generation time distribution is defined by a call to `generation_time_opts()`, a function that takes a single argument defined as a `dist_spec` object (returned by `dist_spec()`). +For example, to define the generation time as gamma distributed with uncertain mean centered on 3 (sd: 2) and sd centered on 1 (sd: 0.1), a maximum value of 10 and weighted by the number of case data points we would use + +```{r, results = 'hide'} +generation_time <- dist_spec( + mean = 3, mean_sd = 2, sd = 1, sd_sd = 0.1, distribution = "gamma", max = 10 +) +generation_time_opts(generation_time) +``` + +### Reporting delays + +_EpiNow2_ calculates reproduction numbers based on the trajectory of infection incidence. +Usually this is not observed directly. +Instead, we calculate case counts based on, for example, onset of symptoms, lab confirmations, hospitalisations, etc. +In order to estimate the trajectory of infection incidence from this we need to either know or estimate the distribution of delays from infection to count. +Often, such counts are composed of multiple delays for which we only have separate information, for example the incubation period (time from infection to symptom onset) and reporting delay (time from symptom onset to being a case in the data, e.g. via lab confirmation, if counts are not by the date of symptom onset). +In this case, we can combine multiple delays defined using `dist_spec()` with the plus (`+`) operator, e.g. + +```{r} +incubation_period <- dist_spec( + mean = 1.6, mean_sd = 0.05, sd = 0.5, sd_sd = 0.05, + distribution = "lognormal", max = 14 +) +reporting_delay <- dist_spec( + mean = 0.5, sd = 0.5, distribution = "lognormal", max = 10 +) +incubation_period + reporting_delay +``` + +In _EpiNow2_, the reporting delay distribution is defined by a call to `delay_opts()`, a function that takes a single argument defined as a `dist_spec` object (returned by `dist_spec()`). +For example, if our observations were by symptom onset we would use + +```{r, results = 'hide'} +delay_opts(incubation_period) +``` + +If they were by date of lab confirmation that happens with a delay given by `reporting_delay`, we would use + +```{r, results = 'hide'} +delay <- incubation_period + reporting_delay +delay_opts(delay) +``` + +### Truncation + +Besides the delay from infection to the event that is recorded in the data, there can also be a delay from that event to being recorded in the data. +For example, data reported by symptom onset may only become part of the dataset once lab confirmation has occurred, or even a day or two after that confirmation. +Statistically, this means our data is right-truncated. +In practice, it means that recent data will be unlikely to be complete. + +The amount of such truncation that exists in the data can be estimated from multiple snapshots of the data, i.e. what the data looked like at multiple past dates. +One can then use methods that use the amount of backfilling that occurred 1, 2, ... days after data for a date are first reported. +In _EpiNow2_, this can be done using the `estimate_truncation()` method which returns, amongst others, posterior estimates of the truncation distribution. +For more details on the model used for this, see the [estimate_truncation](estimate_truncation.html) vignette. + +```{r eval = FALSE} +?estimate_truncation +``` + +In the `estimate_infections()` function, the truncation distribution is defined by a call to `trunc_opts()`, a function that takes a single argument defined as a `dist_spec` (either defined by the user or obtained from a call to `estimate_truncation()` or any other method for estimating right truncation). +This will then be used to correct for right truncation in the data. + +The separation of estimation of right truncation on the one hand and estimation of the reproduction number on the other may be attractive for practical purposes but is questionable statistically as it separates two processes that are not strictly separable, potentially introducing a bias. +An alternative approach where these are estimated jointly is being implemented in the [epinowcast](https://package.epinowcast.org/) package, which is being developed by the _EpiNow2_ developers with collaborators. + +## Completeness of reporting + +Another issue affecting the progression from infections to reported outcomes is underreporting, i.e. the fact that not all infections are reported as cases. +This varies both by pathogen and population (and e.g. the proportion of infections that are asymptomatic) as well as the specific outcome used as data and where it is located on the severity pyramid (e.g. hospitalisations vs. community cases). +In _EpiNow2_ we can specify the proportion of infections that we expect to be observed (with uncertainty assumed represented by a truncated normal distribution with bounds at 0 and 1) using the `scale` argument to the `obs_opts()` function. +For example, if we think that 40% (with standard deviation 1%) of infections end up in the data as observations we could specify. + +```{r results = 'hide'} +obs_scale <- list(mean = 0.4, sd = 0.01) +obs_opts(scale = obs_scale) +``` + +## Initial reproduction number + +The default model that `estimate_infections()` uses to estimate reproduction numbers requires specification of a prior probability distribution for the initial reproduction number. +This represents the user's initial belief of the value of the reproduction number, where there is no data yet to inform its value. +By default this is assumed to be represented by a lognormal distribution with mean and standard deviation of 1. +It can be changed using the `rt_opts()` function. +For example, if the user believes that at the very start of the data the reproduction number was 2, with uncertainty in this belief represented by a standard deviation of 1, they would use + +```{r results = 'hide'} +rt_prior <- list(mean = 2, sd = 1) +rt_opts(prior = rt_prior) +``` + +## Weighing delay priors + +When providing uncertain delay distributions one can end up in a situation where the estimated means are shifted a long way from the given distribution means, and possibly further than is deemed realistic by the user. +In that case, one could specify narrower prior distributions (e.g., smaller `mean_sd`) in order to keep the estimated means closer to the given mean, but this can be difficult to do in a principled manner in practice. +As a more straightforward alternative, one can choose to weigh the generation time priors by the number of data points in the case data set by setting `weigh_delay_priors = TRUE` (the default). + +# Estimation and forecasting + +All the options are combined in a call to the `estimate_infections()` function. +For example, using some of the options described above one could call + +```{r} +def <- estimate_infections( + example_confirmed, + generation_time = generation_time_opts(generation_time), + delays = delay_opts(delay), + rt = rt_opts(prior = rt_prior) +) +``` + +Alternatively, for production environments, we recommend using the `epinow()` function. It uses `estimate_infections()` internally and provides functionality for logging and saving results and plots in dedicated directories in the user's file system. + +## Forecasting secondary outcomes + +The `estimate_infections()` function works with a single time series of outcomes such as cases by symptom onset or hospitalisations. +Sometimes one wants to further create forecasts of other secondary outcomes such as deaths. +The package contains functionality to estimate the delay and scaling between multiple time series with the `estimate_secondary()` function, as well as for using this to make forecasts with the `forecast_secondary()` function. + +# Interpretation + +To visualise the results one can use the `plot()` function that comes with the package + +```{r results} +plot(def) +``` + +The results returned by the `estimate_infections` model depend on the values assigned to all to parameters discussed in this vignette, i.e. delays, scaling, and reproduction numbers, as well as the model variant used and its parameters. +Any interpretation of the results will therefore need to bear these in mind, as well as any properties of the data and/or the subpopulations that it represents. +See the [Model options](estimate_infections_options.html) vignette for an illustration of the impact of model choice. diff --git a/vignettes/figure/bp-1.png b/vignettes/figure/bp-1.png new file mode 100644 index 000000000..3fd702fba Binary files /dev/null and b/vignettes/figure/bp-1.png differ diff --git a/vignettes/figure/data-1.png b/vignettes/figure/data-1.png new file mode 100644 index 000000000..3a94d6db3 Binary files /dev/null and b/vignettes/figure/data-1.png differ diff --git a/vignettes/figure/default-1.png b/vignettes/figure/default-1.png new file mode 100644 index 000000000..82686601a Binary files /dev/null and b/vignettes/figure/default-1.png differ diff --git a/vignettes/figure/fixed-1.png b/vignettes/figure/fixed-1.png new file mode 100644 index 000000000..75fa86fa4 Binary files /dev/null and b/vignettes/figure/fixed-1.png differ diff --git a/vignettes/figure/gp_projection-1.png b/vignettes/figure/gp_projection-1.png new file mode 100644 index 000000000..99c82de01 Binary files /dev/null and b/vignettes/figure/gp_projection-1.png differ diff --git a/vignettes/figure/lower_accuracy-1.png b/vignettes/figure/lower_accuracy-1.png new file mode 100644 index 000000000..82407a8bc Binary files /dev/null and b/vignettes/figure/lower_accuracy-1.png differ diff --git a/vignettes/figure/no_delays-1.png b/vignettes/figure/no_delays-1.png new file mode 100644 index 000000000..affd0032f Binary files /dev/null and b/vignettes/figure/no_delays-1.png differ diff --git a/vignettes/figure/nonparametric-1.png b/vignettes/figure/nonparametric-1.png new file mode 100644 index 000000000..601709c09 Binary files /dev/null and b/vignettes/figure/nonparametric-1.png differ diff --git a/vignettes/figure/regional_epinow-1.png b/vignettes/figure/regional_epinow-1.png new file mode 100644 index 000000000..e8cb4fdfd Binary files /dev/null and b/vignettes/figure/regional_epinow-1.png differ diff --git a/vignettes/figure/regional_epinow_multiple-1.png b/vignettes/figure/regional_epinow_multiple-1.png new file mode 100644 index 000000000..127a672a2 Binary files /dev/null and b/vignettes/figure/regional_epinow_multiple-1.png differ diff --git a/vignettes/figure/susceptible_depletion-1.png b/vignettes/figure/susceptible_depletion-1.png new file mode 100644 index 000000000..9c4691a5f Binary files /dev/null and b/vignettes/figure/susceptible_depletion-1.png differ diff --git a/vignettes/figure/truncation-1.png b/vignettes/figure/truncation-1.png new file mode 100644 index 000000000..f9dd5d845 Binary files /dev/null and b/vignettes/figure/truncation-1.png differ diff --git a/vignettes/figure/weekly_rw-1.png b/vignettes/figure/weekly_rw-1.png new file mode 100644 index 000000000..2d4387208 Binary files /dev/null and b/vignettes/figure/weekly_rw-1.png differ diff --git a/vignettes/precompile.R b/vignettes/precompile.R new file mode 100644 index 000000000..94792a23f --- /dev/null +++ b/vignettes/precompile.R @@ -0,0 +1,15 @@ +# Vignettes that have long run times + +library("knitr") +knit( + "estimate_infections_options.Rmd.orig", + "estimate_infections_options.Rmd" +) +knit( + "estimate_infections_workflow.Rmd.orig", + "estimate_infections_workflow.Rmd" +) +knit( + "epinow.Rmd.orig", + "epinow.Rmd" +)