Introduction and background
Since fall 2023, a group of researchers from the US CDC Center for Forecasting and Outbreak Analytics (CFA) and the Reich Lab at UMass Amherst, in consultation with folks from the Nextstrain project, have developed a plan to launch the SARS-CoV-2 Variant Nowcast Hub.
Collaborative and open forecast hubs have emerged as a valuable way to centralize and coordinate predictive modeling efforts for public health. In realms where multiple teams are tackling the same problem using different data inputs and/or modeling methodologies, a hub can standardize targets in ways that facilitate model comparison and the integration of outputs from multiple models into public health practice.
While SARS-CoV-2 variant dynamics received the most attention from the scientific community in 2021 and 2022, SARS-CoV-2 genomic sequences continue to be generated, and trends in which variants predominate will continue to impact transmission across the US and the world. From a modeling perspective, there is less consensus about a standard way to represent model outputs for multivariate variant predictions than there is for other outcomes. Therefore, a key reason we want to build this nowcast hub is to help us learn how best to build this kind of collaborative modeling effort, potentially not just for SARS-CoV-2 but also for other rapidly evolving pathogens.
This post summarizes some of the thinking to date about the set-up of the repository, with the goal of publicizing the effort and soliciting feedback on design choices while they can still easily be changed. Comments and further discussion on all items about this proposed hub are welcome.
Timeline
Discussions and planning for this hub have been ongoing since fall 2023, and we have a target launch date of fall 2024.
What will modelers be asked to predict?
Generally, we plan to solicit predictions of frequencies of the predominant lineages or clades in the United States, at a daily timescale and at the geographic resolution of US states plus two other jurisdictions (Washington, DC and Puerto Rico). Details about these choices follow in the subsections below. The hub will solicit predictions of frequencies (i.e., numbers between 0 and 1) associated with each clade or group of clades, for a particular location and a particular day.
Submission deadlines
We are planning for submissions to be due at 8pm ET every Wednesday. This time was chosen to give modelers time at the beginning of the week to run and adjust models, and stakeholders time at the end of the week to incorporate preliminary results into discussions or decision making.
Processed datasets, selecting variant clades
Each week, several days prior to the submission deadline, the hub administrators will provide two datasets:
1. a file with counts of reported sequences for each clade, by location and collection date, and
2. a file listing the clades whose frequencies teams will be asked to predict that week.
To create the clade count file (number 1 in the list above), hub administrators will oversee the creation of an aggregated version of the Nextstrain USA sequence count file based on US data from GenBank. The Nextstrain file is typically updated daily in the late evening US eastern time (it is only updated when new data are available), and the hub's weekly workflow will pull the most recent version when it runs. The precise lineage assignment model (sometimes referred to as a “reference tree”) that was used, as well as the version of the raw sequence data, will also be stored as metadata to facilitate reproducibility and evaluation. Versions of these target data files will be maintained by the hub, so that modeling teams can return later and retrospectively test their models against the data as they were available in real time.
To create the file that lists which clades will be accepted for a given week (file number 2 in the list above), in each submission week we will select up to ten Nextstrain clades from among those that had a reported prevalence of at least 1% across the US in any of the past three MMWR weeks. All remaining clades will be grouped into an “other” category, for which predictions of combined prevalence will also be collected; no more than ten categories (including “other”) will be selected in a given week. A sketch of this selection rule follows.
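To make the rule concrete, here is a minimal Python sketch. The column names (`mmwr_week`, `clade`, `sequences`) are hypothetical, and the hub's actual tooling may differ.

```python
import pandas as pd

def select_clades(counts: pd.DataFrame, max_clades: int = 10) -> list[str]:
    """Select up to `max_clades` categories (including "other") from clades
    whose national prevalence reached >= 1% in any of the last 3 MMWR weeks.

    `counts` is assumed to have columns: mmwr_week, clade, sequences.
    """
    last_three = sorted(counts["mmwr_week"].unique())[-3:]
    recent = counts[counts["mmwr_week"].isin(last_three)]
    # Prevalence of each clade within each MMWR week.
    weekly_totals = recent.groupby("mmwr_week")["sequences"].transform("sum")
    recent = recent.assign(prevalence=recent["sequences"] / weekly_totals)
    # A clade qualifies if it hit at least 1% in any of the three weeks.
    qualifying = recent.groupby("clade")["prevalence"].max().loc[lambda s: s >= 0.01]
    # Keep the most prevalent clades, reserving one slot for "other".
    selected = qualifying.sort_values(ascending=False).head(max_clades - 1)
    return list(selected.index) + ["other"]
```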
Prediction horizon
Genomic sequences tend to be reported weeks after being collected, so recent data are subject to substantial backfill. For this reason, we will collect "nowcasts" (predictions for times prior to the current time, but not yet observed) and some "forecasts" (predictions for future observations). Counting the Wednesday submission date as a prediction horizon of zero, we will collect daily-level predictions for horizons from -31 days (the Sunday that starts the epidemic week four weeks before the Wednesday submission date) through +10 days (the Saturday that ends the epidemic week following the submission). Overall, six weeks (42 days) of predicted values will be solicited each week.
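As a concrete illustration of the horizon arithmetic, the following sketch enumerates the 42 target dates implied by a given Wednesday submission date (the specific date below is just an example).

```python
from datetime import date, timedelta

def target_dates(submission_wednesday: date) -> list[date]:
    """All target dates for one round: horizons -31 through +10, inclusive."""
    assert submission_wednesday.weekday() == 2  # Python weekday 2 == Wednesday
    return [submission_wednesday + timedelta(days=h) for h in range(-31, 11)]

dates = target_dates(date(2024, 10, 9))  # an example Wednesday
assert len(dates) == 42                  # six epidemic weeks of daily targets
assert dates[0].weekday() == 6           # Sunday opening the earliest epiweek
assert dates[-1].weekday() == 5          # Saturday closing the latest epiweek
```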
Model output format
This hub will follow [hubverse](https://hubverse.io/en/latest/) data standards. Submissions may contain mean model outputs, sample-based model outputs, or both. We use the term “model task” below to refer to a prediction for a specific clade, location, and horizon. For example, if mean model outputs are submitted, there will be one value between 0 and 1 for each model task, and the submitted values across all clades must sum to 1 for a given location and horizon. As we describe in further detail below, the target for prediction is the proportion of circulating viral genomes in a given location on a given target date that belong to a specified clade of the SARS-CoV-2 virus.
To submit probabilistic predictions, the hubverse [sample format](https://hubverse.io/en/latest/user-guide/sample-output-type.html) will be used to encode samples from the predictive distribution for each model task. We will require exactly 100 samples per model task. One key advantage of sample-based output is that dependence can be encoded across horizons (corresponding to trajectories of variant prevalence over time), or even across locations (see details in the hubverse sample model-output specifications). For this Variant Nowcast Hub, we will require that samples be submitted in a way that structures them into trajectories across clades and horizons. (See the following section for how variants will be classified into clade categories.) This means that
a) at each location and horizon, a common sample ID (in the output_type_id column) will help us ensure that the clade proportions sum to 1, and
b) for each location and clade, common sample IDs across horizons will allow us to draw trajectories by clade.
This specification corresponds to a hubverse-style “compound modeling task” whose compound fields are `reference_date` and `location`; samples then capture dependence across the complementary set of task IDs, `horizon` and `clade`, as in the sketch below.
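To illustrate, here is a hedged sketch of a sample-format submission satisfying both properties, using made-up clade labels and a placeholder model; it is not the hub's official validation code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
clades = ["24A", "24B", "other"]          # illustrative clade categories
horizons, n_samples = range(-31, 11), 100

rows = []
for s in range(n_samples):
    for h in horizons:
        # Placeholder model draw. Note these draws are independent across
        # horizons and merely share sample IDs, which is permitted (see the
        # note on post-hoc sample IDs below); a real model might instead
        # generate genuinely correlated trajectories.
        theta = rng.dirichlet(np.ones(len(clades)))
        for clade, value in zip(clades, theta):
            rows.append({"reference_date": "2024-10-09", "location": "MA",
                         "horizon": h, "clade": clade,
                         "output_type": "sample", "output_type_id": s,
                         "value": value})
submission = pd.DataFrame(rows)

# Check implied by (a): proportions sum to 1 per sample ID at each horizon.
sums = submission.groupby(["location", "horizon", "output_type_id"])["value"].sum()
assert np.allclose(sums, 1.0)
```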
We note that sample IDs present in the output_type_id column of submissions are not necessarily inherent properties of how the samples are generated, as they can be changed post-hoc by a modeler. For example, some models may make nowcasts independently by horizon but the samples could be tied together either randomly or via some other correlation structure or secondary model to assign sample IDs that are consistent across horizons. As another example, some models may make forecasts that have joint dependence structure across locations as well as horizons. Sample IDs could be shared across locations as well, but this is not required for the submission to pass validation.
While the hub will collect predictive means, a model must submit samples to be included in the hub ensemble; the ensemble's mean forecast will be obtained as a summary of the sample predictions.
Model evaluation challenges
Several features of these data make evaluation particularly tricky.
Data for some model tasks may be partially observed at the time nowcasts and forecasts are made. The hub wants to encourage teams to submit predictions of “true” underlying clade probabilities, which will vary more or less smoothly (if sometimes steeply) over time. When some observations are partially observed at the time of nowcast submission, it could be to a modeler's advantage to predict a value close to the frequency observed at the time the forecast is made, deviating from the underlying (perhaps smooth) function the model would predict in the absence of data. To incentivize “honest” nowcasts that do not shift predictions for time points with partial observations, we will only evaluate locations and dates for which no data had yet been reported at the time the processed dataset was created (see the Processed datasets section above). One implication of this decision is that different numbers of days may be evaluated for different locations. A sketch of this inclusion rule follows.
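The sketch below illustrates one way to apply the rule, assuming hypothetical data frames: `snapshot` holds counts as of the submission-week data snapshot, and `final` holds counts as of the evaluation date; both are assumed to have `location`, `target_date`, and `sequences` columns.

```python
import pandas as pd

def evaluable_rows(snapshot: pd.DataFrame, final: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows of `final` whose (location, target_date) pair had zero
    sequences reported at the time the processed dataset was created."""
    seen = snapshot.loc[snapshot["sequences"] > 0, ["location", "target_date"]]
    merged = final.merge(seen.drop_duplicates(), on=["location", "target_date"],
                         how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```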
The reference phylogenetic tree that defines clades changes over time. Nowcasts and forecasts will be evaluated against whatever sequence data are available 90 days after the submission deadline for a given round. Those sequences will be assigned clades using the reference tree that was used to generate the data immediately prior to the submission date. This means that new clades emerging after the predictions were made will still be classified as they would have been at prediction time.
The variance of the eventually observed clade counts depends on the eventual sample size, i.e., the number of sequences tested on a particular day. With a large number of sequences the count variance is larger, and with a small number it is smaller, but the number of sequences itself is not of particular epidemiological interest. The evaluation plan introduced below therefore evaluates the counts assuming they follow a multinomial distribution with size equal to the number of sequences reported for the target date and location as of the evaluation date, eliminating the count variance as a nuisance parameter.
Model evaluation
We will collect nowcasts for $\theta$, a $K$-vector whose $k$th element, $\theta_k$, is the true proportion of all current SARS-CoV-2 infections belonging to clade $k$, where $K$ is the number of clades of interest. We observe $C = (C_1, \dots, C_K)$, the vector of observed counts for each of the $K$ clades for a particular location and target date, and let $N = \sum_k C_k$ be the total number of sequences collected for that date and location. Variation in $C$ depends on the total number of sequenced samples, $N$; thus, accurate nowcasts of the observed $C$ would require teams to model and forecast $N$, which is not of epidemiological interest.
To avoid a situation where the distribution of the prediction target depends on $N$, nowcasts are to be submitted as 100 samples $\hat\theta^{(1)}, \dots, \hat\theta^{(100)}$ from the predictive distribution for $\theta$. Historical data show that 90 days is sufficient time for nearly all sequences to be tested and reported, and therefore for $C$ to represent a stable estimate of relative clade prevalences. Therefore, 90 days after each submission date, the hub will use the total number of observed sequences, $N$, together with the clade proportion nowcasts $\hat\theta^{(1)}, \dots, \hat\theta^{(100)}$, to generate predictions for the observed clade counts $\hat C^{(1)}, \dots, \hat C^{(100)}$, where each $\hat C^{(s)}$ is drawn from a $\text{Multinomial}(N, \hat\theta^{(s)})$ distribution, as sketched below.
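In code, this evaluation step might look like the following minimal sketch (the function name and array shapes are our own illustrative choices, not the hub's implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def count_predictions(theta_hat: np.ndarray, n_sequences: int) -> np.ndarray:
    """Draw one Multinomial(N, theta) count vector per proportion sample.

    theta_hat: (100, K) array of sampled clade proportions for one location
    and target date. Returns a (100, K) array of sampled clade counts.
    """
    return np.stack([rng.multinomial(n_sequences, t) for t in theta_hat])
```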
The use of a multinomial distribution assumes that, conditional on the mean prevalence, clade assignments for the sequenced samples are independent, with the probability of falling in each clade equal to the population probabilities $\theta$. Furthermore, while using a multinomial with size $N$ removes the need for teams to model the number of sequences at a given time, it also introduces a specific assumption about the variation in the observation process that turns probabilities and a size $N$ into counts. Teams that do not believe these assumptions may wish to modify their distribution for $\theta$ accordingly. For example, a team that believes an overdispersed Dirichlet-multinomial distribution would more accurately model the variation in future observations should add dispersion to its distribution for $\theta$; a team that believes sampling is biased and some clades are underrepresented in the reported data may wish to modify its estimate of $\theta$ to reflect the reporting process.
These count predictions $\hat C^{(1)}, \dots, \hat C^{(100)}$ will be scored against the observed counts $C$ using the energy score [Gneiting et al. 2008, Jordan et al. 2019], a proper scoring rule for multivariate data that computes scores from samples from the forecast distribution. We note that the procedure described above scores predictions (the probabilities) that can be seen as parameters of the distribution for the count observations, under the stated parametric distributional assumption; the probabilities are not explicitly predictions of the count observations themselves. We believe (derivation in progress, to be shared at a later date) that the scoring procedure outlined above is formally proper, as long as all assumptions are clearly stated to the modelers ahead of time.
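For concreteness, here is a generic sample-based estimator of the energy score, $\text{ES} = \frac{1}{S}\sum_s \lVert \hat C^{(s)} - C \rVert - \frac{1}{2S^2}\sum_{s,s'} \lVert \hat C^{(s)} - \hat C^{(s')} \rVert$; this is a standard textbook form, not the hub's official scoring code.

```python
import numpy as np

def energy_score(samples: np.ndarray, observed: np.ndarray) -> float:
    """Energy score from S multivariate samples (S, K) and one observation (K,).

    Lower is better; reduces to the CRPS estimator when K == 1.
    """
    # Mean distance from each predictive sample to the observation.
    term1 = np.linalg.norm(samples - observed, axis=1).mean()
    # Half the mean pairwise distance among predictive samples.
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    return float(term1 - 0.5 * pairwise.mean())
```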
Additionally, the point predictions $\hat\theta$ will be scored directly using the categorical Brier score, comparing the predicted clade proportions to the observed clade proportions on a specific day in a specific location [Susswein et al. 2023].
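A minimal sketch of this comparison, assuming the common multicategory form of the Brier score (a sum of squared differences between predicted and observed proportions):

```python
import numpy as np

def brier_score(theta_hat: np.ndarray, counts: np.ndarray) -> float:
    """Categorical Brier score for one location and target date.

    theta_hat: (K,) predicted clade proportions (the point prediction).
    counts: (K,) observed clade counts, converted here to proportions.
    """
    observed_props = counts / counts.sum()
    return float(np.sum((theta_hat - observed_props) ** 2))
```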
Conclusion
Decisions and statements presented above should be treated as preliminary and subject to change and discussion. We welcome input on any and all aspects of the design presented above. Please feel free to email comments to nick [at] umass [dot] edu or to leave a comment on this post.