Introduction and background
Since fall 2023, a group of researchers from the US CDC Center for Forecasting and Outbreak Analytics (CFA) and the Reich Lab at UMass Amherst, in consultation with folks from the Nextstrain project, have developed a plan to launch the SARS-CoV-2 Variant Nowcast Hub.
Collaborative and open forecast hubs have emerged as a valuable way to centralize and coordinate predictive modeling efforts for public health. In realms where multiple teams are tackling the same problem using different data inputs and/or modeling methodologies, a hub can standardize targets in ways that facilitate model comparison and the integration of outputs from multiple models into public health practice.
While SARS-CoV-2 variant dynamics received the most attention from the scientific community in 2021 and 2022, SARS-CoV-2 genomic sequences continue to be generated, and trends in which variants predominate will continue to impact transmission across the US and the world. From a modeling perspective, there is less consensus about a standard way to represent model outputs for multivariate variant predictions than there is for other outcomes. Therefore, a key reason we want to build this nowcast hub is to help us learn how best to build this kind of collaborative modeling effort, potentially not just for SARS-CoV-2 but also for other rapidly evolving pathogens.
This post summarizes some of the thinking to date about the set-up of the repository, with the goal of publicizing the effort and soliciting feedback on design choices while they can still easily be changed. Comments and further discussion on all items about this proposed hub are welcome.
Timeline
Discussions and planning for this hub have been ongoing since fall 2023, and we have a target launch date of fall 2024.
What will modelers be asked to predict?
Generally, we plan to solicit predictions of frequencies of the predominant lineages or clades in the United States, at a daily timescale and at the geographic resolution of US states plus two other jurisdictions (Washington, DC and Puerto Rico). Details about these choices follow in the subsections below. The hub will solicit predictions of frequencies (i.e., numbers between 0 and 1) associated with each clade or group of clades, for a particular location and a particular day.
Submission deadlines
We are planning for submissions to be due at 8pm ET every Wednesday. This time was chosen to give modelers time at the beginning of the week to run and adjust models, and stakeholders time at the end of the week to incorporate preliminary results into discussions or decision making.
Processed datasets, selecting variant clades
Each week, several days prior to the submission deadline, the hub administrators will provide two datasets:
1. a file with counts of reported sequences for each clade, by location and collection date, and
2. a file listing the clades whose frequencies teams will be asked to predict that week.
To create the clade count file (number 1 in the list above), hub administrators will oversee the creation of an aggregated version of the Nextstrain USA sequence count file based on US data from GenBank. The Nextstrain file is typically updated daily in the late evening US eastern time (it is only updated when new data are available), and the hub's weekly workflow will pull the most recent version when it runs. The precise lineage assignment model (sometimes referred to as a “reference tree”) that was used, as well as the version of the raw sequence data, will also be stored as metadata to facilitate reproducibility and evaluation. Versions of these target data files will be maintained by the hub, so that modeling teams can return later and retrospectively test their models against the data as they were available in real time.
To create the file that lists which clades will be accepted for a given week (file number 2 in the list above), in each submission week we will select up to ten Nextstrain clades from among those that had a reported prevalence of at least 1% across the US in any of the past three MMWR weeks. All remaining clades will be grouped into an “other” category, for which predictions of combined prevalence will also be collected; no more than ten categories (including “other”) will be selected in a given week. A sketch of this selection rule follows.
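To make the rule concrete, here is a minimal Python sketch. The column names (`mmwr_week`, `clade`, `sequences`) are hypothetical, and the hub's actual tooling may differ.

```python
import pandas as pd

def select_clades(counts: pd.DataFrame, max_clades: int = 10) -> list[str]:
    """Select up to `max_clades` categories (including "other") from clades
    whose national prevalence reached >= 1% in any of the last 3 MMWR weeks.

    `counts` is assumed to have columns: mmwr_week, clade, sequences.
    """
    last_three = sorted(counts["mmwr_week"].unique())[-3:]
    recent = counts[counts["mmwr_week"].isin(last_three)]
    # Prevalence of each clade within each MMWR week.
    weekly_totals = recent.groupby("mmwr_week")["sequences"].transform("sum")
    recent = recent.assign(prevalence=recent["sequences"] / weekly_totals)
    # A clade qualifies if it hit at least 1% in any of the three weeks.
    qualifying = recent.groupby("clade")["prevalence"].max().loc[lambda s: s >= 0.01]
    # Keep the most prevalent clades, reserving one slot for "other".
    selected = qualifying.sort_values(ascending=False).head(max_clades - 1)
    return list(selected.index) + ["other"]
```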
Prediction horizon
Genomic sequences tend to be reported weeks after being collected, so recent data are subject to substantial backfill. For this reason, we will collect "nowcasts" (predictions for times prior to the current time, but not yet observed) and some "forecasts" (predictions for future observations). Counting the Wednesday submission date as a prediction horizon of zero, we will collect daily-level predictions for horizons from -31 days (the Sunday that starts the epidemic week four weeks before the Wednesday submission date) through +10 days (the Saturday that ends the epidemic week following the submission). Overall, six weeks (42 days) of predicted values will be solicited each week.
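As a concrete illustration of the horizon arithmetic, the following sketch enumerates the 42 target dates implied by a given Wednesday submission date (the specific date below is just an example).

```python
from datetime import date, timedelta

def target_dates(submission_wednesday: date) -> list[date]:
    """All target dates for one round: horizons -31 through +10, inclusive."""
    assert submission_wednesday.weekday() == 2  # Python weekday 2 == Wednesday
    return [submission_wednesday + timedelta(days=h) for h in range(-31, 11)]

dates = target_dates(date(2024, 10, 9))  # an example Wednesday
assert len(dates) == 42                  # six epidemic weeks of daily targets
assert dates[0].weekday() == 6           # Sunday opening the earliest epiweek
assert dates[-1].weekday() == 5          # Saturday closing the latest epiweek
```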
Model output format
This hub will follow [hubverse](https://hubverse.io/en/latest/) data standards. Submissions may contain mean model outputs, sample-based model outputs, or both. We use the term “model task” below to refer to a prediction for a specific clade, location, and horizon. For example, if mean model outputs are submitted, there will be one value between 0 and 1 for each model task, and the submitted values across all clades must sum to 1 for a given location and horizon. As we describe in further detail below, the target for prediction is the proportion of circulating viral genomes in a given location on a given target date that belong to a specified clade of the SARS-CoV-2 virus.
To submit probabilistic predictions, the hubverse [sample format](https://hubverse.io/en/latest/user-guide/sample-output-type.html) will be used to encode samples from the predictive distribution for each model task. We will require exactly 100 samples per model task. One key advantage of sample-based output is that dependence can be encoded across horizons (corresponding to trajectories of variant prevalence over time), or even across locations (see details in the hubverse sample model-output specifications). For this Variant Nowcast Hub, we will require that samples be submitted in a way that structures them into trajectories across clades and horizons. (See the following section for how variants will be classified into clade categories.) This means that
a) at each location and horizon, a common sample ID (in the output_type_id column) will help us ensure that the clade proportions sum to 1, and
b) for each location and clade, common sample IDs across horizons will allow us to draw trajectories by clade.
This specification corresponds to a hubverse-style “compound modeling task” whose compound fields are `reference_date` and `location`; samples then capture dependence across the complementary set of task IDs, `horizon` and `clade`, as in the sketch below.
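To illustrate, here is a hedged sketch of a sample-format submission satisfying both properties, using made-up clade labels and a placeholder model; it is not the hub's official validation code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
clades = ["24A", "24B", "other"]          # illustrative clade categories
horizons, n_samples = range(-31, 11), 100

rows = []
for s in range(n_samples):
    for h in horizons:
        # Placeholder model draw. Note these draws are independent across
        # horizons and merely share sample IDs, which is permitted (see the
        # note on post-hoc sample IDs below); a real model might instead
        # generate genuinely correlated trajectories.
        theta = rng.dirichlet(np.ones(len(clades)))
        for clade, value in zip(clades, theta):
            rows.append({"reference_date": "2024-10-09", "location": "MA",
                         "horizon": h, "clade": clade,
                         "output_type": "sample", "output_type_id": s,
                         "value": value})
submission = pd.DataFrame(rows)

# Check implied by (a): proportions sum to 1 per sample ID at each horizon.
sums = submission.groupby(["location", "horizon", "output_type_id"])["value"].sum()
assert np.allclose(sums, 1.0)
```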
We note that sample IDs present in the output_type_id column of submissions are not necessarily inherent properties of how the samples are generated, as they can be changed post-hoc by a modeler. For example, some models may make nowcasts independently by horizon but the samples could be tied together either randomly or via some other correlation structure or secondary model to assign sample IDs that are consistent across horizons. As another example, some models may make forecasts that have joint dependence structure across locations as well as horizons. Sample IDs could be shared across locations as well, but this is not required for the submission to pass validation.
While the hub will collect predictive means, a model must submit samples to be included in the hub ensemble; the ensemble's mean forecast will be obtained as a summary of the sample predictions.
Model evaluation challenges
Several features of these data make evaluation particularly tricky.
Data for some model tasks may be partially observed at the time nowcasts and forecasts are made. The hub wants to encourage teams to submit predictions of “true” underlying clade probabilities, which will vary more or less smoothly (if sometimes steeply) over time. When some observations are partially observed at the time of nowcast submission, it could be to a modeler's advantage to predict a value close to the frequency observed at the time the forecast is made, deviating from the underlying (perhaps smooth) function the model would predict in the absence of data. To incentivize “honest” nowcasts that do not shift predictions for time points with partial observations, we will only evaluate locations and dates for which no data had yet been reported at the time the processed dataset was created (see the Processed datasets section above). One implication of this decision is that different numbers of days may be evaluated for different locations. A sketch of this inclusion rule follows.
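The sketch below illustrates one way to apply the rule, assuming hypothetical data frames: `snapshot` holds counts as of the submission-week data snapshot, and `final` holds counts as of the evaluation date; both are assumed to have `location`, `target_date`, and `sequences` columns.

```python
import pandas as pd

def evaluable_rows(snapshot: pd.DataFrame, final: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows of `final` whose (location, target_date) pair had zero
    sequences reported at the time the processed dataset was created."""
    seen = snapshot.loc[snapshot["sequences"] > 0, ["location", "target_date"]]
    merged = final.merge(seen.drop_duplicates(), on=["location", "target_date"],
                         how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```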
The reference phylogenetic tree that defines clades changes over time. Nowcasts and forecasts will be evaluated against whatever sequence data are available 90 days after the submission deadline for a given round. Those sequences will be assigned clades using the reference tree that was used to generate the data immediately prior to the submission date. This means that new clades emerging after the predictions were made will still be classified as they would have been at prediction time.
The variance of the eventually observed clade counts depends on the eventual sample size, i.e., the number of sequences tested on a particular day. With a large number of sequences the count variance is larger, and with a small number it is smaller, but the number of sequences itself is not of particular epidemiological interest. The evaluation plan introduced below therefore evaluates the counts assuming they follow a multinomial distribution with size equal to the number of sequences reported for the target date and location as of the evaluation date, eliminating the count variance as a nuisance parameter.
Model evaluation
We will collect nowcasts for $\theta$, a $K$-vector whose $k$th element, $\theta_k$, is the true proportion of all current SARS-CoV-2 infections belonging to clade $k$, where $K$ is the number of clades of interest. We observe $C = (C_1, \dots, C_K)$, the vector of observed counts for each of the $K$ clades for a particular location and target date, and let $N = \sum_k C_k$ be the total number of sequences collected for that date and location. Variation in $C$ depends on the total number of sequenced samples, $N$; thus, accurate nowcasts of the observed $C$ would require teams to model and forecast $N$, which is not of epidemiological interest.
To avoid a situation where the distribution of the prediction target depends on $N$, nowcasts are to be submitted as 100 samples $\hat\theta^{(1)}, \dots, \hat\theta^{(100)}$ from the predictive distribution for $\theta$. Historical data show that 90 days is sufficient time for nearly all sequences to be tested and reported, and therefore for $C$ to represent a stable estimate of relative clade prevalences. Therefore, 90 days after each submission date, the hub will use the total number of observed sequences, $N$, together with the clade proportion nowcasts $\hat\theta^{(1)}, \dots, \hat\theta^{(100)}$, to generate predictions for the observed clade counts $\hat C^{(1)}, \dots, \hat C^{(100)}$, where each $\hat C^{(s)}$ is drawn from a $\text{Multinomial}(N, \hat\theta^{(s)})$ distribution, as sketched below.
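In code, this evaluation step might look like the following minimal sketch (the function name and array shapes are our own illustrative choices, not the hub's implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def count_predictions(theta_hat: np.ndarray, n_sequences: int) -> np.ndarray:
    """Draw one Multinomial(N, theta) count vector per proportion sample.

    theta_hat: (100, K) array of sampled clade proportions for one location
    and target date. Returns a (100, K) array of sampled clade counts.
    """
    return np.stack([rng.multinomial(n_sequences, t) for t in theta_hat])
```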
The use of a multinomial distribution assumes that, conditional on the mean prevalence, clade assignments for the sequenced samples are independent, with the probability of falling in each clade equal to the population probabilities $\theta$. Furthermore, while using a multinomial with size $N$ removes the need for teams to model the number of sequences at a given time, it also introduces a specific assumption about the variation in the observation process that turns probabilities and a size $N$ into counts. Teams that do not believe these assumptions may wish to modify their distribution for $\theta$ accordingly. For example, a team that believes an overdispersed Dirichlet-multinomial distribution would more accurately model the variation in future observations should add dispersion to its distribution for $\theta$; a team that believes sampling is biased and some clades are underrepresented in the reported data may wish to modify its estimate of $\theta$ to reflect the reporting process.
These count predictions $\hat C^{(1)}, \dots, \hat C^{(100)}$ will be scored against the observed counts $C$ using the energy score [Gneiting et al. 2008, Jordan et al. 2019], a proper scoring rule for multivariate data that computes scores from samples from the forecast distribution. We note that the procedure described above scores predictions (the probabilities) that can be seen as parameters of the distribution for the count observations, under the stated parametric distributional assumption; the probabilities are not explicitly predictions of the count observations themselves. We believe (derivation in progress, to be shared at a later date) that the scoring procedure outlined above is formally proper, as long as all assumptions are clearly stated to the modelers ahead of time.
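For concreteness, here is a generic sample-based estimator of the energy score, $\text{ES} = \frac{1}{S}\sum_s \lVert \hat C^{(s)} - C \rVert - \frac{1}{2S^2}\sum_{s,s'} \lVert \hat C^{(s)} - \hat C^{(s')} \rVert$; this is a standard textbook form, not the hub's official scoring code.

```python
import numpy as np

def energy_score(samples: np.ndarray, observed: np.ndarray) -> float:
    """Energy score from S multivariate samples (S, K) and one observation (K,).

    Lower is better; reduces to the CRPS estimator when K == 1.
    """
    # Mean distance from each predictive sample to the observation.
    term1 = np.linalg.norm(samples - observed, axis=1).mean()
    # Half the mean pairwise distance among predictive samples.
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    return float(term1 - 0.5 * pairwise.mean())
```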
Additionally, the point predictions $\hat\theta$ will be scored directly using the categorical Brier score, comparing the predicted clade proportions to the observed clade proportions on a specific day in a specific location [Susswein et al. 2023].
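A minimal sketch of this comparison, assuming the common multicategory form of the Brier score (a sum of squared differences between predicted and observed proportions):

```python
import numpy as np

def brier_score(theta_hat: np.ndarray, counts: np.ndarray) -> float:
    """Categorical Brier score for one location and target date.

    theta_hat: (K,) predicted clade proportions (the point prediction).
    counts: (K,) observed clade counts, converted here to proportions.
    """
    observed_props = counts / counts.sum()
    return float(np.sum((theta_hat - observed_props) ** 2))
```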
Conclusion
Decisions and statements presented above should be treated as preliminary and subject to change and discussion. We welcome input on any and all aspects of the design presented above. Please feel free to email comments to nick [at] umass [dot] edu or to leave a comment on this post.