Use weighted sampling #1141
Comments
Reopening as the implementation didn't seem to work for at least global builds, see for context:
As implemented (prior to revert), the recent samples become obsolete as they become essentially subsets of the longer window builds. The intent of recent builds (1m/2m) was to have builds that include as many recent sequences as possible, as those are the most interesting for spotting new developments.
I don't know how population weighting is implemented. Based on the result observed, it seems plausible to me that "max sequences" assumes that all countries contribute their full quota, which is very much not true at a global scale and especially so for recent submissions. This means that the effective sample will be much lower than the "max sequences", in contradistinction to the original augur filter meaning of "max sequences", where you would pretty much get max sequences no matter the grouping.
A temporary workaround, if one wanted to keep using the new population weighting feature, is to scale up the max sequences parameter to something like 15k to get an effective sample of 4k. This would be theoretically risky: if countries were to scale up sequencing, we could end up with more sequences than we really want (~5k) [this is unlikely to happen in practice so is more of a theoretical issue]. Another issue is that, as sequencing activity will likely decrease further over time, we would have to keep increasing the max sequences knob, which is not ideal.
After reviewing more of the prior PRs, I realize that the revert of #1161 doesn't fully restore the approach used for global builds prior to the new population weighted sampling. That would require reinstating the splitting of countries in Asia.
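To make the gap between the requested and the effective sample concrete, here is a minimal sketch. It assumes that each country's target is its per-capita share of max sequences and that a country can never contribute more than it has available. The function name, populations, and sequence counts are all hypothetical, and this is not how augur itself is implemented.

```python
def effective_sample(max_sequences, weights, available):
    """Total sequences obtained when each weighted target is capped by availability.

    Assumption: targets are proportional to the weights and are not
    redistributed when a group cannot fill its quota.
    """
    total_weight = sum(weights.values())
    total = 0
    for country, weight in weights.items():
        target = round(max_sequences * weight / total_weight)
        total += min(target, available.get(country, 0))
    return total


# Hypothetical populations (millions) and recent sequence counts per country.
population = {"A": 1400, "B": 330, "C": 80, "D": 10}
available = {"A": 50, "B": 3000, "C": 900, "D": 400}

print(effective_sample(4000, population, available))   # 973
print(effective_sample(15000, population, available))  # 3511
```

In this toy setup the large, under-sampled country "A" absorbs most of the quota but fills almost none of it, which is why the knob has to be raised well above the number of sequences actually wanted.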
Population weighting divides the max sequences among countries per capita instead of the default equal weighting. The concept of "max sequences" for both weighted and uniform sampling is a limitation of
Continuing this in #1161 (comment), but I think it may be worth reconsidering whether to use weighted sampling at all for the 1m/2m focal samples or simply take whatever is available at the time. Weighted sampling only makes sense when there are enough samples and minimal under-sampling.
Note
#1106 came first. This is a higher level summary written after some design discussions happened in that PR.
Context
There is much sampling bias in SARS-CoV-2 data. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.
Currently, this is approximated by sampling with various `group_by`s - a combination of geographic (`division`/`country`) and temporal (`month`/`week`) attributes - to define groups that are then uniformly sampled based on a target `max_sequences`.
The need for uniform sampling at the group level is an inherent limitation of `augur filter`. It has prompted workarounds in this workflow such as #1074.
Proposal
There is a proposal to remove the limitation of `augur filter`: nextstrain/augur#1318. The option to specify sampling weights could be directly used in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.
Case count data was also considered as a potential source of weights; however, it was determined that population data would be a better source. See discussion: #1106 (comment)
¹ Weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries would still be under-sampled, resulting in fewer total sequences than requested by `max_sequences`. This is already the case with current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.
Progress