Use weighted sampling #1141

Open
4 of 5 tasks
victorlin opened this issue Aug 14, 2024 · 2 comments · Fixed by #1151
Assignees
Labels
enhancement New feature or request

Comments

@victorlin
Member

victorlin commented Aug 14, 2024

Note

#1106 came first. This is a higher level summary written after some design discussions happened in that PR.

Context

There is much sampling bias in SARS-CoV-2 data. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.

Currently, this is approximated by sampling with various group_bys - a combination of geographic (division/country) and temporal (month/week) attributes - to define groups that are then uniformly sampled based on a target max_sequences.
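To make the current behavior concrete, here is a minimal sketch of uniform group-level sampling. It is an illustration only, not augur filter's actual implementation; the function name `uniform_group_sample`, the record fields, and the toy data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def uniform_group_sample(records, group_by, max_sequences, seed=0):
    """Sample up to max_sequences records, giving every group the same target.

    Hypothetical helper for illustration; not part of augur.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    if not groups:
        return []

    # Every group gets the same target, regardless of its real-world size.
    per_group = max_sequences // len(groups)
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # A group cannot contribute more sequences than it actually has.
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


# Made-up data: country A has 50 sequences, country B has only 3.
records = (
    [{"strain": f"A{i}", "country": "A", "month": "2024-07"} for i in range(50)]
    + [{"strain": f"B{i}", "country": "B", "month": "2024-07"} for i in range(3)]
)
sample = uniform_group_sample(records, ["country", "month"], max_sequences=20)
print(len(sample))  # 13: 10 from A, 3 from B
```

In this toy case both groups get a target of 10, so the output over-represents whichever country has the smaller real-world population, and the total falls short of 20 because B cannot fill its share.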

The need for uniform sampling at the group level is an inherent limitation of augur filter. It has prompted workarounds in this workflow such as #1074.

Proposal

There is a proposal to remove the limitation of augur filter: nextstrain/augur#1318. The option to specify sampling weights could be directly used in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.

Case count data was also considered as a potential source of weights; however, it was determined that population data would be a better source. See discussion: #1106 (comment)

¹ Weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries would still be under-sampled, resulting in fewer total sequences than requested by max_sequences. This is already the case with current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.
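The footnote's arithmetic can be sketched as follows. This is a simplified model under stated assumptions, not the augur filter algorithm: the helper names `weighted_targets` and `effective_sample_size` are hypothetical, and the populations and sequence counts are made up.

```python
def weighted_targets(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights."""
    total = sum(weights.values())
    return {group: round(max_sequences * w / total) for group, w in weights.items()}


def effective_sample_size(targets, available):
    """Sequences actually obtained when each group is capped by availability."""
    return sum(min(target, available.get(group, 0)) for group, target in targets.items())


# Made-up numbers: X is large but under-sampled, Y and Z are well sampled.
population = {"X": 1_400, "Y": 300, "Z": 100}   # millions of people (hypothetical)
available = {"X": 50, "Y": 5_000, "Z": 2_000}   # sequences in the input (hypothetical)

targets = weighted_targets(population, max_sequences=4_000)
print(targets)                                    # {'X': 3111, 'Y': 667, 'Z': 222}
print(effective_sample_size(targets, available))  # 939
```

Here X is allotted most of the 4,000 sequences but can supply only 50, and its unused share is not redistributed, so the output contains 939 sequences.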

Progress

@corneliusroemer
Member

Reopening as the implementation didn't seem to work for at least global builds, see for context:

As implemented (prior to revert), the recent samples become obsolete as they become essentially subsets of the longer window builds. The intent of recent builds (1m/2m) was to have builds that include as many recent sequences as possible as those are the most interesting for spotting new developments.

I don't know how population weighting is implemented. Based on the result observed, it seems plausible to me that "max sequences" assumes that all countries contribute their full quota which is very much not true at a global scale and especially so for recent submissions.

This means that the effective sample will be much lower than the "max sequences", in contradistinction to the original augur filter meaning of "max sequences", where you would pretty much get max sequences no matter the grouping.

A temporary workaround, if one wanted to keep using the new population weighting feature, is to scale up the max sequences parameter to something like 15k to get an effective sample of 4k. This would be theoretically risky: if countries were to scale up sequencing, we could end up with more sequences than we really want (~5k) [this is unlikely to happen in practice, so it is more of a theoretical issue]. Another issue is that, as sequencing activity will likely decrease further over time, we would have to keep turning up the max sequences knob, which is not ideal.
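A rough sketch of that workaround and its risk, using a simplified model in which each country's target is its population share of max sequences, capped by what is available. The helper names and all numbers are made up for illustration and do not correspond to real build parameters.

```python
def effective_size(weights, available, max_sequences):
    """Total sequences obtained when weighted targets are capped by availability."""
    total = sum(weights.values())
    return sum(
        min(round(max_sequences * w / total), available.get(group, 0))
        for group, w in weights.items()
    )


def smallest_max_sequences(weights, available, desired, step=1_000, limit=1_000_000):
    """Smallest multiple of `step` whose effective sample reaches `desired`."""
    for max_sequences in range(step, limit + 1, step):
        if effective_size(weights, available, max_sequences) >= desired:
            return max_sequences
    return None


weights = {"X": 1_400, "Y": 300, "Z": 100}      # hypothetical populations
available = {"X": 50, "Y": 5_000, "Z": 2_000}   # hypothetical sequence counts

needed = smallest_max_sequences(weights, available, desired=4_000)
print(needed)                                      # 18000
print(effective_size(weights, available, needed))  # 4050

# If X later scales up sequencing, the inflated setting overshoots badly.
scaled_up = {**available, "X": 20_000}
print(effective_size(weights, scaled_up, needed))  # 18000
```

The inflated setting only yields the intended sample while the large country stays under-sampled, which is why it has to be re-tuned whenever sequencing activity changes.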

After reviewing more of the prior PRs, I realize that the revert of #1161 doesn't fully restore the approach prior to new population weighted sampling for global builds. That would require reinstating the splitting of countries in Asia.

@victorlin victorlin reopened this Jan 2, 2025
@victorlin
Member Author

I don't know how population weighting is implemented. Based on the result observed, it seems plausible to me that "max sequences" assumes that all countries contribute their full quota which is very much not true at a global scale and especially so for recent submissions.

Population weighting divides the max sequences among countries per capita instead of the default equal weighting. The concept of "max sequences" for both weighted and uniform sampling is a limitation of augur filter – it doesn't take into consideration what's actually available in the input data (i.e. the problem is under-sampling which becomes more apparent with population weights when large countries do not contribute many samples).

After reviewing more of the prior PRs, I realize that the revert of #1161 doesn't fully restore the approach prior to new population weighted sampling for global builds. That would require reinstating the splitting of countries in Asia.

Continuing this in #1161 (comment), but I think it may be worth reconsidering whether to use weighted sampling at all for the 1m/2m focal samples or simply take whatever is available at the time. Weighted sampling only makes sense when there are enough samples and minimal under-sampling.
