From 17ca9602d646b9802899ca5141bcf774634a97b6 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Mon, 19 Aug 2024 12:37:25 -0700
Subject: [PATCH] Adjust multiple augur filter section for weighted sampling

Weighted sampling makes this scenario technically feasible, but
practically difficult to achieve in a single augur filter call. Explain
this trade-off in detail.
---
 .../filtering-and-subsampling.rst             | 36 ++++++++++++++-----
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index f1e2b924..901c8e0b 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -266,22 +266,42 @@ Subsampling using multiple ``augur filter`` commands
 ====================================================

 There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
+does not suffice or is difficult to put together. One such strategy is "tiered
+subsampling". In this strategy, mutually exclusive sets of filters, each
+representing a "tier", are sampled with different subsampling rules. This is
+commonly used to create geographic tiers. Consider this subsampling scheme:

     Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.

-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
+This can be approximated by ``--subsample-max-sequences 150`` + ``--group-by state`` +
+``--group-by-weights weights.tsv`` with this ``weights.tsv``:
+
+.. code-block::
+
+    state   weight
+    WA      100
+    OR      1.02
+    CA      1.02
+    ...
+
+The above is rather complex, needing a list of all other states and a calculation to determine their weights:
+
+.. math::
+
+    {n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}} = 50 * \frac{1}{49} \approx 1.02
+
+A simpler approach is to decompose this into multiple schemes, each handled by a
+single call to ``augur filter``. Additionally, there is an extra step to combine
+the intermediate samples.

 1. Sample 100 sequences from Washington state.
 2. Sample 50 sequences from the rest of the United States.
 3. Combine the samples.

+.. note::
+
+    FIXME: add note on difference compared to previous example due to lack of ``--group-by``
+
 Calling ``augur filter`` multiple times
 ---------------------------------------