From 17ca9602d646b9802899ca5141bcf774634a97b6 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Mon, 19 Aug 2024 12:37:25 -0700
Subject: [PATCH] Adjust multiple augur filter section for weighted sampling

Weighted sampling makes this scenario technically feasible, but
practically difficult to achieve in a single augur filter call. Explain
this trade-off in detail.
---
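To illustrate the weights calculation this patch describes, here is a minimal Python sketch that builds the ``weights.tsv`` rows for the Washington example. It assumes the focal state gets a weight equal to its target count and that the remaining count is split evenly across the other 49 states, as in the formula in the patch. The helper names ``tier_weights`` and ``to_tsv`` and the ``state_N`` placeholder names are hypothetical and not part of Augur.

```python
import csv
import io


def tier_weights(focal_state, focal_count, other_states, other_count):
    """Return (state, weight) rows for a two-tier scheme.

    The focal state is weighted by its target count; the remaining
    count is split evenly across all other states.
    """
    per_state = other_count / len(other_states)
    rows = [(focal_state, focal_count)]
    rows += [(state, round(per_state, 2)) for state in other_states]
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["state", "weight"])
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder list: a real weights file needs the actual names of all
# 49 other states as they appear in the metadata.
other_states = ["OR", "CA"] + [f"state_{i}" for i in range(47)]
rows = tier_weights("WA", 100, other_states, 50)
tsv_text = to_tsv(rows)
```

The sketch shows why the single-call approach is cumbersome: the full list of other states has to be maintained, and the weights must be recomputed whenever the target counts change.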
 .../filtering-and-subsampling.rst             | 36 ++++++++++++++-----
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index f1e2b924..901c8e0b 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -266,22 +266,42 @@ Subsampling using multiple ``augur filter`` commands
 ====================================================
 
 There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
+does not suffice or is difficult to put together. One such strategy is "tiered
+subsampling". In this strategy, mutually exclusive sets of filters, each
+representing a "tier", are sampled with different subsampling rules. This is
+commonly used to create geographic tiers. Consider this subsampling scheme:
 
    Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
 
-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
+This can be approximated by combining ``--subsample-max-sequences 150``, ``--group-by state``,
+and ``--group-by-weights weights.tsv``, using this ``weights.tsv``:
+
+.. code-block::
+
+   state	weight
+   WA	100
+   OR	1.02
+   CA	1.02
+   ...
+
+This setup is cumbersome: it requires a list of every other state and a calculation to determine each state's weight:
+
+.. math::
+
+  n_{\text{other sequences}} \times \frac{1}{n_{\text{other states}}} = 50 \times \frac{1}{49} \approx 1.02
+
+A simpler approach is to decompose this into multiple schemes, each handled by a
+single call to ``augur filter``, followed by an extra step that combines the
+intermediate samples.
 
    1. Sample 100 sequences from Washington state.
    2. Sample 50 sequences from the rest of the United States.
    3. Combine the samples.
 
+.. note::
+
+   FIXME: add note on difference compared to previous example due to lack of ``--group-by``
+
 Calling ``augur filter`` multiple times
 ---------------------------------------