Add section for weighted sampling

nextstrain · Aug 19, 2024 · 6ef6a19
1 parent f0af1b9
commit 6ef6a19
Showing 1 changed file with 34 additions and 1 deletion.
diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -180,7 +180,7 @@ For example, limit the output to 100 sequences:
      --output-metadata subsampled_metadata.tsv
 
 Random sampling is easy to define but can expose sampling bias in some datasets.
-Consider uniform sampling to reduce sampling bias.
+Consider another sampling method to reduce sampling bias.
 
 Uniform sampling
 ----------------
@@ -218,6 +218,39 @@ per month from each region:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
+Weighted sampling
+-----------------
+
+``--group-by-weights`` can be specified in addition to ``--group-by`` to set
+different target sizes per group. For example, to target twice as many
+sequences from Asia as from each other region, first create a file named
+``weights.tsv``:
+
+.. code-block::
+
+   region	weight
+   Asia	2
+   default	1
+   ...
+
+The file format is described in the ``augur filter`` documentation for
+``--group-by-weights``.
+
+Then pass the file to the command with ``--group-by-weights weights.tsv``:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by region year month \
+     --group-by-weights weights.tsv \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
 Caveats
 -------