From 6ef6a19d50bd98e766087d995d5c3020078ea0d3 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 16 Aug 2024 15:03:55 -0700
Subject: [PATCH] Add section for weighted sampling

---
 .../filtering-and-subsampling.rst             | 35 ++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index 96a66ce6..f1e2b924 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -180,7 +180,7 @@ For example, limit the output to 100 sequences:
       --output-metadata subsampled_metadata.tsv
 
 Random sampling is easy to define but can expose sampling bias in some datasets.
-Consider uniform sampling to reduce sampling bias.
+Consider another sampling method to reduce sampling bias.
 
 Uniform sampling
 ----------------
@@ -218,6 +218,39 @@ per month from each region:
       --output-sequences subsampled_sequences.fasta \
       --output-metadata subsampled_metadata.tsv
 
+Weighted sampling
+-----------------
+
+``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
+different target sizes per group. For example, target twice the amount of
+sequences from Asia compared to other regions. First, create a file
+``weights.tsv``:
+
+.. code-block::
+
+    region	weight
+    Asia	2
+    default	1
+    ...
+
+The format specifications are described in ``augur filter`` docs for
+``--group-by-weights``.
+
+Add the option by using ``--group-by-weights weights.tsv`` in the command:
+
+.. code-block:: bash
+
+    augur filter \
+      --sequences data/sequences.fasta \
+      --metadata data/metadata.tsv \
+      --min-date 2012 \
+      --exclude exclude.txt \
+      --group-by region year month \
+      --group-by-weights weights.tsv \
+      --subsample-max-sequences 100 \
+      --output-sequences subsampled_sequences.fasta \
+      --output-metadata subsampled_metadata.tsv
+
 Caveats
 -------