Add section for weighted sampling
victorlin committed Aug 19, 2024
1 parent f0af1b9 commit 6ef6a19
Showing 1 changed file with 34 additions and 1 deletion.
35 changes: 34 additions & 1 deletion src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -180,7 +180,7 @@ For example, limit the output to 100 sequences:
      --output-metadata subsampled_metadata.tsv

Random sampling is easy to define but can expose sampling bias in some datasets.
-Consider uniform sampling to reduce sampling bias.
+Consider another sampling method to reduce sampling bias.

Uniform sampling
----------------
@@ -218,6 +218,39 @@ per month from each region:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv

Weighted sampling
-----------------

``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
different target sizes per group. For example, to target twice as many
sequences from Asia as from each of the other regions, first create a file
``weights.tsv``:

.. code-block::

    region weight
    Asia 2
    default 1
    ...

The file format is described in the ``augur filter`` documentation for
``--group-by-weights``.
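
One way to create the file from a shell is sketched below. It assumes the
columns are separated by tabs, as the ``.tsv`` extension suggests, and it writes
only the header and the two rows shown in the example; ``printf`` is used so
that the tab characters are explicit rather than dependent on editor settings.

```shell
# Write weights.tsv with explicit tab separators
# (assumption: tab-separated, inferred from the .tsv extension).
# Asia gets weight 2; per the example above, "default" stands for the
# remaining regions.
printf 'region\tweight\n' >  weights.tsv
printf 'Asia\t2\n'        >> weights.tsv
printf 'default\t1\n'     >> weights.tsv
```

Check the ``augur filter`` documentation before relying on this layout, since
it is the authoritative description of the format.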

Then pass the file with ``--group-by-weights weights.tsv`` in the command:

.. code-block:: bash

    augur filter \
      --sequences data/sequences.fasta \
      --metadata data/metadata.tsv \
      --min-date 2012 \
      --exclude exclude.txt \
      --group-by region year month \
      --group-by-weights weights.tsv \
      --subsample-max-sequences 100 \
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
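
To see roughly how the weights affected the result, the output metadata can be
tallied by region. The helper below is a hypothetical convenience function, not
part of Augur; it assumes the output metadata is a tab-separated file with a
header row that contains a ``region`` column.

```shell
# Count rows per value of a named column in a TSV file with a header row.
# Usage: count_by_column FILE COLUMN
# (hypothetical helper for illustration; not an Augur command)
count_by_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$idx]++ }
        END { for (value in counts) printf "%s\t%d\n", value, counts[value] }
    ' "$1" | sort
}

# Tally the subsampled metadata only if the file from the command above exists.
if [ -f subsampled_metadata.tsv ]; then
    count_by_column subsampled_metadata.tsv region
fi
```

The counts are unlikely to match the weights exactly, because the number of
sequences drawn from a group also depends on how many sequences are available
in that group.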

Caveats
-------
