Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add weighted sampling docs #223

Merged
merged 5 commits into from
Aug 22, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
157 changes: 130 additions & 27 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ Selection method:
* - Subsampling
- * ``--subsample-max-sequences``
* ``--group-by``
* ``--group-by-weights``
* ``--sequences-per-group``
* ``--probabilistic-sampling``
* ``--no-probabilistic-sampling``
Expand Down Expand Up @@ -158,10 +159,11 @@ Subsampling

Another common filtering operation is **subsampling**: selection of data using
rules based on output size rather than individual sequence attributes. These are
the sampling methods supported by ``augur filter`` and a final section for caveats:
the sampling methods supported by ``augur filter``:

.. contents::
:local:
:depth: 2

Random sampling
---------------
Expand All @@ -181,14 +183,25 @@ For example, limit the output to 100 sequences:
--output-metadata subsampled_metadata.tsv

Random sampling is easy to define but can expose sampling bias in some datasets.
Consider uniform sampling to reduce sampling bias.
Consider using grouped sampling to reduce sampling bias.

Uniform sampling
Grouped sampling
----------------

``--group-by`` allows you to partition the data into groups based on column
values and sample uniformly. For example, sample evenly across countries over
time:
values and sample a number of sequences per group.

Grouped sampling can be further divided into two types with a final section for
caveats:

.. contents::
:local:

Uniform sampling
~~~~~~~~~~~~~~~~

By default (i.e. without ``--group-by-weights``), ``--group-by`` will sample
uniformly across groups. For example, sample evenly across regions over time:

.. code-block:: bash

Expand All @@ -197,15 +210,15 @@ time:
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude exclude.txt \
--group-by country year month \
--group-by region year month \
--subsample-max-sequences 100 \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv

An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``.
This is useful if you care less about total sample size and more about having
a fixed number of sequences from each group. For example, target one sequence
per month from each country:
per month from each region:

.. code-block:: bash

Expand All @@ -214,19 +227,57 @@ per month from each country:
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude exclude.txt \
--group-by country year month \
--group-by region year month \
--sequences-per-group 1 \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv

Weighted sampling
~~~~~~~~~~~~~~~~~

``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
different target sizes per group. For example, target twice the amount of
sequences from Asia compared to other regions using this ``weights.tsv`` file:

.. list-table::
:header-rows: 1

* - region
- weight
* - Asia
- 2
* - default
- 1

and command:

.. code-block:: bash

augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude exclude.txt \
--group-by region year month \
--group-by-weights weights.tsv \
--subsample-max-sequences 100 \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv

The weights file format is described in ``augur filter`` docs for
``--group-by-weights``.

Caveats
~~~~~~~

Probabilistic sampling
----------------------
``````````````````````

It is possible to encounter situations in uniform sampling where the number of
groups exceeds the target sample size. For example, consider a command with
groups defined by ``--group-by country year month`` and target sample size
defined by ``--subsample-max-sequences 100``. If the input contains data from 5
countries over a span of 24 months, that could result in 120 groups.
It is possible to encounter situations where the number of groups exceeds the
target sample size. For example, consider a command with groups defined by
``--group-by region year month`` and target sample size defined by
``--subsample-max-sequences 100``. If the input contains data from 5 regions
over a span of 24 months, that could result in 120 groups.

The only way to target 100 sequences from 120 groups is to apply **probabilistic
sampling** which randomly determines a whole number of sequences per group. This
Expand All @@ -239,11 +290,12 @@ is noted in the output:
possible to have more than the requested maximum of 100 sequences after
filtering.

This is automatically enabled. To force the command to exit with an error in
these situations, use ``--no-probabilistic-sampling``.
This is automatically enabled. ``--no-probabilistic-sampling`` can be used with
uniform sampling to force the command to exit with an error in these situations.
It is always be enabled for weighted sampling.

Caveats
-------
Undersampling
`````````````

For these sampling methods, the number of targeted sequences per group does not
take into account the actual number of sequences available in the input data.
Expand All @@ -257,17 +309,62 @@ Subsampling using multiple ``augur filter`` commands
====================================================

There are some subsampling strategies in which a single call to ``augur filter``
does not suffice. One such strategy is "tiered subsampling". In this strategy,
mutually exclusive sets of filters, each representing a "tier", are sampled with
different subsampling rules. This is commonly used to create geographic tiers.
Consider this subsampling scheme:
does not suffice or is difficult to create. One such strategy is "tiered
subsampling". In this strategy, mutually exclusive sets of filters, each
representing a "tier", are sampled with different subsampling rules. This is
commonly used to create geographic tiers. Consider this subsampling scheme:

Sample 200 sequences from Washington state and 100 sequences from the rest of
the United States.

This can be approximated by first selecting all sequences from the United States
then sampling with these weights:

.. list-table::
:header-rows: 1

* - state
- weight
* - WA
- 200
* - default
- 2.04

.. code-block:: bash

augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "country == 'USA'" \
--group-by state \
--group-by-weights weights.tsv \
--subsample-max-sequences 300 \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv

This approach has some caveats:

Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
1. It relies on a calculation to determine weights, making it less intuitive:

This cannot be done in a single call to ``augur filter``. Instead, it can be
decomposed into multiple schemes, each handled by a single call to ``augur
filter``. Additionally, there is an extra step to combine the intermediate
samples.
.. math::

{n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}}
= 100 * \frac{1}{49}
\approx 1.02

2. Achieving a full *100 sequences from the rest of the United States* requires
at least 2 sequences from each of the remaining states. This may not be
possible if some states are under-sampled.

Intuitiveness for caveat (1) can be improved by adding a comment to the weights
file. However, caveat (2) is an inherent limitation of what is effectively
uniform sampling across all other states. The only way to get around this in
``augur filter`` is **random sampling** across states, but that is not possible
when ``state`` is used as a grouping column.

An alternative approach is to decompose this into multiple schemes, each handled
by a single call to ``augur filter``. Additionally, there is an extra step to
combine the intermediate samples.

1. Sample 100 sequences from Washington state.
2. Sample 50 sequences from the rest of the United States.
Expand Down Expand Up @@ -313,6 +410,12 @@ and ``--include`` to sample the data based on the intermediate strain list
files. If the same strain appears in both files, ``augur filter`` will only
write it once in each of the final outputs.

.. note::

The 2nd sample does not use ``--group-by``, implying **random sampling**
across states. This differs from previous approach that used a single ``augur
filter`` command with weighted sampling.

Generalizing subsampling in a workflow
--------------------------------------

Expand Down
Loading