diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index e6d642d..e9dc77f 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -53,6 +53,7 @@ Selection method: * - Subsampling - * ``--subsample-max-sequences`` * ``--group-by`` + * ``--group-by-weights`` * ``--sequences-per-group`` * ``--probabilistic-sampling`` * ``--no-probabilistic-sampling`` @@ -158,10 +159,11 @@ Subsampling Another common filtering operation is **subsampling**: selection of data using rules based on output size rather than individual sequence attributes. These are -the sampling methods supported by ``augur filter`` and a final section for caveats: +the sampling methods supported by ``augur filter``: .. contents:: :local: + :depth: 2 Random sampling --------------- @@ -181,14 +183,25 @@ For example, limit the output to 100 sequences: --output-metadata subsampled_metadata.tsv Random sampling is easy to define but can expose sampling bias in some datasets. -Consider uniform sampling to reduce sampling bias. +Consider using grouped sampling to reduce sampling bias. -Uniform sampling +Grouped sampling ---------------- ``--group-by`` allows you to partition the data into groups based on column -values and sample uniformly. For example, sample evenly across countries over -time: +values and sample a number of sequences per group. + +Grouped sampling can be further divided into two types with a final section for +caveats: + +.. contents:: + :local: + +Uniform sampling +~~~~~~~~~~~~~~~~ + +By default (i.e. without ``--group-by-weights``), ``--group-by`` will sample +uniformly across groups. For example, sample evenly across regions over time: .. code-block:: bash @@ -197,7 +210,7 @@ time: --metadata data/metadata.tsv \ --min-date 2012 \ --exclude exclude.txt \ - --group-by country year month \ + --group-by region year month \ --subsample-max-sequences 100 \ --output-sequences subsampled_sequences.fasta \ --output-metadata subsampled_metadata.tsv @@ -205,7 +218,7 @@ time: An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``. This is useful if you care less about total sample size and more about having a fixed number of sequences from each group. For example, target one sequence -per month from each country: +per month from each region: .. code-block:: bash @@ -214,19 +227,57 @@ per month from each country: --metadata data/metadata.tsv \ --min-date 2012 \ --exclude exclude.txt \ - --group-by country year month \ + --group-by region year month \ --sequences-per-group 1 \ --output-sequences subsampled_sequences.fasta \ --output-metadata subsampled_metadata.tsv +Weighted sampling +~~~~~~~~~~~~~~~~~ + +``--group-by-weights`` can be specified in addition to ``--group-by`` to allow +different target sizes per group. For example, target twice the amount of +sequences from Asia compared to other regions using this ``weights.tsv`` file: + +.. list-table:: + :header-rows: 1 + + * - region + - weight + * - Asia + - 2 + * - default + - 1 + +and command: + +.. code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --exclude exclude.txt \ + --group-by region year month \ + --group-by-weights weights.tsv \ + --subsample-max-sequences 100 \ + --output-sequences subsampled_sequences.fasta \ + --output-metadata subsampled_metadata.tsv + +The weights file format is described in ``augur filter`` docs for +``--group-by-weights``. + +Caveats +~~~~~~~ + Probabilistic sampling ----------------------- +`````````````````````` -It is possible to encounter situations in uniform sampling where the number of -groups exceeds the target sample size. For example, consider a command with -groups defined by ``--group-by country year month`` and target sample size -defined by ``--subsample-max-sequences 100``. If the input contains data from 5 -countries over a span of 24 months, that could result in 120 groups. +It is possible to encounter situations where the number of groups exceeds the +target sample size. For example, consider a command with groups defined by +``--group-by region year month`` and target sample size defined by +``--subsample-max-sequences 100``. If the input contains data from 5 regions +over a span of 24 months, that could result in 120 groups. The only way to target 100 sequences from 120 groups is to apply **probabilistic sampling** which randomly determines a whole number of sequences per group. This @@ -239,11 +290,12 @@ is noted in the output: possible to have more than the requested maximum of 100 sequences after filtering. -This is automatically enabled. To force the command to exit with an error in -these situations, use ``--no-probabilistic-sampling``. +This is automatically enabled. ``--no-probabilistic-sampling`` can be used with +uniform sampling to force the command to exit with an error in these situations. +It is always be enabled for weighted sampling. -Caveats -------- +Undersampling +````````````` For these sampling methods, the number of targeted sequences per group does not take into account the actual number of sequences available in the input data. @@ -257,17 +309,62 @@ Subsampling using multiple ``augur filter`` commands ==================================================== There are some subsampling strategies in which a single call to ``augur filter`` -does not suffice. One such strategy is "tiered subsampling". In this strategy, -mutually exclusive sets of filters, each representing a "tier", are sampled with -different subsampling rules. This is commonly used to create geographic tiers. -Consider this subsampling scheme: +does not suffice or is difficult to create. One such strategy is "tiered +subsampling". In this strategy, mutually exclusive sets of filters, each +representing a "tier", are sampled with different subsampling rules. This is +commonly used to create geographic tiers. Consider this subsampling scheme: + + Sample 200 sequences from Washington state and 100 sequences from the rest of + the United States. + +This can be approximated by first selecting all sequences from the United States +then sampling with these weights: + +.. list-table:: + :header-rows: 1 + + * - state + - weight + * - WA + - 200 + * - default + - 2.04 + +.. code-block:: bash + + augur filter \ + --sequences sequences.fasta \ + --metadata metadata.tsv \ + --query "country == 'USA'" \ + --group-by state \ + --group-by-weights weights.tsv \ + --subsample-max-sequences 300 \ + --output-sequences subsampled_sequences.fasta \ + --output-metadata subsampled_metadata.tsv + +This approach has some caveats: - Sample 100 sequences from Washington state and 50 sequences from the rest of the United States. +1. It relies on a calculation to determine weights, making it less intuitive: -This cannot be done in a single call to ``augur filter``. Instead, it can be -decomposed into multiple schemes, each handled by a single call to ``augur -filter``. Additionally, there is an extra step to combine the intermediate -samples. + .. math:: + + {n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}} + = 100 * \frac{1}{49} + \approx 1.02 + +2. Achieving a full *100 sequences from the rest of the United States* requires + at least 2 sequences from each of the remaining states. This may not be + possible if some states are under-sampled. + +Intuitiveness for caveat (1) can be improved by adding a comment to the weights +file. However, caveat (2) is an inherent limitation of what is effectively +uniform sampling across all other states. The only way to get around this in +``augur filter`` is **random sampling** across states, but that is not possible +when ``state`` is used as a grouping column. + +An alternative approach is to decompose this into multiple schemes, each handled +by a single call to ``augur filter``. Additionally, there is an extra step to +combine the intermediate samples. 1. Sample 100 sequences from Washington state. 2. Sample 50 sequences from the rest of the United States. @@ -313,6 +410,12 @@ and ``--include`` to sample the data based on the intermediate strain list files. If the same strain appears in both files, ``augur filter`` will only write it once in each of the final outputs. +.. note:: + + The 2nd sample does not use ``--group-by``, implying **random sampling** + across states. This differs from previous approach that used a single ``augur + filter`` command with weighted sampling. + Generalizing subsampling in a workflow --------------------------------------