read_metadata: Provide an interface to resolve duplicate strain names in metadata #725

huddlej · 2021-05-20T16:03:55Z

Context
Historically, metadata for Augur inputs were curated to exclude duplicate strains (e.g., by resolving duplicates in fauna during download from the database) and, as such, read_metadata was designed to throw an error when it found duplicate strains. However, GenBank has supported multiple versions of the same strain's sequence for a long time and GISAID recently added similar support. This means a valid metadata download from either of these sources can produce an error when Augur tries to read these data.

Description
Augur should provide an interface to resolve duplicate strain names in the metadata instead of throwing an error. We should retain the option to throw an error on duplicates, but we should also consider making duplicate resolution the default behavior.

Examples

To test the current issue, create some minimal metadata with a duplicated strain:

cut -f 1,5 data/example_metadata.tsv | head -n 4 | sed 's/VIC1008/VIC1000/' > duplicate_metadata.tsv

Then, try to load the data from a Python terminal:

>>> from augur.utils import read_metadata
>>> read_metadata("duplicate_metadata.tsv")
Traceback (most recent call last):
  File "<ipython-input-5-a6d05e39306b>", line 1, in <module>
    read_metadata("duplicate_metadata.tsv")
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 74, in read_metadata
    return MetadataFile(fname, query).read()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 21, in read
    self.check_metadata_duplicates()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 63, in check_metadata_duplicates
    raise ValueError(
ValueError: Duplicated strain in metadata: Australia/VIC1000/2020

Possible solution

Ideally, the solution to this issue will not require the user to do anything by default; the solution should allow users with duplicate strains to resolve these duplicates automatically.

To resolve duplicate GISAID or GenBank records, we want to prefer the record with the most recent database accession/id. We currently annotate accessions for GISAID and GenBank in our ncov workflow as gisaid_epi_isl and genbank_accession, respectively. One possible solution could then be:

1. Check for duplicates.
2. If there are no duplicates, continue.
3. If there are duplicates, check an Augur config variable (either a global environment variable or a config file) that controls whether duplicates should throw an error, and throw an error if so configured.
4. If there are duplicates and no error is to be thrown, check for one of our predefined accession columns (gisaid_epi_isl and genbank_accession to start).
   - If an accession column exists, sort records by strain and accession in ascending order and take the last record (or sort in descending order and take the first).
   - If an accession column does not exist, sort records by strain and take the first record.
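
The steps above can be sketched roughly as follows. This is only an illustration of the proposal, not Augur code: the function name resolve_duplicate_strains, the environment variable AUGUR_ERROR_ON_DUPLICATE_STRAINS, and the list-of-dicts record format (as produced by csv.DictReader) are all assumptions made for the example. Accessions are compared as plain strings here, so "EPI_ISL_99" would sort after "EPI_ISL_100"; a real implementation would probably need to compare the numeric part.

```python
import os

# Accession columns named in the proposal, checked in this order.
ACCESSION_COLUMNS = ("gisaid_epi_isl", "genbank_accession")

# Hypothetical config variable; Augur does not define this name.
ERROR_ENV_VAR = "AUGUR_ERROR_ON_DUPLICATE_STRAINS"


def resolve_duplicate_strains(records, strain_column="strain", error_on_duplicates=None):
    """Return one record per strain from a list of dicts (e.g., csv.DictReader rows)."""
    if error_on_duplicates is None:
        # Fall back to the (hypothetical) environment variable.
        error_on_duplicates = os.environ.get(ERROR_ENV_VAR, "") == "1"

    # Group records by strain, keeping the input order within each group.
    by_strain = {}
    for record in records:
        by_strain.setdefault(record[strain_column], []).append(record)

    duplicated = sorted(s for s, group in by_strain.items() if len(group) > 1)
    if not duplicated:
        # No duplicates: continue with the records unchanged.
        return list(records)

    if error_on_duplicates:
        raise ValueError(f"Duplicated strain in metadata: {', '.join(duplicated)}")

    # Use the first predefined accession column that is present, if any.
    columns = records[0].keys()
    accession_column = next((c for c in ACCESSION_COLUMNS if c in columns), None)

    resolved = []
    for strain in sorted(by_strain):
        group = by_strain[strain]
        if accession_column is not None:
            # Sort ascending by accession (string comparison) and take the last record.
            group = sorted(group, key=lambda r: r[accession_column])
            resolved.append(group[-1])
        else:
            # No accession column: take the first record for the strain.
            resolved.append(group[0])
    return resolved
```

Passing error_on_duplicates=True keeps the current behavior of throwing an error, while the default resolves duplicates without any action from the user.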


victorlin · 2024-08-27T19:04:41Z

Closing this as not planned per #810 (comment)

The root issue here (strain column in metadata having duplicates by design) has since been addressed by the id_columns parameter (initially as valid_index_cols in b78f015) which allows other columns with unique values to be used as the index column.
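
To illustrate the idea behind that approach, here is a minimal standard-library sketch, not Augur's implementation: it indexes metadata by the first candidate column that is present, so a column with unique values (such as an accession) can be listed ahead of strain. The function name index_metadata and the example rows are made up for illustration.

```python
import csv
import io


def index_metadata(tsv_text, id_columns=("strain", "name")):
    """Index TSV rows by the first column in id_columns found in the header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []

    # Pick the first candidate id column that exists in the metadata.
    id_column = next((c for c in id_columns if c in header), None)
    if id_column is None:
        raise ValueError(f"None of the id columns {id_columns} were found in the metadata")

    index = {}
    for row in reader:
        key = row[id_column]
        if key in index:
            # Duplicates in the chosen id column are still an error.
            raise ValueError(f"Duplicated {id_column} in metadata: {key}")
        index[key] = row
    return id_column, index


# Hypothetical metadata: the strain is duplicated, but the accession is unique.
example = (
    "strain\tgisaid_epi_isl\n"
    "Australia/VIC1000/2020\tEPI_ISL_1\n"
    "Australia/VIC1000/2020\tEPI_ISL_2\n"
)
id_column, index = index_metadata(example, id_columns=("gisaid_epi_isl", "strain"))
```

With the default id_columns, the same input raises an error on the duplicated strain; listing the accession column first avoids it because every row has a distinct key.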

huddlej added the enhancement label May 20, 2021

huddlej mentioned this issue May 20, 2021

Support default GISAID metadata and sequences nextstrain/ncov#640

Merged

14 tasks

victorlin mentioned this issue Aug 13, 2024

read_metadata: Allow graceful handling of duplicate strain names #810

Closed

victorlin changed the title ~~Provide an interface to resolve duplicate strain names in metadata~~ read_metadata: Provide an interface to resolve duplicate strain names in metadata Aug 27, 2024

victorlin closed this as not planned Aug 27, 2024
