Support default GISAID metadata and sequences #640

Merged
merged 5 commits into from
May 27, 2021

huddlej
Contributor

@huddlej huddlej commented May 19, 2021

Description of proposed changes

We would like to support GISAID's standard metadata and sequence data formats from the "Download packages" interface of EpiCoV. This PR expands the existing "sanitize" scripts for metadata and sequences with the following major changes to support these data:

  • Remove whitespace from strain names in sequence and metadata (whitespace is not allowed in record ids for FASTA deflines)
  • Strip metadata from FASTA deflines (these metadata entries should be present already in the corresponding separate metadata file)
  • Rename GISAID metadata columns to Augur-style names ("Virus name" -> "strain" and "Collection date" -> "date")
  • Parse region, country, division, and location values from the single Location field in the GISAID metadata
  • Resolve duplicate metadata records by preferring those with the latest GISAID accession number (with an option to produce an error with a list of all duplicates instead)
  • Resolve duplicate sequence records by preferring the first sequence encountered (necessary for downstream components of the workflow when the combine-and-dedup step doesn't get run and duplicates exist in the sequence data)
  • Ingest sequences (.fasta) and metadata (.tsv) files directly from GISAID tarballs (allows users to download data and run the workflow directly on those files without manually extracting/decompressing sequences and metadata)

Currently, these changes are applied to all inputs including data that are not in the GISAID default format.
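The metadata steps in the list above (removing whitespace from strain names, renaming columns, and splitting the Location field) can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the PR's actual code: the function name `sanitize_record`, the " / " separator in Location, and the handling of missing geographic levels are all assumptions.

```python
import re

# The two renames named in the PR description; any other entries would be assumptions.
COLUMN_RENAMES = {"Virus name": "strain", "Collection date": "date"}

# Geographic fields, in the order they are assumed to appear in "Location".
LOCATION_FIELDS = ("region", "country", "division", "location")


def sanitize_record(record):
    """Return a copy of one metadata row (a dict) with Augur-style fields.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    # Rename columns first, because the later steps look up the new names.
    clean = {COLUMN_RENAMES.get(key, key): value for key, value in record.items()}

    # Whitespace is not allowed in FASTA record ids, so remove it from strain names.
    if "strain" in clean:
        clean["strain"] = re.sub(r"\s+", "", clean["strain"])

    # Split "Region / Country / Division / Location" into separate fields.
    # Missing trailing levels become empty strings (an assumption), and the
    # original "Location" value is kept unchanged.
    if "Location" in clean:
        parts = [part.strip() for part in clean["Location"].split("/")]
        for index, field in enumerate(LOCATION_FIELDS):
            clean[field] = parts[index] if index < len(parts) else ""

    return clean
```

Renaming happens before the other steps so that the strain lookup can rely on the Augur-style column name.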

Testing

  • Manually test sanitizer scripts locally with the full GISAID downloads
  • Run a small build with 200 random strains from the full downloads
  • Tested with a full Nextstrain build on a SLURM cluster
  • Tested with AWS Batch trial build

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release -> v7
  • Update docs/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

@emmahodcroft
Member

Apologies if this is a naive question, but is this also stripping out hCoV/ (or whatever the header is) so that the root sequence & exclude files will work properly?
Apologies if this is in there and I just missed it while skimming changes 🙃

@huddlej
Contributor Author

huddlej commented May 20, 2021

@emmahodcroft Yeah, we actually do that as part of the current sanitize scripts based on these config parameters. We had to make that change to get the "Augur input" format downloads to work with our standard include/exclude files and then this PR builds on those existing scripts.

@emmahodcroft
Member

Thanks John! I had thought I'd seen this somewhere before, but I just couldn't remember exactly where. Apologies for the repeat!

@huddlej huddlej force-pushed the support-default-gisaid-metadata branch from ad43a19 to ee3c1f7 Compare May 20, 2021 22:16
@huddlej
Contributor Author

huddlej commented May 20, 2021

This PR implements duplicate resolution for metadata using a procedure described as a possible improvement for Augur's own metadata reader. If this approach to resolving duplicates seems reasonable, we can port this function into Augur.

This PR does not implement duplicate resolution for sequences (at least, not yet!).

@huddlej huddlej force-pushed the support-default-gisaid-metadata branch 2 times, most recently from 5579b71 to 3d315e5 Compare May 24, 2021 21:00
@huddlej
Contributor Author

huddlej commented May 24, 2021

After some further modifications to the workflow, we can now run builds like the following that use GISAID tarballs directly as metadata and sequences inputs:

# Define inputs with preferred sequences/metadata listed last.
inputs:
  - name: north-america
    metadata: data/ncov_north-america.tar.gz
    sequences: data/ncov_north-america.tar.gz
  - name: washington
    metadata: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar
    sequences: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar

# Define builds.
builds:
  washington:
    region: North America
    country: USA
    division: Washington
    subsampling_scheme: focal-contextual

# Define subsampling scheme.
subsampling:
  focal-contextual:
    focal:
      query: --query "division == '{division}'"
      max_sequences: 20
    contextual:
      query: --query "division != '{division}'"
      max_sequences: 20
      group_by: region year month
      priorities:
        type: proximity
        focus: focal

Internally, the sanitize_sequences.py script extracts the first .tsv or .fasta file from a given (compressed or uncompressed) tarball, decompresses that file as needed, and returns the corresponding buffer to be consumed downstream.
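As a rough sketch of that extraction step, the following uses only Python's standard `tarfile` and `lzma` modules. The function name `open_first_match` and its parameters are hypothetical, and unlike a production script it reads the whole member into memory, which would be a poor fit for full GISAID downloads.

```python
import io
import lzma
import tarfile


def open_first_match(tar_path, extensions, as_text=False):
    """Return a file-like object for the first tarball member matching `extensions`.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    # Mode "r:*" lets tarfile detect gzip, bzip2, xz, or no compression
    # on the archive itself.
    with tarfile.open(tar_path, mode="r:*") as tar:
        for member in tar.getmembers():
            name = member.name
            # Members such as "sequences.fasta.xz" are compressed inside the tar.
            inner_xz = name.endswith(".xz")
            base = name[:-3] if inner_xz else name
            if not member.isfile() or not base.endswith(tuple(extensions)):
                continue

            data = tar.extractfile(member).read()
            if inner_xz:
                data = lzma.decompress(data)
            # Text-based parsers need decoded text; others can take raw bytes.
            if as_text:
                return io.StringIO(data.decode("utf-8"))
            return io.BytesIO(data)

    raise FileNotFoundError(f"no member ending in {extensions} found in {tar_path}")
```

A call such as `open_first_match("data/ncov_north-america.tar.gz", [".fasta"], as_text=True)` would then return a text buffer that a FASTA parser could read.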

@huddlej huddlej marked this pull request as ready for review May 24, 2021 23:28
Member

@rneher rneher left a comment

This looks good to me. Thanks John. I left one comment about the pandas.drop_duplicates

scripts/sanitize_metadata.py (review comment, outdated and resolved)
@huddlej huddlej force-pushed the support-default-gisaid-metadata branch 3 times, most recently from e45903e to fba1fac Compare May 27, 2021 15:26
huddlej added 5 commits May 27, 2021 15:04
Adds configuration parameters and new arguments/flags to the sanitize
metadata and sequences scripts to convert default GISAID metadata and
sequences into a format expected by our workflows. The new sanitize
operations for metadata include renaming specific fields, parsing the
single location field into separate geographic scale fields (region,
country, etc.), replacing whitespace in strain names, and resolving
duplicate records.

When resolving duplicate records, the script sorts records by strain
name and all available database accession fields, groups by strain name,
and takes the last record of each group. This approach allows us to
handle cases like GenBank metadata which include the GISAID accession
column but may be missing data for specific records in that column. In
the case where accessions are missing in all columns, this approach
falls back to the sane default of picking the last record per strain.
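
The sort, group, and take-last procedure described above can be sketched in
plain Python as follows. This is an illustration under stated assumptions
rather than the script's own code: the accession column names, the
`resolve_duplicates` name, and the choice to sort missing accessions first are
all hypothetical.

```python
# Hypothetical accession columns; the real script may check a different list.
ACCESSION_FIELDS = ("gisaid_epi_isl", "genbank_accession")


def resolve_duplicates(records, strain_field="strain"):
    """Keep one record per strain, preferring the highest accession values."""

    def sort_key(record):
        # Missing accessions sort first (as empty strings), so a record with
        # an accession is preferred over one without. Accessions are compared
        # as strings, so "EPI_ISL_9" sorts after "EPI_ISL_10".
        return (record[strain_field],) + tuple(
            record.get(field) or "" for field in ACCESSION_FIELDS
        )

    last_by_strain = {}
    # sorted() is stable, so records with identical keys keep their input order.
    for record in sorted(records, key=sort_key):
        # Later records overwrite earlier ones, leaving the last per strain.
        last_by_strain[record[strain_field]] = record

    return list(last_by_strain.values())
```

Because the sort is stable, records with no accessions at all keep their input
order, so the last record per strain wins, as in the fallback case above.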

The new operations applied during metadata sanitization appear in the
script's help in the order that they are applied. The script's usage
text also reflects that the available operations are applied in the
order they appear in the help list. This order of operations is
important because some operations (e.g., renaming fields) change values
that other operations could depend on (e.g., parsing location field).

We sanitize default GISAID sequences by replacing whitespace in strain
names, stripping out additional metadata that appears in the FASTA
defline, and dropping duplicate sequences to avoid errors downstream in
the workflow. This deduplication is especially important when the
workflow runs with a single input and the "combine and dedup" step does
not run on the inputs. This implementation copies the deduplication
logic of that combine and dedup script into a new function that could
eventually be ported into the Augur `io` module.
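
A minimal sketch of those three sequence operations, written without BioPython
so that it is self-contained. The `sanitize_fasta` name and the "|" delimiter
between the strain name and the extra defline metadata are assumptions, and the
sketch holds all sequences in memory, which the real script would likely avoid.

```python
import re


def sanitize_fasta(lines, delimiter="|"):
    """Return {strain name: sequence}, keeping the first sequence per name.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    records = {}
    name = None
    keep = False

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Keep only the text before the first delimiter, then remove
            # whitespace, which is not allowed in FASTA record ids.
            name = re.sub(r"\s+", "", line[1:].split(delimiter)[0])
            # Skip every later record whose cleaned name was already seen.
            keep = name not in records
            if keep:
                records[name] = []
        elif name is not None and keep:
            records[name].append(line)

    return {strain: "".join(chunks) for strain, chunks in records.items()}
```

Deduplication runs on the cleaned names, so two deflines that differ only in
whitespace or trailing metadata count as the same strain.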

Finally, we add support for reading data from GISAID tarballs. GISAID
provides tarballs (e.g., `.tar.gz` or `.tar.xz`) for packages available
through their "Download" interface. These tarballs typically include a
README with the GISAID terms and conditions and a metadata file (`.tsv`)
or a sequence file (`.fasta`). This commit adds a utility function to
look for one of these file types in a tarball and updates the metadata
sanitizer to use this function when a tarball is provided as the
metadata file.

GISAID provides tarballs in different formats including gzip-compressed
tarballs (`.tar.gz`) with uncompressed data inside (`.fasta`) and
uncompressed tarballs (`.tar`) with LZMA-compressed data
inside (`.fasta.xz`). To handle the case of compressed data inside the
tar file, we need to explicitly decompress those data with the Python
LZMA library before trying to process the data. Additionally, sequence
data need to be decoded prior to consumption by BioPython while metadata
can be consumed by pandas without any prior decoding.

Adds checks to the adjust metadata script for exposure columns before attempting
to use those columns. These columns exist in the Nextstrain metadata but not in
GISAID metadata.

Now that the sequence sanitizer script has to handle duplicates anyway,
we no longer need to pipe its output to a separate script that does the
same thing. This commit also pipes output from the sanitizer script to
xz to speed up compression.
@huddlej huddlej force-pushed the support-default-gisaid-metadata branch from fba1fac to e153db3 Compare May 27, 2021 22:07
@huddlej
Contributor Author

huddlej commented May 27, 2021

Based on conversation in Slack, I opted to not standardize column names in the metadata in favor of the simpler approach of manually renaming all columns.

3 participants