Support default GISAID metadata and sequences #640
Apologies if this is a naive question, is this also stripping out |
@emmahodcroft Yeah, we actually do that as part of the current sanitize scripts based on these config parameters. We had to make that change to get the "Augur input" format downloads to work with our standard include/exclude files and then this PR builds on those existing scripts.
Thanks John! I had thought I'd seen this somewhere before, but I just couldn't remember exactly where. Apologies for the repeat!
This PR implements duplicate resolution for metadata using a procedure described as a possible improvement for Augur's own metadata reader. If this approach to resolving duplicates seems reasonable, we can port this function into Augur. This PR does not implement duplicate resolution for sequences (at least, not yet!).
After some further modifications to the workflow, we can now run builds like the following that use GISAID tarballs directly as metadata and sequences inputs:

```yaml
# Define inputs with preferred sequences/metadata listed last.
inputs:
  - name: north-america
    metadata: data/ncov_north-america.tar.gz
    sequences: data/ncov_north-america.tar.gz
  - name: washington
    metadata: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar
    sequences: data/gisaid_auspice_input_hcov-19_2021_05_24_21.tar

# Define builds.
builds:
  washington:
    region: North America
    country: USA
    division: Washington
    subsampling_scheme: focal-contextual

# Define subsampling scheme.
subsampling:
  focal-contextual:
    focal:
      query: --query "division == '{division}'"
      max_sequences: 20
    contextual:
      query: --query "division != '{division}'"
      max_sequences: 20
      group_by: region year month
      priorities:
        type: proximity
        focus: focal
```

Internally, the
This looks good to me. Thanks John. I left one comment about the pandas.drop_duplicates
Adds configuration parameters and new arguments/flags to the sanitize metadata and sequences scripts to convert default GISAID metadata and sequences into a format expected by our workflows.

The new sanitize operations for metadata include renaming specific fields, parsing the single location field into separate geographic scale fields (region, country, etc.), replacing whitespace in strain names, and resolving duplicate records. When resolving duplicate records, the script sorts records by strain name and all available database accession fields, groups by strain name, and takes the last record of each group. This approach allows us to handle cases like GenBank metadata, which include the GISAID accession column but may be missing data for specific records in that column. When accessions are missing in all columns, this approach falls back to picking the last record per strain.

The script's help lists the metadata sanitization operations in the order that they are applied, and the usage text states this explicitly. This order of operations is important because some operations (e.g., renaming fields) change values that other operations could depend on (e.g., parsing the location field).

We sanitize default GISAID sequences by replacing whitespace in strain names, stripping out additional metadata that appears in the FASTA defline, and dropping duplicate sequences to avoid errors downstream in the workflow. This deduplication is especially important when the workflow runs with a single input and the "combine and dedup" step does not run on the inputs. This implementation copies the deduplication logic of that combine and dedup script into a new function that could eventually be ported into the Augur `io` module.

Finally, we add support for reading data from GISAID tarballs.
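The duplicate resolution for metadata described above (sort, group by strain, take the last record) can be sketched in plain Python. This is only an illustration and not the script's actual code, which works on pandas data frames. The function name `resolve_duplicates`, the column names `strain`, `gisaid_epi_isl`, and `genbank_accession`, and the choice to sort missing accessions first are all assumptions made for this example.

```python
def resolve_duplicates(records, strain_field="strain",
                       accession_fields=("gisaid_epi_isl", "genbank_accession")):
    """Return one record per strain: sort, group by strain, keep the last record.

    Hypothetical sketch; field names are assumptions, not the script's defaults.
    """
    def sort_key(record):
        # Assumption: missing accessions become empty strings, so they sort
        # before real accessions and a record with an accession is preferred.
        accessions = tuple(record.get(field) or "" for field in accession_fields)
        return (record[strain_field],) + accessions

    last_by_strain = {}
    # sorted() is stable, so records with identical keys keep their input order.
    for record in sorted(records, key=sort_key):
        # Later records overwrite earlier ones, leaving the last per strain.
        last_by_strain[record[strain_field]] = record

    return list(last_by_strain.values())


records = [
    {"strain": "B", "gisaid_epi_isl": "EPI_ISL_2"},
    {"strain": "A", "gisaid_epi_isl": ""},
    {"strain": "A", "gisaid_epi_isl": "EPI_ISL_1"},
    {"strain": "B", "gisaid_epi_isl": "EPI_ISL_3"},
]
resolved = resolve_duplicates(records)
```

When no accession values are present at all, every record for a strain has the same sort key, so the stable sort preserves input order and the last record in the file wins.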
GISAID provides tarballs (e.g., `.tar.gz` or `.tar.xz`) for packages available through their "Download" interface. These tarballs typically include a README with the GISAID terms and conditions and a metadata file (`.tsv`) or a sequence file (`.fasta`). This commit adds a utility function to look for one of these file types in a tarball and updates the metadata sanitizer to use this function when a tarball is provided as the metadata file. GISAID provides tarballs in different formats including gzip-compressed tarballs (`.tar.gz`) with uncompressed data inside (`.fasta`) and uncompressed tarballs (`.tar`) with LZMA-compressed data inside (`.fasta.xz`). To handle the case of compressed data inside the tar file, we need to explicitly decompress those data with the Python LZMA library before trying to process the data. Additionally, sequence data need to be decoded prior to consumption by BioPython while metadata can be consumed by pandas without any prior decoding.
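As a rough sketch of the tarball handling described above, the following uses only the Python standard library (`tarfile`, `lzma`, `io`). The helper name `open_tar_member` and its parameters are hypothetical and are not the utility function added in this PR. For simplicity it decodes every matching member to text, whereas the commit notes that only sequence data need decoding.

```python
import io
import lzma
import tarfile
from contextlib import contextmanager


@contextmanager
def open_tar_member(tar_path, extensions=(".tsv", ".fasta")):
    """Yield a text handle for the first tarball member with a matching extension.

    Hypothetical helper for illustration, not the PR's actual function.
    """
    # The default mode ("r") transparently handles .tar, .tar.gz, and .tar.xz.
    with tarfile.open(tar_path) as tar:
        for member in tar:
            name = member.name
            # Data inside the tarball may itself be LZMA-compressed (e.g., .fasta.xz).
            inner_xz = name.endswith(".xz")
            if inner_xz:
                name = name[: -len(".xz")]
            if member.isfile() and name.endswith(tuple(extensions)):
                handle = tar.extractfile(member)
                if inner_xz:
                    # Explicitly decompress the inner data before reading it.
                    handle = lzma.open(handle)
                # Decode bytes to text so FASTA/TSV parsers receive strings.
                yield io.TextIOWrapper(handle, encoding="utf-8")
                return
    raise FileNotFoundError(f"No member of {tar_path} ends with one of {extensions}")
```

A caller would use it as `with open_tar_member(path, (".tsv",)) as handle:` and pass `handle` to whatever parser reads the metadata or sequences.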
Adds checks to the adjust metadata script for exposure columns before attempting to use those columns. These columns exist in the Nextstrain metadata but not in GISAID metadata.
Now that the sequence sanitizer script has to handle duplicates anyway, we no longer need to pipe its output to a separate script that does the same thing. This commit also pipes output from the sanitizer script to xz to speed up compression.
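To make the sequence sanitizing steps concrete (strip defline metadata, replace whitespace in strain names, drop duplicates), here is a minimal sketch over `(defline, sequence)` pairs. It is not the sanitizer script itself. The function name, the `|` defline delimiter, the empty-string whitespace replacement, and the "first record wins" rule are assumptions chosen for illustration.

```python
import re


def sanitize_sequences(records, defline_delimiter="|", whitespace_replacement=""):
    """Yield (strain, sequence) pairs with cleaned, unique strain names.

    Hypothetical sketch; the delimiter and replacement values are assumptions.
    """
    seen = set()
    for defline, sequence in records:
        # Keep only the strain name and drop extra fields after the delimiter.
        strain = defline.split(defline_delimiter)[0]
        # Replace runs of whitespace inside the strain name.
        strain = re.sub(r"\s+", whitespace_replacement, strain.strip())
        if strain in seen:
            continue  # assumption: the first record for a strain wins
        seen.add(strain)
        yield strain, sequence


records = [
    ("hCoV-19/USA/WA 1/2020|EPI_ISL_1|2020-01-19", "ACGT"),
    ("hCoV-19/USA/WA1/2020", "ACGA"),
    ("hCoV-19/USA/CA-2/2020|EPI_ISL_2|2020-02-01", "TTTT"),
]
sanitized = list(sanitize_sequences(records))
```

Because the function is a generator, its output can be streamed straight to a writer or compressor without holding all sequences in memory.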
Based on conversation in Slack, I opted to not standardize column names in the metadata in favor of the simpler approach of manually renaming all columns.
Description of proposed changes
We would like to support GISAID's standard metadata and sequence data formats from the "Download packages" interface of EpiCoV. This PR expands the existing "sanitize" scripts for metadata and sequences with the following major changes to support these data:
- Parse `region`, `country`, `division`, and `location` values from the single `Location` field in the GISAID metadata.
- Read sequence (`.fasta`) and metadata (`.tsv`) files directly from GISAID tarballs (allows users to download data and run the workflow directly on those files without manually extracting/decompressing sequences and metadata).

Currently, these changes are applied to all inputs including data that are not in the GISAID default format.
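Splitting the single `Location` field into separate geographic fields might look like the sketch below. It assumes the field uses `/` as a separator (for example `North America / USA / Washington / King County`) and pads missing levels with an empty string. Both choices, along with the function name `parse_location`, are assumptions for illustration rather than the script's confirmed behavior.

```python
def parse_location(location, fields=("region", "country", "division", "location"),
                   separator="/"):
    """Split a single location string into one value per geographic field.

    Hypothetical sketch; the separator and the padding value are assumptions.
    """
    # Split into at most len(fields) parts so extra separators stay in the last field.
    parts = [part.strip() for part in location.split(separator, len(fields) - 1)]
    # Pad missing trailing levels (e.g., no division or location given).
    parts += [""] * (len(fields) - len(parts))
    return dict(zip(fields, parts))


full = parse_location("North America / USA / Washington / King County")
partial = parse_location("Europe / Switzerland")
```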
Testing
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
v7
- Update `docs/change_log.md` in this pull request to document these changes and the new version number.