Standardize output metadata fields for Nextstrain ingest #20

Open
joverlee521 opened this issue Nov 22, 2023 · 3 comments
@joverlee521
Contributor

Originally discussed in Slack

In reviewing dengue ingest, I realized we don't have any central documentation on the standard Nextstrain metadata fields. There are currently SARS-CoV-2 docs on its metadata fields, but they have a lot of SC2/GISAID-specific language. (As they should, since that's what they were written for!)
Should we add a metadata section to our top-level Data formats page that focuses on the standard fields we use from public data/NCBI? I think that, along with standardizing ingest, we should also have a set of standard metadata fields that are expected for all of our public pathogen metadata. Each pathogen will most likely have additional pathogen-specific fields on top of the standard fields, and those would be documented within the individual pathogen repos.


Next steps

  1. Compare the fields used in our current public metadata TSVs and propose a list of standardized fields.
     • ncov - s3://nextstrain-data/files/ncov/open/metadata.tsv.zst
     • mpox - s3://nextstrain-data/files/workflows/mpox/metadata.tsv.gz
     • rsv - s3://nextstrain-data/files/workflows/rsv/a/metadata.tsv.gz
  2. Define schema for the metadata TSV. @tsibley suggested that this can be a constrained form of JSON Schema or we can look into existing tabular schemas.
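
As a rough illustration of the comparison step, here is a minimal Python sketch that finds the columns shared by every metadata TSV and the extras that are specific to each dataset. The header lines below are hypothetical placeholders, not the real columns of the files listed above, and `read_columns` / `compare_columns` are names made up for this example; the real headers would need to be read from the decompressed S3 files.

```python
import csv
import io


def read_columns(tsv_text: str) -> list[str]:
    """Return the header fields of a TSV supplied as text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader)


def compare_columns(
    headers: dict[str, list[str]],
) -> tuple[list[str], dict[str, list[str]]]:
    """Split columns into those shared by all datasets and per-dataset extras."""
    column_sets = [set(cols) for cols in headers.values()]
    shared = set.intersection(*column_sets)
    extras = {name: sorted(set(cols) - shared) for name, cols in headers.items()}
    return sorted(shared), extras


# Hypothetical header lines for illustration only (NOT the real file contents).
headers = {
    "ncov": read_columns("strain\tdate\tcountry\tgenbank_accession\tpango_lineage\n"),
    "mpox": read_columns("strain\tdate\tcountry\taccession\tclade\n"),
    "rsv": read_columns("strain\tdate\tcountry\tgenbank_accession\tclade\n"),
}

shared, extras = compare_columns(headers)
print("candidate standard fields:", shared)
for name, cols in extras.items():
    print(f"{name}-specific or inconsistent fields:", cols)
```

Columns that land in the per-dataset extras under different names (such as an accession column) would be the ones to discuss when proposing the standardized list.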
@joverlee521 joverlee521 transferred this issue from nextstrain/ingest Nov 28, 2023
@j23414 j23414 mentioned this issue Nov 28, 2023
@joverlee521
Contributor Author

As @j23414 commented in a separate PR, we've had internal discussions on whether the NCBI accession column should be accession or genbank_accession.

The general consensus in a previous dev chat was that it's better to be more specific and use genbank_accession. However, if the data needs to be merged with private data or data from other sources, then there also needs to be a general accession column.
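
One way to picture keeping both columns is the sketch below, which is only an illustration under assumptions and not an agreed design: NCBI records keep genbank_accession, and every record also gets a general accession column filled from whichever source-specific ID it has. The `lab_id` column, the example values, and the `add_general_accession` helper are all hypothetical.

```python
def add_general_accession(record: dict[str, str], source_column: str) -> dict[str, str]:
    """Return a copy of the record with a general `accession` column.

    The value is copied from `source_column`, the source-specific ID column.
    """
    merged = dict(record)
    merged["accession"] = record[source_column]
    return merged


# Hypothetical records for illustration only.
public = [{"genbank_accession": "EXAMPLE0001", "strain": "example/public/1"}]
private = [{"lab_id": "LAB-0001", "strain": "example/private/1"}]

merged = [add_general_accession(r, "genbank_accession") for r in public]
merged += [add_general_accession(r, "lab_id") for r in private]

# Give every row the same columns, leaving missing source-specific IDs empty.
columns = sorted({key for row in merged for key in row})
merged = [{col: row.get(col, "") for col in columns} for row in merged]

for row in merged:
    print(row)
```

With this layout the specific column stays available for NCBI records, while the general accession column gives merged datasets a single ID to key on.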

@joverlee521
Contributor Author

It was brought up in Nextstrain office hours today that we should consider standardizing with NCBI standards in mind, specifically the One Health Enteric Metadata standards.

@huddlej also pointed to the PHA4GE SARS-CoV-2 metadata standards as another potential reference.

@j23414
Contributor

j23414 commented Dec 5, 2024

Feel free to drastically edit text for clarity, I'm just jotting down notes here

During office hours today, it was brought up that "genbank_accession" is automatically detected in Auspice and used to generate a GenBank URL for the node callout. However, "accession" is not detected and does not get a valid link in the node callout. When spiking in non-GenBank records (e.g. USVI for Zika), we usually generate a url column in the metadata to get a valid link in the node callout.

It was proposed to generate a URL column during the ingest workflow, which may then work as-is in the node callout and make it easier to merge with non-GenBank records.
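
A minimal sketch of what generating that column could look like, assuming the ingest metadata is handled as one dict per record. The `add_url` helper and the example values are hypothetical, the URL prefix is assumed to be the NCBI Nucleotide record page, and whether Auspice picks up the resulting column as-is would still need to be checked.

```python
# Assumed link target: the NCBI Nucleotide page for a GenBank accession.
GENBANK_URL_PREFIX = "https://www.ncbi.nlm.nih.gov/nuccore/"


def add_url(record: dict[str, str]) -> dict[str, str]:
    """Return a copy of the record with a `url` column.

    An existing non-empty `url` (e.g. from a spiked-in non-GenBank record)
    is kept; otherwise the URL is built from `genbank_accession` if present.
    """
    out = dict(record)
    if out.get("url"):
        return out
    accession = out.get("genbank_accession", "")
    out["url"] = GENBANK_URL_PREFIX + accession if accession else ""
    return out


# Hypothetical records for illustration only.
records = [
    {"strain": "example/public/1", "genbank_accession": "EXAMPLE0001"},
    {"strain": "example/private/1", "url": "https://example.org/records/1"},
    {"strain": "example/unknown/1"},
]

for row in map(add_url, records):
    print(row["strain"], row["url"])
```

Keeping an existing url untouched is what would let non-GenBank records be merged in without special handling.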
