Fixup: Add date annotations for rare genotypes #38

Open
kimandrews wants to merge 1 commit into main from add-rare-genotype-annotations

Conversation

kimandrews
Contributor

@kimandrews kimandrews commented Jun 10, 2024

Description of proposed changes

This PR adds collection dates to the ingest metadata output for six samples.

These samples were force-included in the Nextclade dataset tree to increase the representation of rare genotypes in the tree. However, these samples have empty date fields in the metadata output from NCBI Datasets. This results in the samples being removed by the TreeTime clock filter.

Fortunately, the NCBI metadata includes strain names for these six samples, and the collection dates can be extracted from the strain names.

This PR adds the collection dates for the six samples, extracted manually from the strain names, to ingest/defaults/annotations.tsv. As a result, the ingest metadata output includes the collection dates, and TreeTime keeps the samples in the Nextclade dataset tree.
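
For illustration only, the manual extraction could be sketched in Python roughly as follows. This is not the method used in this PR (the dates were extracted by hand). The strain-name layout (`.../<week>.<2-digit year>/`), the accession `XX000000`, the strain name, the three-column row layout (accession, field name, value), and the field name `date` are all assumptions made for the example. The week is treated as an ISO week and mapped to its Monday.

```python
import re
from datetime import date

# Assumed layout: a "<week>.<2-digit year>" token between slashes (hypothetical).
WEEK_YEAR = re.compile(r"/(\d{1,2})\.(\d{2})(?:/|$)")


def annotation_row(accession: str, strain: str, century: int = 2000) -> str:
    """Build one 'accession<TAB>date<TAB>YYYY-MM-DD' row from a strain name."""
    match = WEEK_YEAR.search(strain)
    if match is None:
        raise ValueError(f"no week.year token found in {strain!r}")
    week = int(match.group(1))
    year = century + int(match.group(2))  # assumes the caller knows the century
    # Monday (ISO weekday 1) of the given ISO week.
    monday = date.fromisocalendar(year, week, 1)
    return "\t".join([accession, "date", monday.isoformat()])


# Hypothetical accession and strain name, for illustration only.
print(annotation_row("XX000000", "MVs/Exampleville.XXX/10.24/"))
```

The two-digit year makes the century ambiguous, so the sketch takes it as a parameter rather than guessing.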

Related issue(s)

#28

Checklist

  • Checks pass

@kimandrews kimandrews requested a review from a team June 10, 2024 22:37
Six of the samples that are force-included in the Nextclade dataset tree have empty collection date fields in the metadata output from NCBI Datasets. This results in the samples being removed downstream by the TreeTime clock filter. This commit adds collection dates (which were manually extracted from the strain names in the NCBI metadata) for these samples so that they will be included in the Nextclade dataset tree.
@kimandrews kimandrews force-pushed the add-rare-genotype-annotations branch from cd15009 to 7c2776d Compare June 14, 2024 17:51
#
# Strains with rare genotypes
# Dates are retrieved from epi-weeks reported within strain names on NCBI
# Dates are defined as the first day of the epi-week
Contributor

non-blocking

"first day" is somewhat ambiguous — could be Sunday, could be Monday… Better be explicit.

Suggested change
- # Dates are defined as the first day of the epi-week
+ # Dates are defined as the Monday of the epi-week

Contributor Author

I agree this needs to be more explicit. There are many different definitions of epi-weeks, so the most precise wording for what I did would be "Dates are defined as the first day of the ISO epi-week, which is always a Monday". I can add this info to the annotations.tsv file. It may also be worth discussing whether there is a better approach for defining dates from the epi-weeks reported in measles strain names. I started a discussion about this in Slack.
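
To make the ambiguity concrete, here is a small Python sketch comparing two week definitions. The ISO function uses the standard library directly. The Sunday-start function is an assumption added for contrast: it follows the MMWR-style rule (weeks run Sunday to Saturday, and week 1 is the week containing January 4), which is one of several non-ISO schemes and is not mentioned elsewhere in this PR. The function names are hypothetical.

```python
from datetime import date, timedelta


def iso_week_start(year: int, week: int) -> date:
    """Monday that opens ISO week `week` of `year`."""
    return date.fromisocalendar(year, week, 1)


def sunday_week_start(year: int, week: int) -> date:
    """Sunday that opens week `week` under a Sunday-start scheme
    whose week 1 is the week containing January 4 (MMWR-style)."""
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # weekday(): Mon=0 .. Sun=6
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


print(iso_week_start(2024, 10))     # 2024-03-04 (a Monday)
print(sunday_week_start(2024, 10))  # 2024-03-03 (a Sunday)
```

For week 10 of 2024 the two definitions differ by one day, which is why the comment in annotations.tsv should name the definition it uses.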

3 participants