Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: simplify NCBI Datasets fields config #19

Merged
merged 1 commit into from
Nov 29, 2023

Conversation

joverlee521
Copy link
Contributor

Description of proposed changes

Instead of hard-coding the list of NCBI Datasets fields in the workflow, just provide the list via the default config. This makes it easy to customize which fields to include and makes it very obvious that field_map config for the the curation pipeline is changing the names of these NCBI fields.

This includes a change in the format_ncbi_dataset_report rule to use the provided fields as the header so that we do not have to do a separate renaming of the NCBI column names back to the computer friendly mneumonics.

Related issue(s)

Prompted by my review of dengue ingest PR nextstrain/dengue#13 (comment)

Checklist

  • Checks pass

Instead of hard-coding the list of NCBI Datasets fields in the workflow,
just provide the list via the default config. This makes it easy to
customize which fields to include and makes it very obvious that
field_map config for the the curation pipeline is changing the names of
these NCBI fields.

This includes a change in the `format_ncbi_dataset_report` rule to use
the provided fields as the header so that we do not have to do a
separate renaming of the NCBI column names back to the computer friendly
mneumonics.
Comment on lines +48 to +63
accession: accession
sourcedb: database
sra-accs: sra_accessions
isolate-lineage: strain
geo-region: region
geo-location: location
isolate-collection-date: date
release-date: date_released
update-date: date_updated
length: length
host-name: host
isolate-lineage-source: sample_type
biosample-acc: biosample_accessions
submitter-names: authors
submitter-affiliation: institution
submitter-country: submitter_country
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will most likely change in the near future when we've resolved #20

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non blocking comment]

I lean toward changing accession: accession to accession: genbank_accession to be specific for NCBI data. Therefore, we can define downstream merging of NCBI and other (or private) data to generate a harmonized accession record ID.

@joverlee521 joverlee521 requested a review from j23414 November 28, 2023 01:38
Copy link
Contributor

@j23414 j23414 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction, since defining NCBI field map for the final column names in the config seems more efficient and explicit than maintaining a separate file (e.g. config/ncbi-field-map.tsv).

@j23414 j23414 mentioned this pull request Nov 28, 2023
2 tasks
@joverlee521 joverlee521 marked this pull request as ready for review November 29, 2023 23:15
@joverlee521 joverlee521 merged commit 87a1204 into main Nov 29, 2023
@joverlee521 joverlee521 deleted the simplify-ncbi-fields branch November 29, 2023 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants