-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ingest: simplify NCBI Datasets fields config #19
Conversation
Instead of hard-coding the list of NCBI Datasets fields in the workflow, just provide the list via the default config. This makes it easy to customize which fields to include and makes it very obvious that field_map config for the the curation pipeline is changing the names of these NCBI fields. This includes a change in the `format_ncbi_dataset_report` rule to use the provided fields as the header so that we do not have to do a separate renaming of the NCBI column names back to the computer friendly mneumonics.
accession: accession | ||
sourcedb: database | ||
sra-accs: sra_accessions | ||
isolate-lineage: strain | ||
geo-region: region | ||
geo-location: location | ||
isolate-collection-date: date | ||
release-date: date_released | ||
update-date: date_updated | ||
length: length | ||
host-name: host | ||
isolate-lineage-source: sample_type | ||
biosample-acc: biosample_accessions | ||
submitter-names: authors | ||
submitter-affiliation: institution | ||
submitter-country: submitter_country |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will most likely change in the near future when we've resolved #20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[non blocking comment]
I lean toward changing accession: accession
to accession: genbank_accession
to be specific for NCBI data. Therefore, we can define downstream merging of NCBI and other (or private) data to generate a harmonized accession
record ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this direction, since defining NCBI field map for the final column names in the config seems more efficient and explicit than maintaining a separate file (e.g. config/ncbi-field-map.tsv
).
Description of proposed changes
Instead of hard-coding the list of NCBI Datasets fields in the workflow, just provide the list via the default config. This makes it easy to customize which fields to include and makes it very obvious that field_map config for the the curation pipeline is changing the names of these NCBI fields.
This includes a change in the
format_ncbi_dataset_report
rule to use the provided fields as the header so that we do not have to do a separate renaming of the NCBI column names back to the computer friendly mneumonics.Related issue(s)
Prompted by my review of dengue ingest PR nextstrain/dengue#13 (comment)
Checklist