ingest: simplify NCBI Datasets fields config #19

joverlee521 · 2023-11-28T01:35:32Z

Description of proposed changes

Instead of hard-coding the list of NCBI Datasets fields in the workflow, just provide the list via the default config. This makes it easy to customize which fields to include and makes it very obvious that the `field_map` config for the curation pipeline is changing the names of these NCBI fields.

This includes a change in the `format_ncbi_dataset_report` rule to use the provided fields as the header so that we do not have to do a separate renaming of the NCBI column names back to the computer-friendly mnemonics.
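To illustrate the header change described above, here is a minimal Python sketch, not the workflow's actual implementation. It assumes the report is a TSV whose first row holds human-readable column names in the same order as the configured fields. The function name `replace_header`, the constant `NCBI_DATASETS_FIELDS`, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the field list provided via the default config.
NCBI_DATASETS_FIELDS = ["accession", "sourcedb", "isolate-lineage"]


def replace_header(tsv_text, fields):
    """Swap the first row of a TSV for the configured field mnemonics.

    Assumes the columns are already in the same order as `fields`.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("empty report: no header row to replace")
    if len(rows[0]) != len(fields):
        raise ValueError(
            f"expected {len(fields)} columns, found {len(rows[0])}"
        )

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)      # configured mnemonics become the header
    writer.writerows(rows[1:])   # data rows pass through unchanged
    return out.getvalue()


# Illustrative input only; the header text and record are made up.
sample = (
    "Accession\tSource database\tIsolate Lineage\n"
    "XX000001\tGenBank\texample-strain\n"
)
print(replace_header(sample, NCBI_DATASETS_FIELDS))
```

Because the header comes straight from the configured list, the column names that reach the curation step are exactly the keys that `field_map` expects, with no intermediate renaming table to maintain.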

Related issue(s)

Prompted by my review of dengue ingest PR nextstrain/dengue#13 (comment)

Checklist

Checks pass

joverlee521 · 2023-11-28T01:37:58Z

ingest/config/defaults.yaml

+    accession: accession
+    sourcedb: database
+    sra-accs: sra_accessions
+    isolate-lineage: strain
+    geo-region: region
+    geo-location: location
+    isolate-collection-date: date
+    release-date: date_released
+    update-date: date_updated
+    length: length
+    host-name: host
+    isolate-lineage-source: sample_type
+    biosample-acc: biosample_accessions
+    submitter-names: authors
+    submitter-affiliation: institution
+    submitter-country: submitter_country


This will most likely change in the near future when we've resolved #20

[non blocking comment]

I lean toward changing `accession: accession` to `accession: genbank_accession` to be specific to NCBI data. That way, we can define a downstream step that merges NCBI data with other (or private) data and generates a harmonized accession record ID.

j23414

I like this direction, since defining the NCBI field map for the final column names in the config seems more efficient and explicit than maintaining a separate file (e.g. config/ncbi-field-map.tsv).
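As a rough sketch of how a config-defined field map could be applied to a record, the Python below uses a subset of the mappings from the diff above. The helper `rename_fields` is hypothetical, and passing unmapped fields through unchanged is an assumption rather than documented behavior of the curation pipeline.

```python
# Subset of the mappings shown in ingest/config/defaults.yaml above.
FIELD_MAP = {
    "accession": "accession",
    "sourcedb": "database",
    "isolate-lineage": "strain",
    "isolate-collection-date": "date",
}


def rename_fields(record, field_map):
    """Return a copy of `record` with keys renamed per `field_map`.

    Keys missing from the map are kept as-is (an assumption for this sketch).
    """
    return {field_map.get(key, key): value for key, value in record.items()}


# Illustrative record only; the values are made up.
record = {
    "accession": "XX000001",
    "sourcedb": "GenBank",
    "isolate-lineage": "example-strain",
}
print(rename_fields(record, FIELD_MAP))

# The rename suggested in the review becomes a one-line override.
ncbi_specific = {**FIELD_MAP, "accession": "genbank_accession"}
print(rename_fields(record, ncbi_specific))
```

Keeping the map in one place means a change such as the proposed `genbank_accession` rename touches a single config entry.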

joverlee521 requested a review from j23414 November 28, 2023 01:38

j23414 approved these changes Nov 28, 2023

j23414 mentioned this pull request Nov 28, 2023

Copy ingest nextstrain/dengue#13

Merged

2 tasks

joverlee521 mentioned this pull request Nov 29, 2023

Standardize output metadata fields for Nextstrain ingest #20

Open

joverlee521 marked this pull request as ready for review November 29, 2023 23:15

joverlee521 merged commit 87a1204 into main Nov 29, 2023

joverlee521 deleted the simplify-ncbi-fields branch November 29, 2023 23:15
