ingest: fix csvtk quotes #58

Merged
merged 1 commit into from
May 2, 2024
Contributor

@joverlee521 joverlee521 commented May 2, 2024

The automated ingest workflow failed with a csvtk quoting error.¹ Following nextstrain/docker-base#209, we can now use `csvtk fix-quotes` and `csvtk del-quotes` to work around the quoting issue.

¹ https://github.com/nextstrain/zika/actions/runs/8926866948/job/24518932039#step:8:139

Contributor

@j23414 j23414 left a comment

Changes look good to me

@joverlee521
Contributor Author

The error was caused by a new Zika record that has internal quotes in `submitter.affiliation`:

{"accession": "OR701943.1", "completeness": "PARTIAL", "host": {"lineage": [{"name": "cellular organisms", "taxId": 131567}, {"name": "Eukaryota", "taxId": 2759}, {"name": "Opisthokonta", "taxId": 33154}, {"name": "Metazoa", "taxId": 33208}, {"name": "Eumetazoa", "taxId": 6072}, {"name": "Bilateria", "taxId": 33213}, {"name": "Protostomia", "taxId": 33317}, {"name": "Ecdysozoa", "taxId": 1206794}, {"name": "Panarthropoda", "taxId": 88770}, {"name": "Arthropoda", "taxId": 6656}, {"name": "Mandibulata", "taxId": 197563}, {"name": "Pancrustacea", "taxId": 197562}, {"name": "Hexapoda", "taxId": 6960}, {"name": "Insecta", "taxId": 50557}, {"name": "Dicondylia", "taxId": 85512}, {"name": "Pterygota", "taxId": 7496}, {"name": "Neoptera", "taxId": 33340}, {"name": "Endopterygota", "taxId": 33392}, {"name": "Diptera", "taxId": 7147}, {"name": "Nematocera", "taxId": 7148}, {"name": "Culicomorpha", "taxId": 43786}, {"name": "Culicoidea", "taxId": 41827}, {"name": "Culicidae", "taxId": 7157}, {"name": "Culicinae", "taxId": 43817}, {"name": "Aedini", "taxId": 1056966}, {"name": "Aedes", "taxId": 7158}, {"name": "Stegomyia", "taxId": 53541}, {"name": "Aedes aegypti", "taxId": 7159}], "organismName": "Aedes aegypti", "taxId": 7159}, "isAnnotated": true, "isolate": {"collectionDate": "2021-11-11", "name": "6PYUC2022"}, "length": 217, "location": {"geographicLocation": "Mexico: Yucatan, Merida", "geographicRegion": "North America"}, "nucleotide": {"sequenceHash": "6FD6033C"}, "proteinCount": 1, "releaseDate": "2024-05-01T00:00:00Z", "sourceDatabase": "GenBank", "submitter.affiliation": "Centro de Investigaciones Regionales \"Dr. Hideyo Noguchi\", Laboratorio de Arbovirologia", "submitter.country": "Mexico", "submitter.names": ["Argaez-Sierra,D.G.", "Baak-Baak,C.M.", "Cigarroa-Toledo,N.", "Garcia-Rejon,J.E.", "Tzuc-Dzul,J.C.", "Acosta-Viana,K.Y.", "Nunez-Corea,D.A."], "updateDate": "2024-05-01T00:00:00Z", "virus": {"lineage": [{"name": "Viruses", "taxId": 10239}, {"name": "Riboviria", "taxId": 2559587}, {"name": "Orthornavirae", "taxId": 2732396}, {"name": "Kitrinoviricota", "taxId": 2732406}, {"name": "Flasuviricetes", "taxId": 2732462}, {"name": "Amarillovirales", "taxId": 2732545}, {"name": "Flaviviridae", "taxId": 11050}, {"name": "Orthoflavivirus", "taxId": 3044782}, {"name": "Orthoflavivirus zikaense", "taxId": 3048459}, {"name": "Zika virus", "taxId": 64320}], "organismName": "Zika virus", "taxId": 64320}}
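
For readers unfamiliar with how the quotes end up in the TSV, here is a minimal Python sketch. It is not part of the ingest workflow, and the two-key record is a trimmed stand-in for the full record above. It only shows that the JSON-escaped `\"` decodes to literal double quotes, which then sit unquoted inside a tab-separated field.

```python
import json

# Trimmed stand-in for the NDJSON record above: only two of its keys are kept.
record_line = r'{"accession": "OR701943.1", "submitter.affiliation": "Centro de Investigaciones Regionales \"Dr. Hideyo Noguchi\", Laboratorio de Arbovirologia"}'

record = json.loads(record_line)

# After decoding, the escaped \" sequences are plain double-quote characters.
affiliation = record["submitter.affiliation"]

# Joining fields with tabs and no quoting leaves those quotes bare in the row,
# which is the kind of field a strict CSV/TSV parser may reject.
tsv_row = "\t".join([record["accession"], affiliation])
print(tsv_row)
```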

I confirmed locally that the output for format_ncbi_dataset_report has the correct quoting in submitter-affiliation.

accession	accession-rev	sourcedb	sra-accs	isolate-lineage	geo-region	geo-location	isolate-collection-date	release-date	update-date	length	host-name	isolate-lineage-source	biosample-acc	submitter-names	submitter-affiliation	submitter-country
OR701943	OR701943.1	GenBank		6PYUC2022	North America	Mexico: Yucatan, Merida	2021-11-11	2024-05-01T00:00:00Z	2024-05-01T00:00:00Z	217	Aedes aegypti			Argaez-Sierra,D.G.,Baak-Baak,C.M.,Cigarroa-Toledo,N.,Garcia-Rejon,J.E.,Tzuc-Dzul,J.C.,Acosta-Viana,K.Y.,Nunez-Corea,D.A.	Centro de Investigaciones Regionales "Dr. Hideyo Noguchi", Laboratorio de Arbovirologia	Mexico

The final metadata.tsv has double quoting in the institution column, but this is due to an augur curate passthru bug.

genbank_accession	genbank_accession_rev	strain	date	region	country	division	location	length	host	release_date	update_date	sra_accessions	authors	institution
OR701943	OR701943.1	6PYUC2022	2021-11-11	North America	Mexico	Yucatan	Merida	217	Aedes aegypti	2024-05-01	2024-05-01		Argaez-Sierra et al	"Centro de Investigaciones Regionales ""Dr. Hideyo Noguchi"", Laboratorio de Arbovirologia"
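
As an illustration of where this shape of output can come from, the sketch below uses Python's standard `csv` module. It is not augur's code, and it makes no claim about where the bug lives. It only shows that a writer using the default CSV dialect rules wraps a field containing quotes in quotes and doubles the internal ones, and that a reader using the same rules recovers the original value.

```python
import csv
import io

affiliation = 'Centro de Investigaciones Regionales "Dr. Hideyo Noguchi", Laboratorio de Arbovirologia'

# Write one tab-delimited row with the default quoting rules
# (QUOTE_MINIMAL with doublequote=True).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["OR701943", affiliation])
written = buffer.getvalue()
print(written, end="")

# Reading the row back with the same dialect undoes the quoting.
rows = list(csv.reader(io.StringIO(written), delimiter="\t"))
```

Tools that read the file as plain tab-separated text, without CSV quote handling, would see the doubled quotes literally.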

@joverlee521
Contributor Author

Merging to get our ingest going again, but I'll loop back to the augur curate issue.

@joverlee521 joverlee521 merged commit 29044f0 into main May 2, 2024
41 checks passed
@joverlee521 joverlee521 deleted the fix-csvtk branch May 2, 2024 18:59
@joverlee521
Contributor Author

Manually triggered ingest-to-phylogenetic

j23414 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request May 2, 2024
We can now use `csvtk fix-quotes` and `csvtk del-quotes` to work around
quoting issues (e.g. internal quotes in the submitter.affiliation).

Copied commit from Zika repo:

* nextstrain/zika#58
@j23414 j23414 mentioned this pull request Dec 9, 2024
2 tasks