ingest: add rule to create `url` column for accessions #76

joverlee521 · 2024-12-05T21:14:54Z

Context

Building on discussion in office hours as summarized by @j23414 in #20 (comment):

> During office hours today, it was brought up that "genbank_accession" is automatically detected in Auspice and generates a GenBank URL for the node callout. However, "accession" is not, and does not get a valid link in the node callout. When spiking in non-GenBank records (e.g. USVI for zika), we usually generate a url column in the metadata to get a valid link in the node callout.
>
> It was proposed to generate a url column during the ingest workflow, which may subsequently work as-is in the node callout and be easier to merge with non-GenBank records.

Description

The ingest workflow should include steps to add a new url column.
Existing examples:


j23414 · 2024-12-11T17:48:45Z

Thinking out loud, there seem to be a couple of ways to implement this:

1. Create a new rule (which also creates another intermediate file)
   1. Example: 436805f, which mostly moves the rule in this commit from the phylogenetic to the ingest workflow
   2. Creating a new rule would be consistent with the subset_metadata rule
   3. Would keep the curate rule mostly composed of "augur curate" commands
2. Incorporate the url creation into the existing curate rule
   1. If we add url during the JSON parsing, it requires writing an append-url.py script
      1. Which would trigger adding said script to the vendored repo
      2. Which might trigger adding a general purpose append-x-cols.py script?
   2. If we add url after the JSON parsing but within the rule, this looks like creating an intermediate (metadata_curated.tsv?) upon which csvtk is called.

Just checking whether there are any strong preferences between the implementation options?
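
To make the options concrete: whichever rule it ends up in, the TSV-side transformation is small. Below is a minimal sketch in Python, not what commit 436805f does. It assumes the metadata has a column named `accession` and reuses the NCBI nuccore URL prefix from the jq example elsewhere in this thread; the function name `add_url_column` and its parameters are hypothetical.

```python
import csv
import io

# Assumption: GenBank records resolve under the NCBI nuccore prefix.
NUCCORE_PREFIX = "https://www.ncbi.nlm.nih.gov/nuccore/"


def add_url_column(tsv_text, id_column="accession", url_column="url"):
    """Return the TSV text with a url column derived from id_column.

    Hypothetical helper for illustration only. Rows with an empty
    accession get an empty url; an existing url column is overwritten.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if url_column not in fieldnames:
        fieldnames.append(url_column)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        accession = row.get(id_column) or ""
        row[url_column] = NUCCORE_PREFIX + accession if accession else ""
        writer.writerow(row)
    return out.getvalue()
```

Non-GenBank records (e.g. the USVI ones) would need their own url values, so in practice the prefix would probably have to be configurable or applied only to GenBank rows.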

joverlee521 · 2024-12-11T18:18:33Z

I think (1) is easier to understand and I'd prefer not to add another custom script...

If we want to add it to the existing curate rule that works with JSONs, we also have the option of adding the new column with jq

echo '{"accession":"123"}' | jq '. |= (.url="https://www.ncbi.nlm.nih.gov/nuccore/" + .accession)'
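
For anyone less familiar with jq, the filter above sets a `url` field on each record. Roughly the same transformation in Python, as a sketch that assumes newline-delimited JSON records each carrying a string `accession` field (the function name `append_url` is hypothetical, not an existing script):

```python
import json

NUCCORE_PREFIX = "https://www.ncbi.nlm.nih.gov/nuccore/"


def append_url(ndjson_lines):
    """Yield each NDJSON record re-serialized with an added url field."""
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        # Same string concatenation as the jq filter.
        record["url"] = NUCCORE_PREFIX + record["accession"]
        yield json.dumps(record)


print(next(append_url(['{"accession":"123"}'])))
# {"accession": "123", "url": "https://www.ncbi.nlm.nih.gov/nuccore/123"}
```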

genehack · 2024-12-11T23:53:17Z

> I think (1) is easier to understand and I'd prefer not to add another custom script...
>
> If we want to add it to the existing curate rule that works with JSONs, we also have the option of adding the new column with jq
>
> echo '{"accession":"123"}' | jq '. |= (.url="https://www.ncbi.nlm.nih.gov/nuccore/" + .accession)'

+1 for option 1 and no additional custom scripts — the yellow fever version of this was easy to add

j23414 · 2024-12-12T16:16:51Z

After learning about it, I kinda like the jq suggestion as it bypasses writing and tracking yet-another custom script, as well as bypasses generating yet-another intermediate file. Checking if anyone feels there are drawbacks to using jq (e.g. are jq statements too difficult to understand and maintain)? I've pasted example jq implementations below

For guide: 58a9a99
For zika: Ingest: Derive url and use accession fields during ingest zika#78

genehack · 2024-12-12T18:51:56Z

> Checking if anyone feels there are drawbacks to using jq (e.g. are jq statements too difficult to understand and maintain)?

Personally, I am not a huge fan of jq. We're already extensively using csvtk, and csvtk can be used to build the url column, so I think the question (again, from my POV) is more "why would we not use csvtk?"

j23414 · 2024-12-12T19:34:40Z

> "why would we not use csvtk?"

My push-back against csvtk in this case is the creation of yet-another intermediate file. Could you elaborate on what experiences have led you to not be a huge fan of jq? I ask because I haven't used jq extensively, so I'm wondering what the drawbacks of the tool are (e.g. limited streaming, slowness in certain situations, memory limits, changes its API often, conflicts between jq variable names and snakemake or python or nextstrain-cli, etc...?). This also has the added benefit that I'll have an argument for or against using jq in subsequent code bases.

We might be able to satisfy both requirements (1) use csvtk and (2) bypass creation of yet-another-intermediate file by adding it to the "subset_metadata" rule:

Here:

pathogen-repo-guide/ingest/rules/curate.smk

Line 129 in a443038

csvtk cut -t -f {params.metadata_fields} \

But let me know if I'm the only one being a stickler on "bypass creation of yet-another intermediate file" (what preferred name do we want?). I'm willing to accept the memory bloat (yes, I know temp() files exist) and the subsequent decision on naming a new file. I'm still hopeful of finding a solution that satisfies all expressed requirements (especially if this solution gets propagated across several repositories), but compromises are also acceptable. I'm mostly asking for more details.

genehack · 2024-12-12T20:26:07Z

> "why would we not use csvtk?"
>
> My push-back against csvtk in this case is the creation of yet-another intermediate file.

If that's a concern, temp() or integrating the csvtk steps into the existing curate rule cascade are both approaches that would work.

> Could you elaborate on what experiences have led you to not be a huge fan of jq?

The syntax for it doesn't stick in my brain, and I constantly have to look it up; it's another, slightly different set of things to remember and understand, but it's not jq specifically I'm objecting to, it's adding yet another tool. I'd push back on sed in the same way.

j23414 · 2024-12-13T16:00:13Z

Okay, for the sake of:

- not adding yet-another tool
- the mental strain caused by incongruence between csvtk and jq syntax
- not confusing people by adding csvtk to existing csvtk calls in subset_metadata
- avoiding a json->csv->json transformation in the curate rule

I will revert back to option 1: 436805f

I will also try to incorporate the suggestion of defining an explicit GenBank accession config value.
Thanks for the discussion all!

joverlee521 added the enhancement label Dec 5, 2024

joverlee521 mentioned this issue Dec 5, 2024: ingest: include url field nextstrain/mpox#76 (Open)

j23414 mentioned this issue Dec 9, 2024: Move url and accession column generation to ingest nextstrain/zika#77 (Open)

j23414 mentioned this issue Dec 11, 2024: Ingest: Derive URL column during ingest #80 (Merged)

j23414 linked a pull request Dec 11, 2024 that will close this issue: Ingest: Derive URL column during ingest #80 (Merged)

j23414 closed this as completed in #80 Dec 16, 2024
