Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: add rule to create url column for accessions #76

Closed
joverlee521 opened this issue Dec 5, 2024 · 8 comments · Fixed by #80
Closed

ingest: add rule to create url column for accessions #76

joverlee521 opened this issue Dec 5, 2024 · 8 comments · Fixed by #80
Labels
enhancement New feature or request

Comments

@joverlee521
Copy link
Contributor

Context

Building on discussion in office hours as summarized by @j23414 in #20 (comment):

During office hours today, it was brought up that "genbank_accession" is automatically detected in auspice and generates a GenBank URL for the node callout. However, "accession" is not and does not get a valid link in the node call out. When spiking in non-genbank records (e.g. USVI for zika), we usually generate a url column in the metadata to get a valid link in the node call out.

It was proposed to generate a URL column during ingest workflow, which may subsequently work as-is in the node callout and easier to merge with non-genbank records.

Description

The ingest workflow should include steps to add a new url column.
Existing examples:

@j23414
Copy link
Contributor

j23414 commented Dec 11, 2024

Thinking out loud, there seem to be a couple of ways to implement this:

  1. Create a new rule (which also creates another intermediate file)
    1. example: 436805f - mostly moving the rule in this commit from the phylogenetic to the ingest workflow
    2. Creating a new rule would be consistent to the subset_metadata rule
    3. Would keep the [curate rule] mostly composed of "augur curate" commands
  2. Incorporate the url creation into the existing curate rule
    1. If we add url during the JSON parsing, it requires writing an append-url.py script
      1. Which would trigger adding said script to the vendored repo
      2. Which might trigger adding a general purpose append-x-cols.pyscript?
    2. If we add url after the JSON parsing but within the rule, this looks like creating an intermediate (metadata_curated.tsv?) upon which csvtk is called.

Just checking if there's any strong preferences between the implementation options?

@joverlee521
Copy link
Contributor Author

I think (1) is easier to understand and I'd prefer not to add another custom script...

If we want to add it to the existing curate rule that works with JSONs, we also have the option of adding the new column with jq

echo '{"accession":"123"}' | jq '. |= (.url="https://www.ncbi.nlm.nih.gov/nuccore/" + .accession)'

@j23414 j23414 linked a pull request Dec 11, 2024 that will close this issue
1 task
@genehack
Copy link
Contributor

I think (1) is easier to understand and I'd prefer not to add another custom script...

If we want to add it to the existing curate rule that works with JSONs, we also have the option of adding the new column with jq

echo '{"accession":"123"}' | jq '. |= (.url="https://www.ncbi.nlm.nih.gov/nuccore/" + .accession)'

+1 for option 1 and no additional custom scripts — the yellow fever version of this was easy to add

@j23414
Copy link
Contributor

j23414 commented Dec 12, 2024

After learning about it, I kinda like the jq suggestion as it bypasses writing and tracking yet-another custom script, as well as bypasses generating yet-another intermediate file. Checking if anyone feels there are drawbacks to using jq (e.g. are jq statements too difficult to understand and maintain)? I've pasted example jq implementations below

@genehack
Copy link
Contributor

Checking if anyone feels there are drawbacks to using jq (e.g. are jq statements too difficult to understand and maintain)?

Personally, I am not a huge fan of jq. We're already extensively using cvstk, and cvstk can be used to build the url column, so I think the question (again, from my POV) is more "why would we not use csvtk?"

@j23414
Copy link
Contributor

j23414 commented Dec 12, 2024

"why would we not use csvtk?"

My push-back againstcsvtk in this case is the creation of yet-another intermediate file. Could you elaborate on what experiences have led you to not be a huge fan of jq? I ask because I haven't used jq extensively so I'm wondering what the draw backs of the tool are (e.g. limited streaming, slowness in certain situations, memory limits, changes its api often, conflicts between jq variable names and snakemake or python or nextstrain-cli, etc...?). This also has the added benefit that I'll have an argument for or against using jq in subsequent code bases.

We might be able to satisfy both requirements (1) use csvtk and (2) bypass creation of yet-another-intermediate file by adding it to the "subset_metadata" rule:

But let me know if I'm the only one being a sticker on "bypass creation of yet-another intermediate file" (what preferred name do we want?), I'm willing to accept the memory bloat (yes I know temp() files exist) and the subsequent decision on naming a new file. I'm still hopeful of finding a solution that satisfies all expressed requirements (especially if this solution gets propagated across several repositories) but compromises are also acceptable. I'm mostly asking for more details

@genehack
Copy link
Contributor

"why would we not use csvtk?"

My push-back againstcsvtk in this case is the creation of yet-another intermediate file.

If that's a concern, temp() or integrating the csvtk steps into the existing curate rule cascade are both approaches that would work.

Could you elaborate on what experiences have led you to not be a huge fan of jq?

The syntax for it doesn't stick in my brain, and I constantly have to look it up; it's another, slightly different set of things to remember and understand, but it's not jq specifically I'm objecting to, it's adding yet another tool. I'd push back on sed in the same way.

@j23414
Copy link
Contributor

j23414 commented Dec 13, 2024

Okay, for the sake of:

  1. not adding yet-another tool
  2. the mental strain caused by incongruence between csvtk and jq syntax
  3. not confusing people by adding csvtk to existing csvtk calls in subset_metadata
  4. avoiding a json->csv->json transformation in the curate rule

I will revert back to option 1: 436805f

I will also try into incorporate the suggestion of defining an explicit genbank accession config value.
Thanks for the discussion all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants