include ids for taxon-name #68

Open
myrmoteras opened this issue Nov 2, 2023 · 15 comments

@myrmoteras
Collaborator

annotation id

when CoL:

  • vocab-id
  • term-id
@myrmoteras
Collaborator Author

@flsimoes

flsimoes commented Nov 2, 2023

image

@tcatapano
Member

tcatapano commented Nov 2, 2023

TaxPub: use object-id
JATS 1.3: use named-content with @vocab-identifier and @vocab-term-identifier

note pending issue for addition of vocab attrs to tp:taxon-name plazi/TaxPub#83
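
As a rough illustration of the JATS 1.3 option, the sketch below builds a `named-content` element carrying `@vocab-identifier` and `@vocab-term-identifier`. The function name, the `content-type` value, and the vocabulary URL are assumptions made for the example, not something agreed in this thread; the name and ID are the CoL example used elsewhere in this issue.

```python
import xml.etree.ElementTree as ET


def named_content_taxon(name, vocab_identifier, term_identifier):
    """Build a JATS <named-content> element tagged with vocabulary attributes.

    Hypothetical helper: the attribute names follow JATS 1.3, the
    content-type value is an assumption.
    """
    elem = ET.Element(
        "named-content",
        {
            "content-type": "taxon-name",
            "vocab-identifier": vocab_identifier,
            "vocab-term-identifier": term_identifier,
        },
    )
    elem.text = name
    return elem


# The vocabulary URL below is a placeholder, not a confirmed identifier for CoL.
example = named_content_taxon(
    "B. abyssinica subsp. abyssinica",
    "https://www.catalogueoflife.org/",
    "BP8N3",
)
print(ET.tostring(example, encoding="unicode"))
```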

@flsimoes

flsimoes commented Nov 3, 2023

@myrmoteras Here's a CoL ID example

What Guido explained to me is that these have been added to new uploads since January 2023 and are now being retroactively added to the backlog via the Big Batch.

image

@flsimoes

flsimoes commented Nov 3, 2023

@tcatapano
Member

tcatapano commented Nov 3, 2023

Using current TaxPub markup, <object-id> (https://taxpub.org/v1-0/taglibrary/index.html#p=elem-object-id) in <taxon-name> seems like the best option; so:

<tp:taxon-name>
<object-id content-type="taxon-name" object-id-type="col-id">BP8N3</object-id>
B. abyssinica subsp. abyssinica</tp:taxon-name>
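
For anyone generating this markup programmatically, here is a minimal sketch of the same structure built with Python's standard library. The helper name is hypothetical, and the namespace URI bound to the `tp:` prefix is an assumption that should be checked against the TaxPub schema in use.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the tp: prefix; verify against your TaxPub schema.
TP_NS = "http://www.plazi.org/taxpub"
ET.register_namespace("tp", TP_NS)


def taxon_name_with_col_id(name, col_id):
    """Return a tp:taxon-name element whose first child is the CoL object-id."""
    taxon = ET.Element(f"{{{TP_NS}}}taxon-name")
    object_id = ET.SubElement(
        taxon,
        "object-id",
        {"content-type": "taxon-name", "object-id-type": "col-id"},
    )
    object_id.text = col_id
    # The name itself follows the object-id as mixed content.
    object_id.tail = name
    return taxon


elem = taxon_name_with_col_id("B. abyssinica subsp. abyssinica", "BP8N3")
print(ET.tostring(elem, encoding="unicode"))
```

Placing the identifier in a child element keeps the taxon name text intact, so consumers that ignore `object-id` still see the name.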

@tcatapano
Member

Current sample does not contain CoL ids, so this issue will have to wait for development

@myrmoteras
Collaborator Author

@flsimoes @gsautter can you explain why the CoL IDs are missing in this example? Is adding CoL IDs something new that is being done as part of the batch? I seem to remember that CoL IDs only became stable sometime this year.

@gsautter
Collaborator

gsautter commented Nov 28, 2023

@flsimoes @gsautter can you explain why the CoL IDs are missing in this example? Is adding CoL IDs something new that is being done as part of the batch? I seem to remember that CoL IDs only became stable sometime this year.

Which example are you talking about?

It's true that we've only been linking the treatment taxa and cited taxa to CoL since the start of this year, mainly, as you correctly say, because the CoL name IDs have only been stable since late 2022.
The linking of new articles is part of our standard batch processing now (since early October), and the old IMFs have been linked by the Big Batch. On top of that, whenever a treatment comes into SRS (new or as an update) and the taxon or a cited taxon is missing the CoL ID, there is a lookup, and the ID gets added at that point and written back to the IMF via the link write-back mechanism. All these in-routes for the links are alternatives, and all lead to the same result.

The linking of treatments that come into SRS originally was sort of preempting the Big Batch, and also serves as a means of adding the links after the fact, as at the time of the original IMF import, CoL might not have an ID for a given name just yet (and how could it in case of a newly published original description or new combination) ... this is why the linking on the way into SRS will stay active.
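
To make the "link on the way into SRS" idea concrete, here is a minimal sketch under stated assumptions: `TaxonName`, `link_missing`, and the `lookup` callable are hypothetical names invented for illustration and do not reflect the actual SRS or IMF code. The point is only that names already linked are skipped, unresolved names stay unlinked for a later pass, and the returned list is what a write-back step would need to handle.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaxonName:
    name: str
    col_id: Optional[str] = None  # None means "not linked yet"


def link_missing(
    names: list[TaxonName],
    lookup: Callable[[str], Optional[str]],
) -> list[TaxonName]:
    """Fill in CoL IDs only where missing and return the names that changed."""
    changed = []
    for taxon in names:
        if taxon.col_id is not None:
            continue  # already linked by an earlier route
        found = lookup(taxon.name)
        if found is not None:
            taxon.col_id = found
            changed.append(taxon)
        # Names CoL does not know yet stay unlinked and can be retried later.
    return changed


# Toy lookup table standing in for the real CoL lookup.
known = {"B. abyssinica subsp. abyssinica": "BP8N3"}
names = [
    TaxonName("B. abyssinica subsp. abyssinica"),
    TaxonName("Aus bus"),  # hypothetical newly described name, not in CoL yet
]
to_write_back = link_missing(names, known.get)
print([t.name for t in to_write_back])
```

Because already-linked names are skipped, running the same step again on a later update is harmless, which matches the point that all linking routes lead to the same result.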

@flsimoes

flsimoes commented Nov 29, 2023

I'm guessing Donat refers to Terry's "Current sample does not contain CoL ids" which, if memory serves, is from the list of papers I sent him during the last sprint, which means pre-Big Batch.

@gsautter
Collaborator

I'm guessing Donat refers to Terry's "Current sample does not contain CoL ids" which, if memory serves, is from the list of papers I sent him during the last sprint, which means pre-Big Batch.

Well, in that light, from more recent memory (Geneva in early November), the test set for articles is mainly focused on documents that don't contain treatments ...
But since all the current linking mechanisms only ever go at treatment taxa and cited taxa (treatment citations and type taxa), it should be easy to extrapolate that non-treatment names don't get linked right now ... which explains why there are no CoL links in documents that don't have treatments in them.

We can change that policy, of course, but please keep in mind that the taxon names that currently don't get linked are also subject to far less scrutiny in QC, and have lower error severities as well, so linking might not be as reliable if we don't also change (increase) outside-treatment taxon name QC.

@myrmoteras
Collaborator Author

only treatment taxa are linked with a COL-ID

all the rest is not linked.

  • shall we link? Do we insert a lot of false positives?
  • could be linked, but creates a lot of computing effort

@gsautter
Collaborator

Taxon names in treatment citations are linked as well, as are type species ... the only thing strictly restricted to treatment taxa is the additional link to the ENA/NCBI taxonomic backbone.

The cited effort mainly pertains to a huge number of API lookups (to ChecklistBank), not actual computations to be made ... the main objective of the current restriction to treatment taxa, treatment citations, and the likes of type species is to reduce the barrage of requests towards the ChecklistBank API.

@myrmoteras
Collaborator Author

@gsautter I don't understand the argument about lookups. I thought we have a local version of CLB and especially COL and thus would not have to use external lookups?

@gsautter
Collaborator

@gsautter I don't understand the argument about lookups. I thought we have a local version of CLB and especially COL and thus would not have to use external lookups?

We do have a local version of CoL, yes, built every year from the annual version ... however, CLB might well get ahead, so a miss in the local CoL needs following up with a CLB lookup.

The ENA taxon ID is yet another thing, and always requires a CLB lookup: that mapping changes too often to include it in CoL local, and it would also inflate the data structure, needlessly so for all applications except this specific type of lookup ... I'm always thinking about Jeremy's laptop in this sort of context: CoL local has to stay sufficiently slim to work on such machines ...
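
The local-first, remote-on-miss order described above can be sketched as follows. This is an illustration only: `resolve_col_id` is a hypothetical name, the local CoL copy is modelled as a plain dictionary, and the ChecklistBank call is represented by an arbitrary callable rather than any real client API.

```python
from typing import Callable, Optional


def resolve_col_id(
    name: str,
    local_index: dict[str, str],
    remote_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the local CoL copy first; fall back to the remote lookup on a miss."""
    hit = local_index.get(name)
    if hit is not None:
        return hit  # no API request needed
    return remote_lookup(name)


# Toy data: the local copy lags behind, the "remote" side knows one more name.
local_col = {"B. abyssinica subsp. abyssinica": "BP8N3"}
remote_calls = []


def fake_clb_lookup(name):
    """Stand-in for a ChecklistBank request; records each call made."""
    remote_calls.append(name)
    return {"Aus bus": "ZZZ99"}.get(name)  # made-up ID for illustration


print(resolve_col_id("B. abyssinica subsp. abyssinica", local_col, fake_clb_lookup))
print(resolve_col_id("Aus bus", local_col, fake_clb_lookup))
print(len(remote_calls))
```

Only the name missing from the local copy triggers a remote call, which is why widening linking to all names would mostly cost API requests rather than computation.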
