Handling of CURIEs that include square brackets don't conform to W3C specs and are incompatible with semantic web tools #103

cmungall · 2024-02-15T16:53:00Z

There are some cases of bioregistry "CURIEs" allowing square brackets in the local id. This is questionable if we follow the (IMO frustratingly opaque) W3C specs.

Here are some examples of what is permitted in bioregistry:

SMILES; e.g. smiles:CC(=O)NC([H])(C)C(=O)O
UCUM; see "Discussion about how to improve UCUM" (biopragmatics/bioregistry#648)

(it is of course a stretch to call these IDs (biopragmatics/bioregistry#460))

These work perfectly well in the context of bioregistry; clicking on this will resolve to a nice picture of a molecule, which is what most bioregistry users want.

https://bioregistry.io/reference/smiles:CC(=O)NC([H])(C)C(=O)O

Let's see what happens when we try to use this with tooling that actually supports W3C specs:

{
  "@context": {
    "@base": "http://example.org",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "smiles": "https://bioregistry.io/smiles:"
  },
  "@id": "smiles:CC(=O)NC([H])(C)C(=O)O",
  "@type": "Molecule",
  "rdfs:label": "Acetaminophen"
}

using Jena:

riot --strict smiles.jsonld
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Molecule> .
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/2000/01/rdf-schema#label> "Acetaminophen" .

Not pretty, but it does process the file, even in strict mode.

However, it refuses to validate it:

riot --validate smiles.jsonld || echo fail
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
fail

In contrast, https://json-ld.org/playground/ does not complain.

I suspect the Rust toolchains are stricter.

Removing or escaping the []s allows it to validate (note that ()s are frequently URL-encoded, but they are still valid unencoded).
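As a minimal sketch of what "escaping the []s" could look like, the snippet below percent-encodes the local ID with Python's standard library. The choice of which characters to leave unescaped (`SAFE`) is an assumption made for illustration, not something bioregistry or curies prescribes.

```python
from urllib.parse import quote, unquote

# Assumption for illustration: keep "(", ")" and "=" literal, since they are
# legal in an IRI path; everything else outside the unreserved set is encoded.
SAFE = "()="

local_id = "CC(=O)NC([H])(C)C(=O)O"

# "[" becomes %5B and "]" becomes %5D; the rest of the string is unchanged.
encoded = quote(local_id, safe=SAFE)
print(encoded)  # CC(=O)NC(%5BH%5D)(C)C(=O)O

# Decoding recovers the original local ID.
assert unquote(encoded) == local_id
```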

What are our options?

1. Make curies always strict. Forbid [] or encodings thereof. These are poor choices for bona fide IDs. Don't try to overload the CURIE concept for languages like HGVS, UCUM, SMILES, InChI, etc.
2. Go your own way. Explicitly document that curies isn't for CURIEs as defined by W3C specs; it's just prefixed IDs that expand to URLs that work in browsers, with no commitments to any specifications outside those in this repo.
3. Make curies conform to W3C specs, and force []s to be encoded (as the UOM people are doing for UCUM; see biopragmatics/bioregistry#648). This could retroactively break things and confuse people who want to use curies in its intended YOLO fashion.
4. Attempt some formalization where we have loose CURIEs and strict CURIEs and a formal mapping between them (basically URL-encoding []s, and probably spaces while we are at it).

I think these are all horrible but then I've always said the decision to couple identifiers to networking protocols was a terrible one.

I think option 4 is likely the most practical, but this will take some careful planning. There will essentially be the following transforms:

 looseCURIE <-> strictCURIE
     ^     \   /     ^
     |       X       |
     v     /   \     v
 looseURI   <-> strictURI

(likely implemented with flags on existing expand/contract, with new methods for like-to-like)
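The transforms could be sketched roughly as follows. This is a standalone illustration, not the curies package API: the function names, the `strict` flag, the `SAFE` character set and the `PREFIX_MAP` are all hypothetical, and the URI prefix is simply copied from the JSON-LD context earlier in this issue.

```python
from urllib.parse import quote, unquote

# Hypothetical prefix map, reusing the URI prefix from the JSON-LD example.
PREFIX_MAP = {"smiles": "https://bioregistry.io/smiles:"}
SAFE = "()="  # assumption: characters left unescaped in strict form


def _encode(local: str, strict: bool) -> str:
    # Normalise to the loose form first so that strict input is not
    # double-encoded. Caveat: a loose ID containing a literal percent
    # sequence would be ambiguous under this scheme.
    loose = unquote(local)
    return quote(loose, safe=SAFE) if strict else loose


def convert_curie(curie: str, strict: bool) -> str:
    """Like-to-like: looseCURIE <-> strictCURIE."""
    prefix, _, local = curie.partition(":")
    return f"{prefix}:{_encode(local, strict)}"


def expand(curie: str, strict: bool = False) -> str:
    """CURIE (loose or strict) -> URI in the requested form."""
    prefix, _, local = curie.partition(":")
    return PREFIX_MAP[prefix] + _encode(local, strict)


def contract(uri: str, strict: bool = False) -> str:
    """URI (loose or strict) -> CURIE in the requested form."""
    for prefix, uri_prefix in PREFIX_MAP.items():
        if uri.startswith(uri_prefix):
            return f"{prefix}:{_encode(uri[len(uri_prefix):], strict)}"
    raise ValueError(f"no matching prefix for {uri}")


loose_curie = "smiles:CC(=O)NC([H])(C)C(=O)O"
strict_uri = expand(loose_curie, strict=True)      # diagonal: loose CURIE -> strict URI
round_trip = contract(strict_uri, strict=False)    # diagonal: strict URI -> loose CURIE
```

In this sketch the diagonals fall out of the `strict` flag on `expand`/`contract`, while `convert_curie` covers the horizontal CURIE-to-CURIE edge; the URI-to-URI edge could be handled the same way.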

What is annoying is that there is AFAICT no way to get JSON-LD contexts to specify the diagonal conversion.


cthoyt · 2024-02-27T12:39:33Z

@cmungall thanks for the comment.

I don't think that making this package strict by default will make many people happy; almost everyone in this space is in YOLO mode.

However, CURIEs can be used in both a "correct" way and an incorrect way, and that choice belongs to the user. We can try to help them make better choices by providing an alternate implementation of the Converter class that follows strict rules and also provides some appropriate utilities for encoding CURIEs.
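To make the idea concrete, here is a rough sketch of a strict-checking converter. `StrictConverter` is a hypothetical standalone class, not the existing curies `Converter`, and its regular expression only approximates the RFC 3986 path-character rule for ASCII input; it is not a full CURIE or IRI validator.

```python
import re

# ASCII approximation of RFC 3986 path characters: unreserved, sub-delims,
# ":", "@", "/" and percent-encoded triplets. Square brackets are excluded.
PATH_CHARS = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/]|%[0-9A-Fa-f]{2})*")


class StrictConverter:
    """Hypothetical converter that can reject local IDs with illegal characters."""

    def __init__(self, prefix_map: dict[str, str]) -> None:
        self.prefix_map = prefix_map

    def expand(self, curie: str, strict: bool = True) -> str:
        prefix, sep, local = curie.partition(":")
        if not sep or prefix not in self.prefix_map:
            raise ValueError(f"unknown or missing prefix in {curie!r}")
        if strict and not PATH_CHARS.fullmatch(local):
            raise ValueError(f"local ID is not a valid path: {local!r}")
        return self.prefix_map[prefix] + local


converter = StrictConverter({"smiles": "https://bioregistry.io/smiles:"})

# The percent-encoded form passes the strict check.
ok = converter.expand("smiles:CC(=O)NC(%5BH%5D)(C)C(=O)O")

# The raw form is only accepted when the caller opts out of strictness.
yolo = converter.expand("smiles:CC(=O)NC([H])(C)C(=O)O", strict=False)
```

Calling `converter.expand("smiles:CC(=O)NC([H])(C)C(=O)O")` with the default `strict=True` raises `ValueError` in this sketch, which mirrors the kind of complaint Jena makes about the bracketed IRI.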

cmungall mentioned this issue Feb 26, 2024

Discussion about how to improve UCUM biopragmatics/bioregistry#648

Open

cthoyt mentioned this issue Feb 28, 2024

Add option for checking w3c specification on expand() #104

Open

cmungall mentioned this issue Mar 4, 2024

Use SPARQL production rules for prefixed and unprefixed identifiers owlcollab/oboformat#150

Open

cmungall mentioned this issue May 18, 2024

LinkML validator should provide a way to verify that entity references are correct linkml/linkml#2116

Closed
