There are some cases of bioregistry "CURIEs" allowing square brackets in the local id. This is questionable if we follow the (IMO frustratingly opaque) W3C specs.
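Here is an example of what is permitted in bioregistry:
smiles:CC(=O)NC([H])(C)C(=O)O
(it is of course a stretch to call these IDs; see biopragmatics/bioregistry#460)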
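These work perfectly well in the context of bioregistry; clicking on this will resolve to a nice picture of a molecule, which is what most bioregistry users want:
https://bioregistry.io/reference/smiles:CC(=O)NC([H])(C)C(=O)O
Let's see what happens when we try to use this with tooling that actually supports W3C specs, using Jena:
riot --strict smiles.jsonld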
16:33:06 WARN riot :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:33:06 WARN riot :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Molecule> .
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/2000/01/rdf-schema#label> "Acetaminophen" .
Not pretty, but it does process it, even in strict mode.
However, it refuses to validate it:
riot --validate smiles.jsonld || echo fail
16:38:10 WARN riot :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:38:10 WARN riot :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
fail
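In contrast, https://json-ld.org/playground/ does not complain. I suspect the Rust toolchains are stricter.
Removing or escaping the []s allows it to validate (note that ()s are frequently URL-encoded, but they are still valid).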
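As a rough illustration of the escaping step, here is a minimal Python sketch using only the standard library. The helper name `escape_local_id` and the `SAFE` character set are assumptions made for this example (the set follows the `pchar` rule of RFC 3986); none of this is part of curies or bioregistry.

```python
from urllib.parse import quote

# Assumption: leave the RFC 3986 sub-delims plus ":" and "@" unescaped,
# since they are legal in a path segment. Everything else outside the
# unreserved set (including "[", "]" and spaces) gets percent-encoded.
SAFE = "!$&'()*+,;=:@"


def escape_local_id(local_id: str) -> str:
    """Percent-encode characters that are not allowed in a URI path segment."""
    return quote(local_id, safe=SAFE)


uri = "https://bioregistry.io/smiles:" + escape_local_id("CC(=O)NC([H])(C)C(=O)O")
print(uri)
```

Note that the ()s are left alone and only the []s change, which matches the observation above.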
What are our options?
1. Make curies always strict. Forbid [] or encodings thereof. These are poor choices for bona fide IDs. Don't try to overload the CURIE concept for languages like HGVS, UCUM, SMILES, InChI, etc.
2. Go your own way. Explicitly document that curies isn't for CURIEs as defined by W3C specs; it's just prefixed IDs that expand to URLs that work in browsers, with no commitments to any specifications outside those in this repo.
3. Make curies conform to W3C specs, and force []s to be encoded (as the UOM people are doing for UCUM; see "Discussion about how to improve UCUM", bioregistry#648). This could retroactively break things, and confuse people who want to use curies in its intended YOLO fashion.
4. Attempt some formalization where we have loose CURIEs and strict CURIEs and a formal mapping between them (basically URL-encoding []s, and probably spaces while we are at it).
I think these are all horrible but then I've always said the decision to couple identifiers to networking protocols was a terrible one.
I think 4 is likely the most practical, but this will take some careful planning. There will essentially be the following transforms:
looseCURIE <-> strictCURIE
    ^    \   /    ^
    |      X      |
    v    /   \    v
looseURI   <->  strictURI
(likely implemented with flags on existing expand/contract, with new methods for like-to-like)
What is annoying is that there is AFAICT no way to get json-ld-contexts to specify the diagonal conversion
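The transforms above, including the diagonals, could be sketched roughly as follows. This is only an illustration under stated assumptions: `PREFIX_MAP`, `expand`, `contract`, `to_strict_curie`, `to_loose_curie` and the `strict_in`/`strict_out` flags are hypothetical names invented here, not the existing curies API, and "strict" is taken to mean percent-encoding everything outside the RFC 3986 `pchar` set.

```python
from urllib.parse import quote, unquote

# Hypothetical prefix map, for illustration only.
PREFIX_MAP = {"smiles": "https://bioregistry.io/smiles:"}
# Assumption: characters legal in a path segment stay unescaped.
SAFE = "!$&'()*+,;=:@"


def _convert(local_id: str, strict_in: bool, strict_out: bool) -> str:
    """Move a local id between the loose and strict forms."""
    if strict_in:
        local_id = unquote(local_id)  # strict -> loose
    if strict_out:
        local_id = quote(local_id, safe=SAFE)  # loose -> strict
    return local_id


def expand(curie: str, *, strict_in: bool = False, strict_out: bool = False) -> str:
    """CURIE -> URI; differing flags give the diagonal conversions."""
    prefix, _, local_id = curie.partition(":")
    return PREFIX_MAP[prefix] + _convert(local_id, strict_in, strict_out)


def contract(uri: str, *, strict_in: bool = False, strict_out: bool = False) -> str:
    """URI -> CURIE; differing flags give the diagonal conversions."""
    for prefix, uri_prefix in PREFIX_MAP.items():
        if uri.startswith(uri_prefix):
            local_id = uri[len(uri_prefix):]
            return f"{prefix}:{_convert(local_id, strict_in, strict_out)}"
    raise ValueError(f"no known prefix for {uri}")


def to_strict_curie(curie: str) -> str:
    """Like-to-like: loose CURIE -> strict CURIE."""
    prefix, _, local_id = curie.partition(":")
    return f"{prefix}:{_convert(local_id, False, True)}"


def to_loose_curie(curie: str) -> str:
    """Like-to-like: strict CURIE -> loose CURIE."""
    prefix, _, local_id = curie.partition(":")
    return f"{prefix}:{_convert(local_id, True, False)}"


loose = "smiles:CC(=O)NC([H])(C)C(=O)O"
print(expand(loose, strict_out=True))  # diagonal: loose CURIE -> strict URI
print(to_strict_curie(loose))
```

One design caveat: the caller has to say which form the input is in, because a string containing `%5B` could be either a strict id or a loose id that happens to contain a literal percent sign.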
I don't think that making this package strict by default will make many people happy; almost everyone in this space is in YOLO mode.
However, CURIEs can be used in both a "correct" way and an incorrect way, and that is the user's choice. We can try to help them make better choices by providing an alternate implementation of the Converter class that follows strict rules and also provides some appropriate utilities for encoding CURIEs.
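To make the idea of a strict alternative concrete, here is a minimal standalone sketch. `StrictConverter` is a hypothetical class written for this comment; it does not subclass or reuse the real curies Converter, and it only checks the ASCII URI grammar (RFC 3986 `pchar`), not the wider IRI character range.

```python
import re

# pchar per RFC 3986: unreserved / pct-encoded / sub-delims / ":" / "@"
_STRICT_LOCAL_ID = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})*")


class StrictConverter:
    """Hypothetical converter that refuses local ids with illegal characters."""

    def __init__(self, prefix_map: dict[str, str]) -> None:
        self.prefix_map = dict(prefix_map)

    def expand(self, curie: str) -> str:
        prefix, sep, local_id = curie.partition(":")
        if not sep or prefix not in self.prefix_map:
            raise ValueError(f"unknown or missing prefix in {curie!r}")
        if not _STRICT_LOCAL_ID.fullmatch(local_id):
            raise ValueError(f"local id is not a valid URI path segment: {local_id!r}")
        return self.prefix_map[prefix] + local_id


converter = StrictConverter({"smiles": "https://bioregistry.io/smiles:"})
print(converter.expand("smiles:CC(=O)NC(%5BH%5D)(C)C(=O)O"))
try:
    converter.expand("smiles:CC(=O)NC([H])(C)C(=O)O")
except ValueError as error:
    print("rejected:", error)
```

Raising on bad input, rather than silently encoding it, keeps the strict class predictable; the encoding utilities can then live alongside it as explicit, opt-in helpers.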