Generalize prefix over HTTP and HTTPS URIs #156

Open
ddeboer opened this issue Nov 30, 2021 · 11 comments
Labels
enhancement New feature or request entailment SPARQL entailment regimes

Comments

@ddeboer

ddeboer commented Nov 30, 2021

Why?

Vocabularies are transitioning from HTTP to HTTPS URIs, for example Schema.org and CreativeCommons. Because the scheme is, unhappily, part of the URI, this change has implications for SPARQL queries. The problem will become more widespread as more vocabularies change their URI scheme.

When executing SPARQL queries against resources for which it can’t be predicted whether they will be using HTTP or HTTPS URIs, workarounds as well as normalization are necessary in client applications. For example:

  • perform the same query in a UNION, once with PREFIX schema: <http://schema.org/> and once with PREFIX schema: <https://schema.org/>;
  • or detect which HTTP scheme is used and adjust the query accordingly.
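The first workaround could look roughly like the sketch below. A second prefix label is declared for the HTTPS variant; the label `schemas:` is an assumption made for illustration and is not part of this proposal.

```sparql
PREFIX schema:  <http://schema.org/>
PREFIX schemas: <https://schema.org/>   # hypothetical label for the HTTPS variant

SELECT ?s WHERE {
  # Match resources typed with either spelling of the class IRI.
  { ?s a schema:Dataset }
  UNION
  { ?s a schemas:Dataset }
}
```

Every triple pattern that uses a Schema.org term has to be duplicated in this way, which is what makes the workaround tedious for larger queries.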

Previous work

None that I know of.

Proposed solution

Solve the problem generically at the SPARQL level, so that client-side workarounds are no longer necessary. For example, SPARQL could accept prefixes without a scheme that then match both HTTP and HTTPS URIs:

PREFIX schema: <schema.org/>

SELECT * WHERE { ?s a schema:Dataset }
# Returns both <https://example.com/resource> a <http://schema.org/Dataset> 
# as well as <https://example.com/resource> a <https://schema.org/Dataset> 

Considerations for backward compatibility

@JervenBolleman JervenBolleman added the enhancement New feature or request label Nov 30, 2021
@JervenBolleman
Collaborator

I think this is a very nice usability enhancement. I wonder whether this should be a generic adaptation, applying not just to the prefix declaration but to all IRI equality testing, or a special kind of entailment.

@JervenBolleman JervenBolleman added the entailment SPARQL entailment regimes label Nov 30, 2021
@namedgraph

Isn't this a bad practice by schema.org? If so, why should it be normalized?

@dbooth-boston
Collaborator

I agree that this could be viewed as a special kind of entailment, but I am very skeptical about using the PREFIX syntax for specifying it.

  1. It would break alignment with Turtle.
  2. Would this proposal treat <http://example.com/foo> as equivalent to <ftp://example.com/foo>? What about <urn:example.com/foo>?
  3. If schema:Dataset were used in an INSERT statement, what URI would be inserted?

If this kind of entailment is desired, I think it would be cleaner to treat it explicitly as a form of entailment, using existing mechanisms for specifying entailment regimes.

@ddeboer
Author

ddeboer commented Nov 30, 2021

> Isn't this a bad practice by schema.org? If so, why should it be normalized?

The problem is not specific to Schema.org, but relevant for all vocabularies etc. that want to migrate their URIs from HTTP to HTTPS.

> I agree that this could be viewed as a special kind of entailment, but I am very skeptical about using the PREFIX syntax for specifying it.

I agree with the downsides that you mention. Most important to me is solving this issue on the SPARQL-level, not which particular SPARQL solution is picked. So if we forget about the PREFIX approach for now, how would a solution look using entailment? Would that solution be:

  • concise enough to be usable (something like <https://schema.org/Article> owl:sameAs <http://schema.org/Article>, which would have to be repeated for all Schema.org things and properties);
  • generic enough as long as not all query engines support entailment (e.g. Comunica)?
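To give a sense of the repetition mentioned in the first bullet: the owl:sameAs links could be generated instead of written by hand. The following SPARQL 1.1 Update is only a rough sketch. It assumes the data uses http://schema.org/ IRIs as classes, and it says nothing about whether a given engine actually acts on owl:sameAs.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# For every http://schema.org/ class used in the data, assert that the
# corresponding https:// IRI denotes the same thing.
INSERT { ?httpsTerm owl:sameAs ?httpTerm }
WHERE {
  {
    SELECT DISTINCT ?httpTerm WHERE {
      ?s a ?httpTerm .
      FILTER(STRSTARTS(STR(?httpTerm), "http://schema.org/"))
    }
  }
  # Swap the scheme and turn the resulting string back into an IRI.
  BIND(IRI(CONCAT("https://", STRAFTER(STR(?httpTerm), "http://"))) AS ?httpsTerm)
}
```

Properties would need a similar pass over the predicate position, and the links only help on engines that apply owl:sameAs reasoning, which is the concern in the second bullet.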

@dbooth-boston
Collaborator

I have always viewed this problem as part of the usual need to normalize one's data as part of the data intake or ETL process. In other words, normalize those URIs to http: or https: before they are stored into your SPARQL server.

The normalization could also be done within a SPARQL server, using URI pattern matching and rewriting, etc., and storing the normalized result to a separate graph, but the SPARQL code that's needed to do that is a bit messy. URI munging is not SPARQL's strong suit.
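As an illustration of why this gets messy, here is one possible shape for such a rewrite as a SPARQL 1.1 Update. It is a sketch under assumptions: the target graph name <urn:example:normalized> is made up, only the Schema.org namespace is handled, and only predicates and objects are rewritten (subjects are copied unchanged).

```sparql
# Copy the default graph into a separate graph, rewriting
# http://schema.org/ IRIs to https://schema.org/ along the way.
INSERT { GRAPH <urn:example:normalized> { ?s ?p2 ?o2 } }
WHERE {
  ?s ?p ?o .
  # Predicates are always IRIs, so STR() is safe here.
  BIND(IF(STRSTARTS(STR(?p), "http://schema.org/"),
          IRI(CONCAT("https://schema.org/", STRAFTER(STR(?p), "http://schema.org/"))),
          ?p) AS ?p2)
  # Objects may be literals or blank nodes, so check isIRI() first.
  BIND(IF(isIRI(?o) && STRSTARTS(STR(?o), "http://schema.org/"),
          IRI(CONCAT("https://schema.org/", STRAFTER(STR(?o), "http://schema.org/"))),
          ?o) AS ?o2)
}
```

Each additional vocabulary needs its own pair of BIND expressions, which is part of the argument for doing this normalization in the ETL step instead.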

@JervenBolleman
Collaborator

For me this is a usability issue. It's easy to forget which ontology or dataset uses https and which uses http. When federating queries, it is even harder.

@mielvds

mielvds commented Dec 1, 2021

Ideally, this would be fixed on data intake and, ideally, we would use entailment. However, as a query client, you have no guarantees about the dataset or the entailment regime. Also, entailment is a rather complex way to solve such a common issue, and it can yield results that are surprising to the client ("how did these URIs get in here? They are nowhere in my query."). You can definitely have both, of course. So I agree with @JervenBolleman: this is about improving usability for the one who's writing the query.

I wonder whether you could have something like a UNION PREFIX, similar to what GraphQL has for types? UNION PREFIX s: <https://schema.org/> | <http://schema.org/>

@rubensworks
Member

I agree this is a usability issue that should be solved somehow, but I'm not a big fan of solutions that are based on modifying the query syntax (for the reasons listed by @dbooth-boston).

If I understand correctly, the suggested PREFIX extensions would only be able to cope with prefixed URLs defined in the query, but not within the dataset.
E.g., the following query would not produce the expected result if ?type in endpoint 1 is https://schema.org/Dataset and in endpoint 2 http://schema.org/Dataset:

PREFIX schema: <schema.org/>
SELECT * WHERE {
  SERVICE <urn:endpoint1> { ?s a ?type }
  SERVICE <urn:endpoint2> { ?s a ?type }
}
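For comparison, a query author can approximate the intended join in today's SPARQL by comparing the type IRIs with the scheme stripped. This is only a sketch of a client-side workaround, not the behaviour proposed in this issue; the variable names ?type1 and ?type2 are introduced here for illustration.

```sparql
SELECT * WHERE {
  SERVICE <urn:endpoint1> { ?s a ?type1 }
  SERVICE <urn:endpoint2> { ?s a ?type2 }
  # Join on the IRI string with a leading http:// or https:// removed.
  FILTER(REPLACE(STR(?type1), "^https?://", "") =
         REPLACE(STR(?type2), "^https?://", ""))
}
```

The comparison runs only after both endpoints have returned their results, and it still requires ?s to be identical across endpoints, so it does not address scheme differences in subject IRIs.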

I think introducing a dedicated (and lightweight) entailment regime might be acceptable for this. Especially since the implementation of this feature will require entailment in any case.

@afs
Collaborator

afs commented Dec 2, 2021

I agree that handling it at data ingestion and as an implementation feature is a better route. (The relative URI syntax is already legal!)
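On the parenthetical above: in current SPARQL, <schema.org/> is a relative IRI reference and is resolved against the base IRI, so the proposed syntax already has a meaning. A small sketch, where the base IRI http://example.org/ is an arbitrary choice for illustration:

```sparql
BASE <http://example.org/>
PREFIX schema: <schema.org/>

# schema:Dataset expands to <http://example.org/schema.org/Dataset>,
# not to either of the Schema.org IRIs.
SELECT * WHERE { ?s a schema:Dataset }
```

Reusing this syntax for scheme-agnostic matching would therefore change the meaning of queries that are legal today.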

I'm also not keen on addressing migration issues as a permanent feature of the language.

What would be good is a "practice and experience" note.

@VladimirAlexiev
Contributor

@JervenBolleman makes a very important point #156 (comment): this is only one aspect of IRI equality testing. Sadly, the same IRI written with and without percent-encoding is neither equal nor equivalent:

select (?iri1=?iri2 as ?equal) (sameTerm(?iri1,?iri2) as ?same) {
    values (?iri1 ?iri2) {(<urn:foo%2Dbar> <urn:foo-bar>)}
}

Most modern websites redirect http to https for any resource. I think this is good behavior.

I think that schema.org gives a mixed signal by promoting https variants of their semantic terms.
They have 2 versions of their ontology, but only an https version of their context.

But regardless of this mixed signal, thousands of website admins will use https in their data, and thousands more will use http.
So the problem @ddeboer raised is legitimate and important.

@VladimirAlexiev
Contributor

#158: IANA rebukes coap*

CoAP registers different URI schemes for accessing CoAP resources via different protocols. This approach runs counter to the WWW principle that a URI identifies a resource and that multiple URIs for identifying the same resource should be avoided

Curiously, it fails to render such rebuke for http: vs https:


8 participants