SPARQL CONSTRUCT support #42

Open
tirrolo opened this issue Mar 10, 2024 · 21 comments
Labels
bug, help wanted, working-group

Comments

@tirrolo

tirrolo commented Mar 10, 2024

Consider the example in the specification:

<#SPARQLEndpoint> a rml:LogicalSource;
    rml:source [ a rml:Source, sd:Service;
        sd:endpoint  <http://example.com/sparql>;
        sd:supportedLanguage sd:SPARQL11Query;
    ];
    rml:iterator "CONSTRUCT WHERE { ?s ?p ?o. } LIMIT 100";
    rml:referenceFormulation formats:SPARQL_Results_CSV;
.

Since the iterator uses a CONSTRUCT query, which returns an RDF graph rather than a table of bindings, the reference formulation cannot be SPARQL_Results_CSV. I suggest either using the SPARQL SELECT form or changing the reference formulation(?).
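
A minimal sketch of the mismatch, in Python with a toy in-memory dataset rather than a real endpoint. The helper names `select_all`, `construct_all`, and `bindings_to_csv` are hypothetical stand-ins, not part of RML or any library: a SELECT yields a table of variable bindings that can be written as CSV, whereas a CONSTRUCT yields an RDF graph with no columns to serialize.

```python
import csv
import io

# Toy dataset standing in for the endpoint's contents (made-up triples).
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]


def select_all(triples, limit=100):
    """Stand-in for SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT n: rows of bindings."""
    return [{"s": s, "p": p, "o": o} for s, p, o in triples[:limit]]


def construct_all(triples, limit=100):
    """Stand-in for CONSTRUCT WHERE { ?s ?p ?o } LIMIT n: a graph (set of triples)."""
    return set(triples[:limit])


def bindings_to_csv(rows):
    """Serialize SELECT-style bindings as CSV, one column per variable."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["s", "p", "o"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = select_all(TRIPLES)        # tabular: fits a CSV results format
graph = construct_all(TRIPLES)    # a graph: there are no variables to use as columns
csv_text = bindings_to_csv(rows)
```

Only `rows` has the variable-per-column shape that a CSV results format describes; `graph` would need an RDF serialization and, for RML, a reference formulation of its own.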

@DylanVanAssche
Collaborator

Yes! We (me and @pmaria) spotted this a few ago.

In this example, we should have used SELECT indeed.
However, we still need to figure out how to deal with CONSTRUCT.
Suggestions & Pull Requests are welcome!

DylanVanAssche added the bug and help wanted labels Mar 11, 2024
DylanVanAssche added a commit that referenced this issue Mar 11, 2024
CONSTRUCT cannot have SPARQL CSV Results, fix that.
Contributes to #42
@DylanVanAssche
Collaborator

Fixed the CONSTRUCT example in main.

DylanVanAssche changed the title from "Format issue in CONSTRUCT Example" to "SPARQL CONSTRUCT support" Mar 11, 2024
@DylanVanAssche
Collaborator

I updated the title to better reflect the important issue here: SPARQL CONSTRUCT support.

Any suggestions on how to add that would be highly appreciated!

CC: @pmaria

@dachafra
Member

Do we really need to accept CONSTRUCT queries? Are we accepting ASK and DESCRIBE as well?

CONSTRUCT already returns RDF, so we would need to define a reference formulation for RDF. Do we want to open that box? What is the use case or necessity?

@andimou

andimou commented Mar 17, 2024

If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.

Why is that?

Could we not define the details of a reference formulation where we stipulate that e.g. only SELECT and ASK queries are allowed?

@andimou

andimou commented Mar 19, 2024

The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.

In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all the data from the SPARQL endpoint, we may define a query, but in that case the query can be any SPARQL query. An implementation may state that it supports only SELECT and ASK queries, but that is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query; it would not indicate what the results of the query would be or the format of the data in the Logical Source.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.
>
> In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all the data from the SPARQL endpoint, we may define a query, but in that case the query can be any SPARQL query. An implementation may state that it supports only SELECT and ASK queries, but that is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query; it would not indicate what the results of the query would be or the format of the data in the Logical Source.

I respectfully disagree. Any reference formulation should define:

  1. How to formulate a logical iteration on a source
  2. How to resolve a reference expression on a logical iteration
  3. If possible, how to naturally map values from the logical iteration to RDF values.

If, for SPARQL, we can do this all in 1 reference formulation, that's great. But, currently it is quite unclear how that would work.
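
To make the three responsibilities concrete, here is a rough sketch in Python. The `ReferenceFormulation` interface and the `CsvRows` class are hypothetical names invented for illustration; nothing here is defined by RML or by an existing library. The CSV case is used because its logical iteration (one row) and its reference expressions (column names) are the least contentious.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Optional


class ReferenceFormulation(ABC):
    """Hypothetical interface; not part of RML or any existing library."""

    @abstractmethod
    def iterations(self, source: Any, iterator: Optional[str] = None) -> Iterable[Any]:
        """(1) Form the logical iterations on a source."""

    @abstractmethod
    def resolve(self, iteration: Any, reference: str) -> List[Any]:
        """(2) Resolve a reference expression on one logical iteration."""

    def to_rdf_value(self, value: Any) -> Any:
        """(3) Optional natural mapping to an RDF value; identity by default."""
        return value


class CsvRows(ReferenceFormulation):
    """Toy CSV formulation: no iterator needed, each row is one iteration."""

    def iterations(self, source, iterator=None):
        # The source is CSV text; the header row provides the reference names.
        return csv.DictReader(io.StringIO(source))

    def resolve(self, iteration, reference):
        # A reference is a column name; an unknown column resolves to nothing.
        return [iteration[reference]] if reference in iteration else []


formulation = CsvRows()
names = [
    formulation.resolve(row, "name")
    for row in formulation.iterations("id,name\n1,Alice\n2,Bob\n")
]
```

For SPARQL, the open question in this issue is what `iterations` and `resolve` would mean when the iterator is a CONSTRUCT query rather than a SELECT.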

@andimou

andimou commented Mar 19, 2024

I respectfully agree with everything you say, but none of it concerns the SPARQL query; it concerns what comes after the SPARQL query. Whether or not you have a SPARQL query, what you describe should be defined in a reference formulation; we do not disagree on this. But how we fetch the data from a data source is independent of the reference formulation. Whether or not one used a SPARQL query to retrieve a set of RDF triples is independent of how one refers to these triples. If one used a SELECT query to retrieve some CSV results, then the reference formulation refers to these CSV results and not to the SPARQL query.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> [ ... ] But how we fetch the data from a data source is independent of the reference formulation.

This is where we are disagreeing. How we are fetching data from a source (the rml:iterator) is most definitely part of the definition of the reference formulation and not independent of it.

@andimou

andimou commented Mar 19, 2024

The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

How an iteration is computed is indeed part of the reference formulation, but that is something different than how the data is retrieved (the SPARQL query in this case).

What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.

@pmaria
Collaborator

pmaria commented Mar 19, 2024

> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

I must still disagree with this statement.

Take these examples:

    rml:referenceFormulation rml:XPath;
    rml:iterator "/xpath/iterator/expression";

    rml:referenceFormulation rml:JSONPath;
    rml:iterator "$.jsonpath.expression";

    rml:iterator "SELECT * FROM student;";
    rml:referenceFormulation rml:SQL2008Query;

    rml:iterator "SELECT ?s ?p ?o WHERE { ?s ?p ?o. } LIMIT 100";
    rml:referenceFormulation formats:SPARQL_Results_CSV;

All these iterator expressions are specifying what data is to be considered part of the iteration. This includes how we fetch the data.
How the results are formed into a logical iteration is not part of the iterator expression, but part of the (currently implicit) rules of the reference formulation. Of course, the iterator must provide an iterable result in accordance with the rules of the reference formulation for a logical iteration to be formed.

There are also logical sources where an iterator is not necessary, since there is a natural way to form a logical iteration on those sources, like with CSV. But again, here, how we iterate is determined by the rules of the reference formulation.

This discussion is an example of why we need more clarity on the definition of the reference formulations.
For example, how are logical iterations formed on a JSONPath expression response? This is currently not specified anywhere.

The same question can be asked for SPARQL CONSTRUCT queries, to bring it back to this issue.

> What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.

I do not believe that we can say that an iteration always returns key-value pairs. This is very much dependent on the reference formulation and how reference expressions are to be evaluated against the logical iterations in that reference formulation.

I would say an iteration returns a logical iteration where each iteration is a sub-part of the source on which reference expressions can be evaluated to return values from the source data. How this works exactly should also be defined in the reference formulation.

For example, in the rml:JSONPath reference formulation both the iterator and the reference expressions use the JSONPath standard query language. Thus, the logical iteration consists of sub-documents of the JSON source on which subsequent reference expressions in the JSONPath language can be evaluated.
However, the rml:SQL2008Query reference formulation uses SQL for the iterator and column names as reference expressions. This could indeed be defined as key-value pairs, but that is probably too simplistic in the case of RDBs.

The point is that every reference formulation needs to define these specific aspects.

@tirrolo
Author

tirrolo commented Mar 19, 2024

EDIT: Pano already provided a more detailed answer.

> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

At the moment it looks a bit mixed to me. Consider the following example from the IO spec:

<#RDB> a rml:LogicalSource;
    rml:source [ a rml:Source, d2rq:Database;
        d2rq:jdbcDSN "jdbc:mysql://localhost/example";
        d2rq:jdbcDriver "com.mysql.jdbc.Driver";
        d2rq:username "user";
        d2rq:password "password";
    ];
    rml:iterator "SELECT * FROM student;";
    rml:referenceFormulation rml:SQL2008Query;
.

To me, here the iterator is stating how the data is actually fetched. Maybe there is something I misunderstood?

@andimou

andimou commented Mar 19, 2024

This use of rml:iterator is new to me. When was that agreed?

@pmaria
Collaborator

pmaria commented Mar 19, 2024

Ah, I think you are referring to rml:query which was used on some rml:Source descriptions.
That was this issue: #28

@andimou

andimou commented Mar 19, 2024

OK, I missed this. Is it too late for me to disagree with this use of rml:iterator?!

The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it is a pattern that repeats in the data. What pattern can you have if the iterator is a query?

@DylanVanAssche
Collaborator

> The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it is a pattern that repeats in the data.

The way an iterator is currently used is as follows:

  • The reference formulation explains how to reference the data.
  • The iterator contains an expression, in a language suitable for the data source, to iterate over entries in the data source. The reference formulation specifies which language the iterator uses.

This was discussed during a W3C CG meeting and was accepted.

> What pattern can you have if the iterator is a query?

  • For JSONPath/XPath, the iterator has a JSONPath/XPath expression that 'creates' iterations over the document. Each iteration is a JSONPath/XPath result.
  • For SQL (a query; a table name is a shortcut), the iterator has a SQL query expression that 'creates' iterations over the database. Each iteration is a SQL query result.
  • For CSV, the iterator is not present as it has a default row-based iterator that does the same: it 'creates' iterations over the document as CSV rows. Each iteration is a CSV row.
  • For SPARQL, a SPARQL query is the iterator that 'creates' iterations over the triples. Each iteration is a SPARQL query result.
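
The four cases above can be sketched side by side in Python using only the standard library. This is an illustration under assumptions, not how any RML processor is implemented: plain indexing stands in for a JSONPath expression, an in-memory SQLite table stands in for the database, and the SPARQL solutions are hard-coded instead of coming from an endpoint. All data values are made up.

```python
import csv
import io
import json
import sqlite3

# JSON: the iterator selects sub-documents. Plain indexing stands in for the
# JSONPath expression "$.students[*]" (no JSONPath library is used here).
doc = json.loads('{"students": [{"name": "Alice"}, {"name": "Bob"}]}')
json_iterations = list(doc["students"])

# CSV: no iterator is given; the default is one iteration per row.
csv_iterations = list(csv.DictReader(io.StringIO("name\nAlice\nBob\n")))

# SQL: the iterator is a query; each result row is one iteration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE student (name TEXT)")
conn.executemany("INSERT INTO student VALUES (?)", [("Alice",), ("Bob",)])
sql_iterations = [dict(r) for r in conn.execute("SELECT * FROM student ORDER BY name")]
conn.close()

# SPARQL SELECT: each solution (row of bindings) is one iteration.
# Hard-coded here instead of querying an endpoint.
sparql_iterations = [{"name": "Alice"}, {"name": "Bob"}]

# In every case a reference ("name") is then resolved against one iteration.
all_names = [
    [it["name"] for it in iterations]
    for iterations in (json_iterations, csv_iterations, sql_iterations, sparql_iterations)
]
```

In all four cases the iterations happen to look like key-value records, but that is a property of this toy data; a JSON or XML iteration is in general a sub-document that further path expressions are evaluated against.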

@pmaria
Collaborator

pmaria commented Mar 20, 2024

You could argue that the iterator was always a query.

For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.

It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.

@tirrolo
Author

tirrolo commented Mar 20, 2024

> You could argue that the iterator was always a query.
>
> For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.
>
> It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.

Which brings us back to the original issue. It is unclear to me how an "iteration" would be performed for CONSTRUCT queries. We will probably have to define that at some point, though. At the moment, we are in a rather odd situation where we accept JSON, CSV, and XML files as input but not, for instance, Turtle files. Can we really justify that? At first sight it looks a bit arbitrary, and I see this as related to the CONSTRUCT issue.

Maybe we could argue that for Turtle and other formats we do not really have a reference formulation, that is, a query language that we can use to access file elements. But I am unsure.

@andimou

andimou commented Mar 20, 2024

> You could argue that the iterator was always a query.

This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases, and that is why the iterator may be a query.

However, an SQL query used as an iterator returns a table, and that is not an iteration pattern. Or rather, it is, but it has only one iteration, the complete table, because the table does not repeat within itself.

> It is unclear to me how an "iteration" would be performed for CONSTRUCT queries.

This is correct. I think when we proposed RDF as input for the first time, we considered the iteration pattern to be every triple. That was never mentioned anywhere nor specified. If we accept RDF as potential input, then we need a reference formulation to refer to the RDF triples/quads, and that reference formulation would also give us the iteration pattern.
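
As a purely illustrative sketch of the "every triple is one iteration" idea, in Python: the function `triple_iterations` and the reference names `subject`, `predicate`, and `object` are assumptions made up for this example; as noted, no such reference formulation has been specified. The triples are toy data standing in for an RDF source or a CONSTRUCT result.

```python
# Made-up triples standing in for an RDF source or a CONSTRUCT result.
graph = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
]


def triple_iterations(triples):
    """One logical iteration per triple, exposed as three named references."""
    for s, p, o in triples:
        yield {"subject": s, "predicate": p, "object": o}


iterations = list(triple_iterations(graph))

# A reference expression would then be one of the three (hypothetical) names.
objects = [iteration["object"] for iteration in iterations]
```

A per-triple pattern is only one option; a reference formulation for RDF could just as well iterate per subject or per query solution, which is exactly what would need to be defined.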

@pmaria
Collaborator

pmaria commented Mar 20, 2024

> You could argue that the iterator was always a query.

> This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases, and that is why the iterator may be a query.
>
> However, an SQL query used as an iterator returns a table, and that is not an iteration pattern. Or rather, it is, but it has only one iteration, the complete table, because the table does not repeat within itself.

It may not have been intended that way, but I believe the JSONPath and XPath reference formulations were the only defined reference formulations for which the iterator was actually specified in the mappings in earlier versions of RML.
In those versions no iterator would have been specified for any type of SQL logical source.
So in practice the iterator was always an expressed query for these formulations. Hopefully this clarifies my point.

Another way to look at it: we could replace rml:iterator with rml:query and the mappings would still make sense. Maybe even more so.


5 participants