Materialisation does not necessarily select the latest version if multiple versions exist within the same page #701

Closed
StijnDenisSirus opened this issue Oct 21, 2024 · 3 comments

@StijnDenisSirus

Describe the bug

When running an Ldio:LdesClient (ldes/ldi-orchestrator:2.9.0-SNAPSHOT) with materialisation enabled (all other materialisation properties set to default), the state object that is returned does not necessarily reflect the 'latest' version if multiple versions exist within the same page.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://ca-westtoer-ldes.bluesea-b3dcdb70.westeurope.azurecontainerapps.io/touristattractions/latestView?pageNumber=452
  2. We are interested in the object whose URI is: https://westtoer.be/id/productlist/b4215ccb-a14e-45b0-9956-9449607412fa
  3. The latest version of this object is: https://westtoer.be/id/productlist/b4215ccb-a14e-45b0-9956-9449607412fa/2024-10-17T10:44:13.6224407Z
  4. The state object returned by the client is based on https://westtoer.be/id/productlist/b4215ccb-a14e-45b0-9956-9449607412fa/2024-10-16T09:47:34.6835981Z . It is potentially not a coincidence that this is the version that is listed first on the page.
@StijnDenisSirus added the "needs triage" label (Issue needs to be evaluated by team) Oct 21, 2024
@github-project-automation github-project-automation bot moved this to 📋 Backlog in VSDS Backlog Oct 21, 2024
@jobulcke
Collaborator

Hi @StijnDenisSirus
I have tried to reproduce this issue with the following pipeline, based on the description above:

name: client-pipeline
input:
  name: Ldio:LdesClient
  config:
    materialisation.enabled: true
    urls: https://ca-westtoer-ldes.bluesea-b3dcdb70.westeurope.azurecontainerapps.io/touristattractions/latestView?pageNumber=452
outputs:
  - name: Ldio:ConsoleOut

With this pipeline, two state objects pass through the pipeline:

  1. At line 1305 of the provided logs
<https://westtoer.be/id/productlist/b4215ccb-a14e-45b0-9956-9449607412fa>
        rdf:type                      dcmitype:Collection;
        prov:generatedAtTime          "2024-10-16T16:27:32.8812519Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
        generiek:lokaleIdentificator  "b4215ccb-a14e-45b0-9956-9449607412fa"^^<http://www.w3.org/2000/01/rdf-schema#string>;
        generiek:naamruimte           "https://westtoer.be/id/productlist"^^<http://www.w3.org/2000/01/rdf-schema#string>;
        generiek:versieIdentificator  "2024-10-16T16:27:32.8812519Z";
        <https://schema.org/description>
                "Nieuwe testlijst"@nl;
        <https://schema.org/name>     "Testlijst Stijn 15/10"@nl;
        ns:uitsluitenVanPublicatie    false .
  2. At line 3657 of the provided logs
<https://westtoer.be/id/productlist/b4215ccb-a14e-45b0-9956-9449607412fa>
        rdf:type                      dcmitype:Collection;
        prov:generatedAtTime          "2024-10-17T10:44:13.6224407Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
        generiek:lokaleIdentificator  "b4215ccb-a14e-45b0-9956-9449607412fa"^^<http://www.w3.org/2000/01/rdf-schema#string>;
        generiek:naamruimte           "https://westtoer.be/id/productlist"^^<http://www.w3.org/2000/01/rdf-schema#string>;
        generiek:versieIdentificator  "2024-10-17T10:44:13.6224407Z";
        <https://schema.org/description>
                "Nieuwe testlijst"@nl;
        <https://schema.org/name>     "Testlijst Stijn 15/10"@nl;
        ns:uitsluitenVanPublicatie    false .

@StijnDenisSirus
Author

StijnDenisSirus commented Nov 7, 2024

Hi @jobulcke

Thank you for looking into this. When using the simple client-pipeline you provided, I obtained the same results, with those exact state objects passing through the pipeline in the correct order (see lines 1384 and 3733 in the attached logs).

orchestrator:
  pipelines:
    - name: client-pipeline
      input:
        name: Ldio:LdesClient
        config:
          materialisation.enabled: true
          urls: https://ca-westtoer-ldes.bluesea-b3dcdb70.westeurope.azurecontainerapps.io/touristattractions/latestView?pageNumber=452
      outputs:
        - name: Ldio:ConsoleOut

When looking at the differences between this simple client-pipeline and the actual pipeline we use in our application, my colleague and I discovered after some testing that the issue appears to be related to the keep-state property; more specifically the sqlite persistence strategy. When using the following client, I can replicate the original issue where multiple state objects pass through the pipeline and the last one shown in the logs has a generatedAtTime of "2024-10-16T09:47:34.6835981Z" - the version that is listed first on the page (See lines 1456, 1512, 3985, 7431, 7557 and finally 8770 in the attached logs for a keep-state sqlite run).

orchestrator:
  pipelines:
    - name: client-pipeline
      input:
        name: Ldio:LdesClient
        config:
          materialisation.enabled: true
          urls: https://ca-westtoer-ldes.bluesea-b3dcdb70.westeurope.azurecontainerapps.io/touristattractions/latestView?pageNumber=452
          keep-state: true
          state: sqlite
      outputs:
        - name: Ldio:ConsoleOut

@jobulcke jobulcke self-assigned this Nov 20, 2024
@jobulcke
Collaborator

Hi @StijnDenisSirus

tl;dr
Because sqlite has a long processing time, we recommend either adding the timestamp path of the LDES to the ldes-client properties, so that the members are always ordered before being processed, or upgrading to ldes/ldi-orchestrator:2.10.0-SNAPSHOT, where the ordering happens automatically.

Long answer
I was indeed able to reproduce this issue with the provided pipeline on the ldes/ldi-orchestrator:2.9.0-SNAPSHOT image. The issue probably occurred because of a timing problem combined with sqlite's slow processing time: something like the latest state is still being written to the db while an earlier state that appears later on the page is already being processed, and that process cannot yet read the latest state from the db.

In this version and earlier ones, it is possible to add a timestamp path to the ldes-client config; when that property is provided, the members are ordered by that timestamp. If it is left blank, the latest-state-filter uses the default value (http://www.w3.org/ns/prov#generatedAtTime), which is the timestamp path your LDES uses and is why the filter works. Adding this property to the pipeline will resolve the issue, because the latest state on the same page will then always be processed last.
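
As a rough sketch, the workaround for 2.9.0 could look like the pipeline below. It reuses the keep-state sqlite configuration from earlier in this thread and adds the timestamp path explicitly. The property name `timestamp-path` is an assumption based on the description above and is not confirmed in this thread, so please check the LdesClient documentation for the exact name in your version.

```yaml
orchestrator:
  pipelines:
    - name: client-pipeline
      input:
        name: Ldio:LdesClient
        config:
          materialisation.enabled: true
          urls: https://ca-westtoer-ldes.bluesea-b3dcdb70.westeurope.azurecontainerapps.io/touristattractions/latestView?pageNumber=452
          keep-state: true
          state: sqlite
          # Assumed property name: verify against the LdesClient docs for your version.
          # Setting it explicitly should make the client order members by this
          # timestamp before processing them.
          timestamp-path: http://www.w3.org/ns/prov#generatedAtTime
      outputs:
        - name: Ldio:ConsoleOut
```

The value is the same predicate the latest-state-filter already uses by default, so the only intended change is that members on the same page are ordered before they reach the filter.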

However, from version 2.10.0 onwards, the client fetches the timestamp-path and the version-of-path from the event stream by itself. As a result, the members are always ordered before being processed, which fixes this issue automatically: here too, the latest state on the same page will always be processed last.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to 👀 In review in VSDS Backlog Nov 20, 2024