Improved implementation of PrefixMapStd #1475
This PR is ready for review. The small extra changes are: Is there a place where the utility methods
I converted this back to draft because I have some pending updates that I first need to benchmark against the parts of the original PrefixMapStd that already work well (IRIs with / or #), to determine whether including them makes sense performance-wise. The general idea is a PrefixMap implementation that auto-adapts to a given workload: for example, parsing lots of queries without the overhead of updating reverse-lookup structures, and updating those structures on demand when writing out RDF. In essence, the direction I am investigating is to build and update the reverse-lookup (IRI-to-prefix) structures lazily, upon abbreviating. Prefix-to-IRI insertions and deletions would be buffered, similar to your BufferingPrefixMap; upon abbreviate, only the delta is materialized into the reverse-lookup structures.
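To make the lazy idea concrete, here is a minimal sketch of buffering prefix changes and materializing only the delta on abbreviate. It is not the code in this PR: the class `LazyReversePrefixMap` and all of its members are hypothetical names, and it assumes a simple split at the last "#" or "/" for the reverse lookup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Jena implementation.
class LazyReversePrefixMap {
    // Forward map: always up to date, cheap to modify while parsing.
    private final Map<String, String> prefixToIri = new HashMap<>();
    // What the reverse index currently reflects for each prefix.
    private final Map<String, String> materialized = new HashMap<>();
    // Reverse index: only brought up to date when abbreviate() is called.
    private final Map<String, String> iriToPrefix = new HashMap<>();
    // Prefixes changed since the reverse index was last updated (the delta).
    private final Set<String> dirty = new HashSet<>();

    void add(String prefix, String iri) {
        prefixToIri.put(prefix, iri);
        dirty.add(prefix);
    }

    void delete(String prefix) {
        if (prefixToIri.remove(prefix) != null) dirty.add(prefix);
    }

    String get(String prefix) { return prefixToIri.get(prefix); }

    int pendingCount() { return dirty.size(); }

    /** Returns "prefix:localName", or null if no prefix matches. */
    String abbreviate(String iri) {
        materializeDelta();
        // Assumed split rule for this sketch: last '#' or '/'.
        int split = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
        if (split < 0) return null;
        String prefix = iriToPrefix.get(iri.substring(0, split + 1));
        return prefix == null ? null : prefix + ":" + iri.substring(split + 1);
    }

    // Apply only the buffered changes to the reverse index.
    // Simplification: two prefixes bound to the same IRI are not handled.
    private void materializeDelta() {
        for (String prefix : dirty) {
            String oldIri = materialized.remove(prefix);
            if (oldIri != null) iriToPrefix.remove(oldIri, prefix);
            String newIri = prefixToIri.get(prefix);
            if (newIri != null) {
                iriToPrefix.put(newIri, prefix);
                materialized.put(prefix, newIri);
            }
        }
        dirty.clear();
    }
}
```

With this shape, a workload that only calls `add` and `get` (such as query parsing) never touches the reverse index, while repeated rebinding of one prefix costs a single reverse-index update at the next `abbreviate`.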
Another choice is to restrict abbreviation to the basic case, where the prefix ends at the final "/", "#" or ":" (for URNs), and have only the "fast path" abbreviate. Do you have cases where the abbreviation split is not at one of these? If you are going for the complicated version, maybe the best way is to add a new PrefixMapCaching and leave PrefixMapStd as it is.
I think your suggestions of including ':' in the list and using only the fast path (without resorting to scanning) would work efficiently and be sufficient for the vast majority of use cases. Without scanning, even relative IRIs (which do not contain any of the fast-track characters) wouldn't cause problems.
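A fast-path-only abbreviate along these lines could look like the sketch below. It is an illustration under stated assumptions, not the PrefixMapStd code: `FastPathPrefixMap` and `splitPoint` are hypothetical names, and the precedence used here ("#" first, then "/", then ":") is one possible reading of "final" separator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the Jena implementation.
class FastPathPrefixMap {
    private final Map<String, String> prefixToIri = new HashMap<>();
    private final Map<String, String> iriToPrefix = new HashMap<>();

    void add(String prefix, String iri) {
        String old = prefixToIri.put(prefix, iri);
        // Drop the stale reverse entry if this prefix was rebound.
        if (old != null) iriToPrefix.remove(old, prefix);
        iriToPrefix.put(iri, prefix);
    }

    /**
     * Returns "prefix:localName", or null if the fast path finds no match.
     * No scanning fallback is attempted, and the local name is not
     * validated against any syntax rules.
     */
    String abbreviate(String iri) {
        int split = splitPoint(iri);
        if (split < 0) return null; // e.g. a relative IRI with no separator
        String prefix = iriToPrefix.get(iri.substring(0, split + 1));
        if (prefix == null) return null;
        return prefix + ":" + iri.substring(split + 1);
    }

    // Assumed precedence: last '#', else last '/', else last ':'.
    static int splitPoint(String iri) {
        int i = iri.lastIndexOf('#');
        if (i < 0) i = iri.lastIndexOf('/');
        if (i < 0) i = iri.lastIndexOf(':');
        return i;
    }
}
```

Each abbreviation costs a few `lastIndexOf` calls and one hash lookup, independent of the number of registered prefixes, and an IRI without any of the three separators simply returns null.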
Right now I only have some initial experiments in which I abuse the prefix map as a poor man's dictionary encoding, to reduce the number of bytes that need to be parsed to produce triples/quads. For this I am using trie-based lookups to encode the data (so IRIs can be split anywhere), but I have yet to evaluate whether this actually gives a noticeable performance boost.
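For illustration, a trie-based longest-match lookup that can split an IRI at any position might be sketched as follows. This is a plain character trie with hypothetical names (`PrefixTrie`, `Node`), not the trie used in the PR, and it does not check that the remaining local name is syntactically valid.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the Jena implementation.
class PrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String prefix; // non-null if a registered namespace IRI ends here
    }

    private final Node root = new Node();

    void add(String prefix, String iri) {
        Node n = root;
        for (int i = 0; i < iri.length(); i++) {
            n = n.children.computeIfAbsent(iri.charAt(i), c -> new Node());
        }
        n.prefix = prefix;
    }

    /** Abbreviates using the longest registered namespace, or returns null. */
    String abbreviate(String iri) {
        Node n = root;
        String best = null;
        int bestLen = 0;
        for (int i = 0; i < iri.length(); i++) {
            n = n.children.get(iri.charAt(i));
            if (n == null) break; // no registered namespace continues this way
            if (n.prefix != null) { // remember the longest match so far
                best = n.prefix;
                bestLen = i + 1;
            }
        }
        return best == null ? null : best + ":" + iri.substring(bestLen);
    }
}
```

The lookup walks at most one node per character of the IRI, so its cost grows with the IRI length rather than with the number of prefixes; whether that beats the hash-based fast path in practice is exactly what would need benchmarking.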
GitHub issue resolved #1474
Reimplementation of PrefixMapStd to combine "fast-track" with trie backing.
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.