Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird behavior for string operations with escaped characters #115

Open
nusakan opened this issue Aug 29, 2024 · 6 comments
Open

Weird behavior for string operations with escaped characters #115

nusakan opened this issue Aug 29, 2024 · 6 comments

Comments

@nusakan
Copy link

nusakan commented Aug 29, 2024

EYE v10.19.7 (2024-08-24)
SWI-Prolog version 9.0.4

I may be wrong but some eye reasoner behaviors may be incorrect.

Escaped characters issue

Some builtin string: <http://www.w3.org/2000/10/swap/string#> operations have unexpected behaviors when used with strings containing backslashes ("\\", "\t" ...) characters, especially when regular expressions are involved.

All provided examples are using the Notation3 syntax. (for which the single backslash "\" is not a valid syntax).

For example:

Expression Expected Obtained
"\n" string:length ?v ?v is 1 ?v is 2
("a\\b" "(\\\\)" "-") string:replace ?v ?v is "a-b" ?v is "a--b"
("a\\b" "(.*)") string:scrape "a\\b" Match No match
Other not tested string manipulation operations whatever Not confident...

A quick fix may be implemented for specific builtin core string functions.

However, it seems that the internal representation of literals relies on the N3 strings serialization syntax:
"\t" is represented as \ and t which produces these weird behaviors (even "segfault"s in some more complex cases).

So the quick fix may not be way to go: it introduces supplementary computational efforts, and only addresses partially the issue.

Thanks for your attention (and all the work you have already done).


Quick & dirty fix (v10.19.7 - for illustration purpose only).

  • Not for production use - it may introduce critical issues, and does not solve all the problems
  • Only for string builtin functions described here: https://w3c-cg.github.io/n3Builtins/ (string:scrape, string:replace, string:length)
  • Some statements wrongly match (see Sample test cases below )
  • the escape_atoms rule ensure the conversion
%
% New rule
%
escape_atoms(X, Y) :-
    when(
        (   ground(X)
        ),
        (   atom_codes(X,W),
            escape_codes(V,W),
            atom_codes(Y,V)
        )
    ),
    when(
        (   ground(Y)
        ),
        (   atom_codes(Y,V),
            escape_codes(V,W),
            atom_codes(X,W)
        )
    ).

%
% Modified rules
%
'<http://www.w3.org/2000/10/swap/string#length>'(literal(A, _), B) :-
    when(
        (   ground(A)
        ),
        (   escape_atoms(A,V),
            sub_atom(V, 0, B, 0, _)
        )
    ).

'<http://www.w3.org/2000/10/swap/string#replace>'([literal(X, _), literal(Search, _), literal(Replace, _)], literal(Y, type('<http://www.w3.org/2001/XMLSchema#string>'))) :-
    when(
        (   ground([X, Search, Replace])
        ),
        (   (   atomic_list_concat(['(', Search, ')'], Se),
                regex(Se, X, [St|_]),    % Convert back to unescaped 
                escape_atoms(S,St)   % strings before processing
            ->  (   sub_atom(Search, 0, 1, _, '^')
                ->  atom_concat(S, T, X),
                    atom_concat(Replace, T, Y)
                ;   (   sub_atom(Search, _, 1, 0, '$')
                    ->  atom_concat(T, S, X),
                        atom_concat(T, Replace, Y)
                    ;   atom_codes(X, XC),
                        string_codes(S, SC),
                        atom_codes(Replace, RC),
                        subst([[[0'$, 0'1], SC]], RC, TC),
                        subst([[SC, TC]], XC, YC),
                        atom_codes(Y, YC)
                    )
                )
            ;   Y = X
            )
        )
    ).

'<http://www.w3.org/2000/10/swap/string#scrape>'([literal(X, _), literal(Y, _)], literal(Z, type('<http://www.w3.org/2001/XMLSchema#string>'))) :-
    when(
        (   ground([X, Y])
        ),
        (   regex(Y, X, [W|_]),
            atom_string(V,W),
            escape_atoms(Z,V)
        )
    ).

Sample test cases

@prefix string: <http://www.w3.org/2000/10/swap/string#> .
@prefix : <http://example.org/test#>.

{
  "\t" string:length 1 .
  ("\\" "\t") string:concatenation "\\\t" .
  "\\\t" string:contains "\\\t" .
  "x\\A\tx" string:containsIgnoringCase "\\a\t" .
  "end\\" string:endsWith "\\" .
  "\\A\t" string:equalIgnoringCase "\\a\t" .
  ("[%s]" "\\\t\n") string:format "[\\\t\n]" .
  "\\\t1" string:greaterThan "\\\t0" .
  "\\\t0" string:lessThan "\\\t1" .
  "\\" string:matches "\\\\" .
  "\t" string:matches "\\t" .
  # "\t" string:matches "\t" .      # WARNING: wrongly resolved with no error
  "A\tB" string:notEqualIgnoringCase "a\\tB" .
  "\\\t" string:notGreaterThan "\\\t" .
  "\\\t" string:notLessThan "\\\t" .
  "\\\t" string:notMatches "\\n" .
  # "\\\t" string:notMatches "\t" . # WARNING: wrongly resolved with no error
  ("a\\b" "(\\\\)" "\t[$1]\t") string:replace "a\t[\\]\tb" .
  ("\\" "(.*)") string:scrape "\\" .
  ("\\" "(\\\\)") string:scrape "\\" .
  ("\t" "(.*)") string:scrape "\t" .
  ("\t" "(.*)")!string:scrape string:length 1 .
  "\\start" string:startsWith "\\" .
} => {
  :test :passed true .
} .
josd added a commit that referenced this issue Aug 30, 2024
josd added a commit to eyereasoner/lingua that referenced this issue Aug 30, 2024
@josd
Copy link
Collaborator

josd commented Aug 30, 2024

Thank you very much for your fix which is more or less done in the latest eye and lingua.

@nusakan
Copy link
Author

nusakan commented Sep 1, 2024

Thank you for your fast commit.

It seems however that there are still few failures. I've written a small test suite in Notation3 (extended to other string: rules), covering a larger scope (like false positives/negatives or errors).

I've uploaded here: string_test.n3.txt
(renamed to .txt as github does not recognize .n3 as text files)

You may run it with:

eye --quiet --nope string_test.n3

This test suite may not be quiet exact (I've not found an explicit documentation for the all the string: rules), you may review it before considering the test should pass or fail.

Here are the failing properties as exposed by the test suite alongside the involved tests (see the test suite for reference) (some core builtin rules are still there) with EYE v10.19.8 (2024-08-30) (master branch):

@prefix string: <http://www.w3.org/2000/10/swap/string#>.
@prefix : <http://example.com/string_test.n3#>.

string:greaterThan :fail (:t090 :t091 :t092 :t093 :t094 :t095).
string:lessThan :fail (:t110 :t111 :t112 :t113 :t114 :t115).
string:matches :fail (:t140).
string:notContainsRoughly :fail (:t150).
string:notGreaterThan :fail (:t173 :t174 :t175 :t170 :t171 :t172).
string:notLessThan :fail (:t183 :t184 :t185 :t180 :t181 :t182).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t211 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).
string:substring :fail (:t261 :t262).
string:upperCase :fail (:t270).
string:contains :fail (:t020).
string:containsIgnoringCase :fail (:t030).
string:containsRoughly :fail (:t040).
string:endsWith :fail (:t050).
string:format :error (:t081).

Best regards

josd added a commit that referenced this issue Sep 1, 2024
josd added a commit to eyereasoner/lingua that referenced this issue Sep 1, 2024
@josd
Copy link
Collaborator

josd commented Sep 1, 2024

Thank you for your fast test development👍
Your cases should also perfectly fit in https://github.com/eyereasoner/eye/blob/master/reasoning/bi/biP.n3#L111
Anyway, I first try to fix the issues and some 30 of them are now fixed in https://github.com/eyereasoner/eye/releases/tag/v10.19.9 and https://github.com/eyereasoner/lingua/releases/tag/v1.4.1

The remaining ones are now

string:matches :fail (:t140).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).
string:substring :fail (:t262).
string:format :error (:t081).
string:replaceAll :error (:t211 :t212).

josd added a commit that referenced this issue Sep 3, 2024
@josd
Copy link
Collaborator

josd commented Sep 3, 2024

We are now at

string:matches :fail (:t140).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).

@nusakan
Copy link
Author

nusakan commented Sep 4, 2024

It appears that raised errors are independent of tests success or failure.

Errors such as:

** ERROR ** eam ** error(syntax_error(missing closing parenthesis),_624)
[...]

.. are raised with the following passing tests:

:t203 :assertNone { ("a\tb" "\\" "-") string:replace "a-tb" } .
:t213 :assertNone { ("a\tb" ("\\") ("-")) string:replaceAll "a-tb" } .
:t223 :assertNone { ("a\tb" "(\\)" ) string:scrape ?t223v } .
:t234 :assertNone { ("a\tb\nc" "(\\)") string:scrapeAll ("t") } .

While the following failing tests do not raise any error:

:t215 :assertNone { ("\tb" ("\t" "b") ("x" "y")) string:replaceAll "xy" } .
:t232 :assertNone { ("a\tb\nc" "(\t|\n)") string:scrapeAll ("\t" "\n") } .
:t244 :assertNone { ("a\\\t" "(\\+)") string:search ?t244v } .

I see no problem for regular expressions expressed with wrong syntax to have errors turned into warning messages, provided it follows the SNAF rule, as it provides substantial information (even if hard to debug). However for consistency reasons, the same behavior should be adopted in any case.

@josd
Copy link
Collaborator

josd commented Sep 4, 2024

I guess those exceptions might have to do with SWI-Prolog/packages-pcre#24 so will stay tuned.
After a few tweaks we are now at

string:replaceAll :fail (:t211 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@josd @nusakan and others