Inconsistent handling of Unicode characters in String theory #412

Open
kfriedberger opened this issue Nov 24, 2024 · 2 comments

Comments

@kfriedberger
Member

Different solvers return different results when using Unicode characters in String theory.
This should be analyzed. Maybe we need to fix JavaSMT or report to the solvers' developers.

Details: #391 (comment)

@daniel-raffler
Contributor

daniel-raffler commented Nov 25, 2024

Thanks for opening the issue!

I've now added some tests:

  • When creating String constants in Z3 all Unicode characters must be (SMTLIB) escaped. The solver will also escape all its output.
  • Princess will accept any Java String and does not recognize SMTLIB escape sequences for Unicode characters. We added code in our backend to handle the conversion ourselves, but Strings from the model don't get converted back to SMTLIB format when they are printed.
  • CVC4 behaves similarly to Z3, but actually throws an exception when the input contains Unicode characters.
  • CVC5 is closer to Princess, even though it does not allow Unicode characters when creating String constants.

It's not too difficult to convert between UTF-16 and the SMTLIB escape format, but we'll have to decide which encoding we want to use internally. Specifically, should the Strings in StringFormula StringFormulaManager.makeString(String value) and String Evaluator.evaluate(StringFormula formula) only allow SMTLIB encoded Unicode characters, or do we want to support all Java Strings?
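To make the conversion concrete, here is a minimal sketch of the Java-to-SMTLIB direction. It is not the code from the JavaSMT backends: the class name `SmtlibEscapeSketch` and the method `escape` are hypothetical. It assumes the SMTLIB Strings theory's `\u{...}` escape with up to five hex digits and a maximum code point of 0x2FFFF, keeps printable ASCII as-is, and also escapes the backslash itself so that the output stays unambiguous (a design choice, not something the standard requires). Quoting of `"` inside SMTLIB string literals is out of scope here.

```java
final class SmtlibEscapeSketch {
  private SmtlibEscapeSketch() {}

  /** Hypothetical helper: converts an arbitrary Java String to SMTLIB-escaped form. */
  static String escape(String input) {
    StringBuilder out = new StringBuilder();
    // Walk over code points, not chars, so a surrogate pair yields a single escape.
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      if (cp > 0x2FFFF) {
        // Assumption: SMTLIB strings only cover code points up to 0x2FFFF.
        throw new IllegalArgumentException(
            String.format("Code point U+%X is outside the SMTLIB range", cp));
      }
      if (cp >= 0x20 && cp <= 0x7E && cp != '\\') {
        out.appendCodePoint(cp); // printable ASCII stays unchanged
      } else {
        // Everything else (and the backslash itself) becomes a braced hex escape.
        out.append("\\u{").append(Integer.toHexString(cp)).append('}');
      }
    }
    return out.toString();
  }
}
```

With this sketch, `escape("aä")` yields `a\u{e4}`, and a character outside the Basic Multilingual Plane such as U+1F600 becomes the single escape `\u{1f600}` rather than two surrogate escapes.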

@daniel-raffler
Contributor

I've added some more conversions and all solvers should now behave the same. The current format for makeString(...) and evaluate(StringFormula formula) is SMTLIB-escaped Strings, but we may still change this now. The argument for keeping it that way is that we want to stay as close to the SMTLIB standard as possible. This may help interoperability, for instance when someone reads in an SMTLIB script and then tries to recreate it with JavaSMT. In that case it might be confusing if escaped Unicode characters are not properly recognized.

On the other hand, JavaSMT is written in Java, and the type String in makeString(...) and evaluate(StringFormula formula) suggests that any Java String should be valid input. The conversion from Java String to the SMTLIB format (and back) is easy enough and can be handled by JavaSMT automatically, so there is no reason why we would be bound by the SMTLIB standard on this issue.

Either choice will break the API, although I'd argue that keeping SMTLIB as the format is more in line with how the functions have worked so far (we just didn't document it).

In either case I would also suggest we make escapeString(...) and unescapeString(...) available through the public API somehow. These methods will be needed often enough, and users shouldn't have to reimplement the rather error-prone conversion themselves. I've put both methods in FormulaCreator for now, but maybe they can be made available as default methods in the StringFormulaManager interface?
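As a rough illustration of the default-method idea, the sketch below uses a hypothetical interface `StringEscapingSketch` as a stand-in for StringFormulaManager; it is not the real JavaSMT interface and not the code currently in FormulaCreator. Only the unescaping direction is shown to keep it short; the escaping direction could be exposed the same way. It assumes the two escape forms of the SMTLIB Strings theory, `\u` followed by exactly four hex digits and `\u{...}` with one to five hex digits, and leaves anything above 0x2FFFF untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an interface such as StringFormulaManager. */
interface StringEscapingSketch {
  // Matches a backslash, then 'u', then either 1-5 hex digits in braces or exactly 4 hex digits.
  Pattern UNICODE_ESCAPE =
      Pattern.compile("\\\\u(?:\\{([0-9a-fA-F]{1,5})\\}|([0-9a-fA-F]{4}))");

  /** Replaces SMTLIB Unicode escapes with the characters they denote. */
  default String unescapeString(String smtlib) {
    Matcher m = UNICODE_ESCAPE.matcher(smtlib);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(smtlib, last, m.start()); // copy the text before the escape
      String hex = m.group(1) != null ? m.group(1) : m.group(2);
      int cp = Integer.parseInt(hex, 16);
      if (cp > 0x2FFFF) {
        out.append(m.group()); // assumption: out-of-range escapes are kept literally
      } else {
        out.appendCodePoint(cp);
      }
      last = m.end();
    }
    out.append(smtlib, last, smtlib.length()); // copy the remaining tail
    return out.toString();
  }
}
```

Because the method has a default implementation, every implementing class would get it without changes, and users could call it directly on the manager instance they already hold.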

@kfriedberger, @baierd:
What's your opinion on this?


2 participants