Inconsistent handling of Unicode characters in String theory #412

kfriedberger · 2024-11-24T13:43:00Z

Different solvers return different results when using Unicode characters in String theory.
This should be analyzed. Maybe we need to fix JavaSMT or report to the solvers' developers.

Details: #391 (comment)

daniel-raffler · 2024-11-25T13:35:43Z

Thanks for opening the issue!

I've now added some tests:

When creating String constants in Z3 all Unicode characters must be (SMTLIB) escaped. The solver will also escape all its output.
Princess will accept any Java String and does not recognize SMTLIB escape sequences for Unicode characters. We added code in our backend to handle the conversion ourselves, but Strings from the model don't get converted back to SMTLIB format when they are printed.
CVC4 behaves similarly to Z3, but actually throws an exception when the input contains Unicode characters.
CVC5 is closer to Princess, even though it does not allow Unicode characters when creating String constants.

It's not too difficult to convert between UTF-16 and the SMTLIB escape format, but we'll have to decide which encoding we want to use internally. Specifically, should the Strings in StringFormula StringFormulaManager.makeString(String value) and String Evaluator.evaluate(StringFormula formula) only allow SMTLIB encoded Unicode characters, or do we want to support all Java Strings?
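To make the conversion concrete, here is a minimal sketch of both directions. The class `SmtlibUnicode` and its methods `escape` and `unescape` are hypothetical names chosen for illustration, not the code that was added to JavaSMT. The sketch assumes the escape forms of the SMT-LIB Unicode Strings theory (`\u{d}` with one to five hex digits, or `\ud₃d₂d₁d₀`) and a code point range of 0 to 0x2FFFF. It also escapes the backslash itself, so that a Java String that merely looks like an escape sequence survives a round trip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JavaSMT: Java String to SMTLIB escapes and back. */
final class SmtlibUnicode {
    // Brace form (1 to 5 hex digits, at most 2FFFF) or exactly four hex digits without braces.
    private static final Pattern ESCAPE = Pattern.compile(
        "\\\\u\\{([0-9a-fA-F]{1,4}|[0-2][0-9a-fA-F]{4})\\}|\\\\u([0-9a-fA-F]{4})");

    private SmtlibUnicode() {}

    /** Escapes the backslash and every code point outside printable ASCII (0x20 to 0x7E). */
    public static String escape(String javaString) {
        StringBuilder out = new StringBuilder();
        javaString.codePoints().forEach(cp -> {
            if (cp > 0x2FFFF) {
                // Assumption: SMTLIB characters only cover code points up to 0x2FFFF.
                throw new IllegalArgumentException(
                    "Code point out of range: 0x" + Integer.toHexString(cp));
            }
            if (cp < 0x20 || cp > 0x7E || cp == '\\') {
                out.append("\\u{").append(Integer.toHexString(cp)).append('}');
            } else {
                out.append((char) cp);
            }
        });
        // Doubling of quote characters in SMTLIB literals is a separate step, not handled here.
        return out.toString();
    }

    /** Replaces every recognized escape sequence by the character it denotes. */
    public static String unescape(String smtlibString) {
        Matcher m = ESCAPE.matcher(smtlibString);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(smtlibString, last, m.start());
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            out.appendCodePoint(Integer.parseInt(hex, 16));
            last = m.end();
        }
        out.append(smtlibString, last, smtlibString.length());
        return out.toString();
    }
}
```

With helpers like these, either choice of internal encoding comes down to where the calls are placed: inside the backend if the API accepts arbitrary Java Strings, or in user code if the API expects SMTLIB-escaped input.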

daniel-raffler · 2024-11-26T10:04:00Z

I've added some more conversions and all solvers should now behave the same. The current format for makeString(...) and evaluate(StringFormula formula) is SMTLIB-escaped Strings, but we can still change this. The argument for keeping it that way is that we want to stay as close to the SMTLIB standard as possible. This may help interoperability, for instance when someone reads in an SMTLIB script and then tries to recreate it with JavaSMT. In that case it might be confusing if escaped Unicode characters are not properly recognized.

On the other hand, JavaSMT is written in Java, and the type String in makeString(...) and evaluate(StringFormula formula) suggests that any Java String should be valid input. The conversion from Java String to the SMTLIB format (and back) is easy enough and can be handled by JavaSMT automatically, so there is no reason why we would be bound by the SMTLIB standard on this issue.

Either choice will break the API, although I'd argue that keeping SMTLIB as the format is more in line with how the functions have worked so far (we just didn't document it).

In either case I would also suggest we make escapeString(...) and unescapeString(...) available through the public API somehow. These methods will be needed often enough, and users shouldn't have to reimplement the rather error-prone conversion themselves. I've put both methods in FormulaCreator for now, but maybe they can be made available as default methods in the StringFormulaManager interface?
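As a rough illustration of the default-method idea, the sketch below uses a hypothetical interface `StringManagerSketch` in place of StringFormulaManager and models formulas as plain Strings so that it compiles without JavaSMT. The method bodies are an assumption about what the conversion could look like, not the code currently in FormulaCreator, and range checks are left out for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the StringFormulaManager interface. Formulas are modeled
// as plain Strings so that the sketch compiles without JavaSMT on the classpath.
interface StringManagerSketch {
    // Brace form with up to five hex digits, or exactly four hex digits without braces.
    // Interface fields are implicitly public static final.
    Pattern UNICODE_ESCAPE =
        Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}|\\\\u([0-9a-fA-F]{4})");

    // The only method a backend has to implement; expects an SMTLIB-escaped value.
    String makeString(String smtlibValue);

    // Java String to SMTLIB format: escape the backslash and all code points
    // outside printable ASCII.
    default String escapeString(String javaValue) {
        StringBuilder out = new StringBuilder();
        javaValue.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E || cp == '\\') {
                out.append("\\u{").append(Integer.toHexString(cp)).append('}');
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }

    // SMTLIB format to Java String: replace each escape by the character it denotes.
    default String unescapeString(String smtlibValue) {
        Matcher m = UNICODE_ESCAPE.matcher(smtlibValue);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            out.append(smtlibValue, last, m.start()).appendCodePoint(Integer.parseInt(hex, 16));
            last = m.end();
        }
        return out.append(smtlibValue, last, smtlibValue.length()).toString();
    }
}
```

With this shape every backend inherits the helpers, and a caller holding a manager `mgr` could write `mgr.makeString(mgr.escapeString(input))` without touching FormulaCreator.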

@kfriedberger, @baierd:
What's your opinion on this?

kfriedberger mentioned this issue Nov 24, 2024

Add support for Strings and Rationals to the Princess backend #391

Merged
