Faulty unicode escape handling leads to tokenizing failure #99

Open
xmcp opened this issue Jun 6, 2021 · 1 comment
xmcp commented Jun 6, 2021

It seems that javalang replaces Unicode escapes with the raw characters they denote (as pointed out in issue #58) in the pre_tokenize method, before tokenizing.

I don't see why this replacement is necessary (the pre_tokenize method has been there since the initial commit), and it can lead to failures in rare cases.

Example:

>>> import javalang
>>> javalang.parse.parse(r'class Foo { String bar = "\u0022"; }')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python38\lib\site-packages\javalang\parse.py", line 52, in parse
    parser = Parser(tokens)
  File "C:\Program Files\Python38\lib\site-packages\javalang\parser.py", line 95, in __init__
    self.tokens = util.LookAheadListIterator(tokens)
  File "C:\Program Files\Python38\lib\site-packages\javalang\util.py", line 92, in __init__
    self.list = list(iterable)
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 535, in tokenize
    self.read_string()
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 201, in read_string
    self.error('Unterminated character/string literal')
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 576, in error
    raise error
javalang.tokenizer.LexerError: Unterminated character/string literal at """, line 1: class Foo { String bar = """;

PR #96 fixes this issue. Maybe we should merge it?

Owner

c2nes commented Jun 7, 2021

This behavior looks correct to me. After Unicode escape processing, the above program is,

class Foo { String bar = """; }

and the error you are receiving is consistent with the Java compiler,

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.10+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.10+9, mixed mode)

$ cat Foo.java
class Foo { String bar = "\u0022"; }

$ javac Foo.java
Foo.java:1: error: unclosed string literal
class Foo { String bar = "\u0022"; }
                                ^
Foo.java:1: error: reached end of file while parsing
class Foo { String bar = "\u0022"; }
                                    ^
2 errors

This behavior is dictated by the Java Language Specification. These two sections in particular,

The short version is that Unicode escapes are processed before any other tokenization or parsing is performed.
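To make that ordering concrete, here is a minimal Python sketch of the translation step as I read the JLS rule: a backslash preceded by an even number of contiguous backslashes, followed by one or more `u` and four hex digits, is replaced by the corresponding character. The function name `translate_unicode_escapes` is hypothetical and this is not javalang's actual pre_tokenize code; it also ignores finer points such as how a backslash produced by an escape interacts with what follows.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def translate_unicode_escapes(source):
    """Replace Java Unicode escapes with the characters they denote.

    Hypothetical helper for illustration; not part of javalang.
    """
    out = []
    i = 0
    n = len(source)
    backslashes = 0  # contiguous raw backslashes immediately before position i
    while i < n:
        ch = source[i]
        if ch != "\\":
            out.append(ch)
            backslashes = 0
            i += 1
            continue
        # A backslash can start an escape only if an even number of
        # backslashes precede it, so r'\\u0022' is left untouched.
        if backslashes % 2 == 0 and i + 1 < n and source[i + 1] == "u":
            j = i + 1
            while j < n and source[j] == "u":  # one or more 'u' markers
                j += 1
            digits = source[j:j + 4]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed Unicode escape at offset %d" % i)
            out.append(chr(int(digits, 16)))
            i = j + 4
            backslashes = 0
            continue
        out.append(ch)
        backslashes += 1
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate_unicode_escapes(r'class Foo { String bar = "\u0022"; }'))
```

Run on the example from this issue, the sketch yields the three-quote program shown earlier in this comment, which is what the tokenizer then sees.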

It would, however, make sense for javalang to preserve the original text for use when calculating positions and reporting errors.
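One possible way to do that, sketched in Python under my own assumptions: while translating escapes, record for every translated character the offset in the original source where it started, and compute line and column from the original text. The names `translate_with_offsets` and `original_position` are hypothetical and do not exist in javalang.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def translate_with_offsets(source):
    """Translate Unicode escapes and return (text, offsets).

    offsets[k] is the index in `source` where character k of `text` began.
    Hypothetical helper for illustration; not part of javalang.
    """
    out, offsets = [], []
    i, backslashes = 0, 0
    while i < len(source):
        start = i
        ch = source[i]
        i += 1
        if ch != "\\":
            backslashes = 0
        elif backslashes % 2 == 0 and source[i:i + 1] == "u":
            while source[i:i + 1] == "u":  # skip one or more 'u' markers
                i += 1
            digits = source[i:i + 4]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed Unicode escape at offset %d" % start)
            ch = chr(int(digits, 16))
            i += 4
            backslashes = 0
        else:
            backslashes += 1
        out.append(ch)
        offsets.append(start)
    return "".join(out), offsets


def original_position(source, offsets, index):
    """Map an index in the translated text to a 1-based (line, column)
    in the original, untranslated source."""
    start = offsets[index]
    line = source.count("\n", 0, start) + 1
    column = start - (source.rfind("\n", 0, start) + 1) + 1
    return line, column


if __name__ == "__main__":
    src = "class Foo {\n" + r'  String bar = "\u0022";' + "\n}"
    text, offsets = translate_with_offsets(src)
    # Position of the third quote, reported against the original text.
    print(original_position(src, offsets, text.index('"""') + 2))
```

With a map like this, the tokenizer could keep working on the translated text while error messages and token positions refer to what the user actually wrote.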
