Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work-around Tree-Sitter regex inability to match certain unicode chars. #354

Merged
merged 1 commit into from
Nov 10, 2024

Conversation

gundermanc
Copy link
Contributor

@gundermanc gundermanc commented Nov 1, 2024

I noticed that Tree-Sitter-C-Sharp is unable to match certain types of Unicode characters, like those prevalent in Chinese. In my testing, the C++ grammar doesn't seem to have this problem.

This PR ports the identifier regex used by C++ over to C#, dropping just the part of it that matches $, enabling correct parsing of identifiers that contain Chinese characters.

@gundermanc
Copy link
Contributor Author

I had to touch a few other places to get the linter to pass.

@gundermanc gundermanc marked this pull request as ready for review November 1, 2024 20:34
@amaanq
Copy link
Member

amaanq commented Nov 3, 2024

I had to touch a few other places to get the linter to pass.

No worries, that's on me for not updating it when I added the new eslint config

As for the actual change, I'm not 100% sure this is the correct pattern for valid C# identifiers, but it looks fine to me. I wonder if regenerating with the latest tree-sitter master without your change would actually fix your issue with Chinese unicode characters as identifiers, since that was a bug fixed recently with our unicode tables 🤔 do you mind trying that out or pasting the valid code that fails to parse so I can try it out?

@gundermanc gundermanc force-pushed the topic/unicode-identifiers branch from 637ada1 to 5f7a430 Compare November 5, 2024 04:45
@gundermanc
Copy link
Contributor Author

I'm not 100% sure this is the correct pattern for valid C# identifier

My understanding is that identifiers have an optional @ at the start (not reflected in the grammar before, but I added this revision), followed by one letter, and 0 or more alpha numeric characters. I'm not sure what this means in Unicode terms, but I've tried some examples of Chinese characters, and they seem to match consistent with how they compile with the .NET compiler and VS intellisense.

do you mind trying that out

It still fails with tip of main Tree-Sitter.

pasting the valid code that fails to parse so I can try it out?

Yep, this compiles in VS in C#.

            class Class狧麱狵錋狾龍龪啊阿埃挨哎唉0u㐀㐁㐂㐃㐄㐅㐆㐇6ⅶ0ǒoU1U2U38丂丄丅丆丏丒丟
            {
                void Test狧麱狵錋狾龍龪啊阿埃挨哎唉0u㐀㐁㐂㐃㐄㐅㐆㐇6ⅶ0ǒoU1U2U38丂丄丅丆丏丒丟()
                {
                    Console.WriteLine("Hello World!");
                }
            }

@gundermanc
Copy link
Contributor Author

Hey @amaanq, do you mind taking another look when you have a chance?

Thank you again for your time!

@amaanq amaanq merged commit bb22f5b into tree-sitter:master Nov 10, 2024
1 of 4 checks passed
@amaanq
Copy link
Member

amaanq commented Nov 10, 2024

thanks, sorry was on a trip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants