Exploring Performant PEG Parsing #2354
Hi, seems nice ;) I have a few comments below.

A. On the benchmark: in Brython, we can say we have 3 steps:

I assume the generated JS code should be the same with the different parser versions, so you should have the same execution speed for steps 2 & 3. However, it is important to compare the execution speed of step 1 with that of steps 2 & 3. Indeed, even if step 1 is 1,000x slower with your parser, we don't care if e.g. step 2 is 1,000,000x slower than step 1.

B. Suggestion: WASM? If we automatically generate the parser, maybe we can write the parser in e.g. C, then convert it into WASM?

C. Parsing: checking multiple rules at the same time? Parsers aren't really my area of expertise, but I am a little surprised we seem to check one rule after another.

Or even a binary (?) decision tree.
Thanks for the comments, Denis!
I somewhat agree here, but a metric like "total runtime" is sometimes less important than "time to start execution". Brython users writing web apps will want their files parsed and event handlers installed as soon as possible, so users aren't waiting on an unresponsive site. My app loads a ~2kloc file to run user-created animations. Right now it takes ~150ms to parse, so there's an almost imperceptible delay before an animation starts. But if that were a ~500ms delay instead, the experience would be worse. This program from the gallery is another good example. The Mandelbrot set takes a long time to draw. But it's nice that it doesn't take 2 seconds to start drawing.
I also find the idea of WASM exciting! But I don't think a C parser is the right next step. I think the biggest reason not to follow this approach is Pierre's lack of interest and experience in C programming. Since this is something Pierre would maintain and he's not comfortable with C or the WASM toolchain, I think a C parser is out. I'm also skeptical of how much of a speedup we'd even get from a different language, given how nicely optimizable the generated JavaScript could be.
I don't think I'm totally following here. But the algorithm doesn't check every rule against every token. It does have some
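For background (a toy sketch with made-up names, not Brython's actual code): PEG parsers typically memoize each rule's outcome at each position — "packrat" parsing — so even with heavy backtracking, no rule is evaluated twice at the same spot:

```js
// Packrat-style memoization: cache (rule, position) -> result, so a rule
// that failed at position i is never re-tried there during backtracking.
function memoized(ruleFn, name, memo) {
  return function (state) {
    const key = name + ':' + state.pos;
    if (key in memo) {
      state.pos = memo[key].endPos;   // restore the cached end position
      return memo[key].node;
    }
    const node = ruleFn(state);
    memo[key] = { node, endPos: state.pos };
    return node;
  };
}

// usage sketch (rawExprRule is hypothetical):
// const expr = memoized(rawExprRule, 'expr', {});
```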
A very well written and inspiring post, Austin, thanks! I didn't think of generating a parser with the CPython tools; if it's only about 2 times slower than the hand-made parser, it's probably worth the work needed to adapt these tools to Javascript. I will work on this in the coming days, unless you plan to do it yourself.
I was not really talking about the total execution time, but the relative time between the steps.
We could also say there is a "step 0", which is the execution of brython.js, required before any parsing.
150ms seems to be a lot for "only" 2kB. Maybe there could be a way to make brython work with …
Normally JS … After …
@PierreQuentel Thank you!
That would be great! My holiday break is over now, so I'm not available to pick up the task myself (though it does sound like a lot of fun!)
… (suggested by Austin Schick in issue #2354)
… PEG parser. Related to issue #2354
With the commits above, I could generate a Javascript PEG parser with the CPython toolchain, complete enough to successfully parse big Python files such as …. To test it, you have to …
After a lot of work (I am not very good at reading C code, which was necessary to understand the logic of the generated parser and generate equivalent Javascript code), I could finally parse many Python scripts and generate the correct AST (except some positional information in some places). The last step was to measure the time taken to parse a big module in the standard distribution, and I already spoiled the conclusion: the generated parser is awfully slow, much slower than the existing PEG parser, and consequently much, much slower than the current hand-made parser. There might be possible improvements to this code, but I doubt that we can go from 6 seconds to a few hundred milliseconds...
I tried to take a quick look at it. Did you test the execution speed with all the …? We'd need to do some profiling to see what takes a lot of time and see whether it can be optimized.
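For a first pass, even crude instrumentation can show where the time goes before reaching for a real profiler; for example (a sketch, with hypothetical names):

```js
const { performance } = require('perf_hooks');

// Wrap a suspect method to accumulate its call count and total time.
function instrument(obj, name) {
  const original = obj[name];
  let calls = 0, totalMs = 0;
  obj[name] = function (...args) {
    const t0 = performance.now();
    try { return original.apply(this, args); }
    finally { calls++; totalMs += performance.now() - t0; }
  };
  // returns a reporter to call after the parse run
  return () => console.log(`${name}: ${calls} calls, ${totalMs.toFixed(1)} ms`);
}
```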
Yes, I removed the debug traces, the rare … The Javascript parser code (…
I was wrong to doubt: using Chrome profiling tools I found a bug in a function, and fixing it dramatically reduced parsing time, to around 2 or 3 times that of the "legacy" Brython parser. The generated PEG parser is now a little faster than my previous PEG parser version.
Woohoo! That's great news. I think I'll have some time this weekend to run the new implementation on my benchmarks and look into possible optimizations. I'm looking forward to jumping in.
…mpare current Brython parser, first PEG parser implementation and new generated PEG parser. Related to issue #2354.
I have added 3 pages to test the different parser versions:
They all measure the time taken to parse the code of …
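The measurement itself presumably boils down to something like the following (`__BRYTHON__.py2js` and its signature are an assumption on my part, not verified against the actual pages):

```js
// Inside a test page, after brython.js has loaded:
// time one parse of a Python source string.
(async () => {
  const src = await (await fetch('big_module.py')).text(); // hypothetical file
  const t0 = performance.now();
  __BRYTHON__.py2js(src, 'big_module');                    // assumed API
  console.log(`parsed in ${(performance.now() - t0).toFixed(1)} ms`);
})();
```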
I opened a PR here #2358 with a few optimizations to the generated parser. After those changes, I'm measuring the parse times to be ~1.7x those of the legacy parser. I don't feel like I exhausted the optimization opportunities, so I bet we could get even faster with some more effort.
The version developed in branch … At the moment, the size of …
It looks like after merging, there are now quite a few regressions in …
@austin-schick You are right, I should have checked that... I think I have fixed most of the regressions.
I had not fixed the issues generated by the docstring tests, and there were quite a few... The Travis CI build now passes without errors. I will publish the first release with the new parser in a few days.
Woohoo! I see the release is now live. This is very exciting news. Thank you so much for the hard work you put into this, Pierre! It looks like getting the new parser working took some serious effort. And now, as far as I can tell, Brython is more compliant with CPython than ever. Maybe I'll throw a PEG Parser Party to celebrate :-) I think with that, it's time to close this issue. Merci encore, Pierre !
What was I exploring, and why?
Pierre has expressed interest in replacing the current Brython parser with a PEG parser generated from Python's official grammar. He's implemented a version of this -- a parser which takes a grammar as a JS object -- but it's significantly slower than the current parser.
Recently, he said:
I agree! I think an automatically generated parser has strong pros. Like Pierre said, it'd help with maintainability and compliance with CPython. I also think a generated parser would have fewer bugs and make the process of adding new syntax less error-prone. I think maintainability, correctness, and stability are all great goals.
So I wanted to explore these questions:

- How much slower is Brython's PEG parser, really?
- Why is the current PEG parser implementation slow?
- What is a reasonable path forward for a PEG parser?
I'm hoping my exploration gets somebody excited about continuing the great progress Pierre has already made in this arena.
How much slower is Brython's PEG parser, really?
I wrote a script (based on the existing `node_bridge.js` script) to test the parser's performance. It takes a Brython implementation as an argument and uses that implementation to parse most of the pyperformance benchmarks. This seemed like a nice choice because they're Python-official, varied, and meant to look like real world code (a sketch of such a harness follows below). The performance results were:
So a possible answer to this question is: Brython's PEG parser is 3-3.5x slower than its handmade parser. Noticeably slower, maybe slow enough to be an argument against adopting the current PEG parser.
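The harness amounts to something like the sketch below. The directory layout, the `py2js` entry point, and its signature are all assumptions for illustration, not the actual script:

```js
// Node harness sketch: parse every benchmark file with a given Brython
// build and report the total time. Paths and API names are hypothetical.
const fs = require('fs');
const path = require('path');

// e.g. node bench.js ./brython.js ./pyperformance
require(path.resolve(process.argv[2])); // assumed to define global __BRYTHON__
const benchDir = process.argv[3];

let totalMs = 0;
for (const file of fs.readdirSync(benchDir)) {
  if (!file.endsWith('.py')) continue;
  const src = fs.readFileSync(path.join(benchDir, file), 'utf8');
  const t0 = process.hrtime.bigint();
  __BRYTHON__.py2js(src, file);         // assumed parse entry point
  const t1 = process.hrtime.bigint();
  totalMs += Number(t1 - t0) / 1e6;
}
console.log(`total parse time: ${totalMs.toFixed(1)} ms`);
```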
Why is the current PEG parser implementation slow?
I ran my perf script with node's `--prof` flag and generated a profile from it. I noticed that a large chunk of time was spent in a `Proxy` used to lazily generate tokens. Replacing the `Proxy` with an equivalent function improved the performance quite a bit (a sketch of the change follows the results). New performance results on the pyperformance suite:
Not bad! Unfortunately there was no more low-hanging fruit after this point. I traced the execution of the parser on a simple example to make sure it wasn't doing extra work (I once wrote a sometimes-accidentally-exponential typechecker), and it seemed to me to be working correctly.
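For illustration, the `Proxy`-to-function change was roughly of this shape (a sketch of the idea, not the actual Brython code):

```js
// Before: token access goes through a Proxy trap on every t[i] lookup.
function lazyTokensProxy(nextToken) {
  const cache = [];
  return new Proxy(cache, {
    get(target, prop) {
      if (typeof prop === 'string') {
        const i = Number(prop);
        if (Number.isInteger(i) && i >= 0) {
          while (target.length <= i) target.push(nextToken()); // tokenize on demand
          return target[i];
        }
      }
      return target[prop]; // length, methods, symbols, ...
    },
  });
}

// After: a plain function keeps the same on-demand tokenization
// without routing every access through a trap.
function lazyTokensFn(nextToken) {
  const cache = [];
  return function token(i) {
    while (cache.length <= i) cache.push(nextToken());
    return cache[i];
  };
}

// demo: a fake tokenizer that counts up
let n = 0;
const t = lazyTokensFn(() => ({ type: 'NUMBER', value: String(n++) }));
console.log(t(2).value); // "2" -- tokens 0..2 were generated on demand
```

Both versions tokenize lazily, but the plain function avoids a `get` trap on every single token access, which JS engines are notoriously bad at optimizing.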
I then looked at performance on individual pyperformance benchmarks to find the benchmarks that had the worst performance relative to the handmade parser. This led me to learn that parsing a file full of string literals is extra bad for the PEG parser. In fact, parsing a single integer literal is pretty bad. From this point on, parsing the simple file `1` became a standard test for me.

I think parsing an integer is slow because it's a case where the parser has to visit many, many grammar rules. You can see all of the rules Brython's PEG parser visits to parse `1` here. It's a lot of rules! And each of those rules triggers many additional function calls.

I'm certainly no JavaScript optimization expert, but here are my guesses for why the existing PEG parser is slower than the handmade parser, at a high level:
There may be another large optimization opportunity I'm missing here. But I think the current "generic parser" architecture comes with some inherent slowness. (I did try standardizing the grammar rules to all have the same keys, with some null values, but that didn't seem to make a difference).
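The standardized-keys experiment looked something like this (my reconstruction; the real grammar objects have more fields):

```js
// Mixed shapes: each rule object has a different set of keys, so rule
// lookups are polymorphic for the JS engine (many hidden classes).
const mixedRules = [
  { name: 'atom', alts: [['NUMBER'], ['NAME']] },
  { name: 'NUMBER', token: true },
];

// Uniform shapes: every rule carries the same keys (null when unused),
// giving the engine a single hidden class for all rule objects.
const uniformRules = [
  { name: 'atom',   token: null, alts: [['NUMBER'], ['NAME']] },
  { name: 'NUMBER', token: true, alts: null },
];
```

That it made no measurable difference may hint that the overhead lives in the dynamic rule dispatch itself rather than in property access.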
What is a reasonable path forward for a PEG parser?
I think much of Pierre's existing work on the PEG parser could be used to support a parser generator. Instead of storing the grammar as an object and visiting rules dynamically, we could pre-generate a function for each grammar rule, and inline some of that rule's evaluation so there are fewer function calls.
CPython takes this approach and uses this program to generate the C code for its parser. Brython could modify this file to produce JavaScript output instead.
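Applied to Brython, the generated output could look something like the sketch below: one plain function per grammar rule, with backtracking and memoization inlined. The rule, helper names, and memo scheme are all illustrative, not pegen's actual output (and the real `sum` rule is left-recursive, which pegen handles with a special loop; this sketch uses a right-recursive variant to stay short):

```js
// Grammar fragment (right-recursive for simplicity): sum: term '+' sum | term
// p is the parser state: { tokens, pos, memo }.

function save(p, key, node) {           // memoize result + end position
  p.memo[key] = { node, endPos: p.pos };
  return node;
}

function expect(p, value) {             // match a literal token
  const tok = p.tokens[p.pos];
  if (tok && tok.value === value) { p.pos++; return tok; }
  return null;
}

function rule_term(p) {                 // term: NUMBER
  const tok = p.tokens[p.pos];
  if (tok && tok.type === 'NUMBER') {
    p.pos++;
    return { type: 'Num', value: tok.value };
  }
  return null;
}

function rule_sum(p) {
  const key = 'sum:' + p.pos;
  if (key in p.memo) {                  // reuse earlier result at this position
    p.pos = p.memo[key].endPos;
    return p.memo[key].node;
  }
  const mark = p.pos;
  let left, right;
  // alternative 1: term '+' sum
  if ((left = rule_term(p)) !== null &&
      expect(p, '+') !== null &&
      (right = rule_sum(p)) !== null) {
    return save(p, key, { type: 'BinOp', op: '+', left, right });
  }
  p.pos = mark;                         // backtrack
  // alternative 2: term
  if ((left = rule_term(p)) !== null) {
    return save(p, key, left);
  }
  p.pos = mark;
  return save(p, key, null);
}

const p = {
  tokens: [
    { type: 'NUMBER', value: '1' },
    { type: 'OP', value: '+' },
    { type: 'NUMBER', value: '2' },
  ],
  pos: 0,
  memo: {},
};
console.log(JSON.stringify(rule_sum(p)));
```

Because each alternative is a plain `if` over direct function calls, the engine can inline and optimize the hot rules, instead of interpreting a grammar object at runtime.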
I did a fast, ugly proof of concept of this that used some not-totally-correct JS utilities and a custom grammar with a JS subheader and trailer, whose actions are totally wrong but are at least syntactically valid JS. To be clear, all of this works so poorly that it doesn't successfully parse `1`. But it does at least visit the same set of rules as Brython's PEG parser, which I think was enough to learn something.

Here's how long it took for a couple parsers to parse `1` 1,000 times:

Pretty good! That means a generated parser might perform only ~2x slower than the handmade parser in the worst cases.
The "generated parser" was also ~2.5x faster than the "improved parser" for this test. If that speedup applied to the average case, then a generated parser might be just as fast as the handmade parser on average.
There's really not a lot of data here. But I think this is an approach worth exploring. The performance of this approach shows some tentative promise, I was able to get pretty far with it with only about a day and a half of work, and the idea of using CPython's tooling to generate Brython's parser feels compelling to me. We'd be taking advantage of optimization and research that the CPython project has done for their own parser, and adding that to Brython almost for free. Generating "C-like" JavaScript code also feels like a good strategy for generating fast and optimization-friendly code.
CPython's goal was to keep the PEG parser's performance within 10% of the old parser in terms of speed and memory consumption. I'm not sure if Brython has to be that strict.
The benefits of a parser generator for this project feel large enough to me to justify some slowdown. Probably not a 3.5x slowdown though. I'm not sure where that line is for me.
Thanks for reading!
I had fun diving into this. A nice unintended side effect of this exploration was learning some of the nitty gritty details of CPython's parser.
I hope what I've learned is helpful (or at least interesting!) to the Brython community. And I hope this research inspires somebody to build on Brython's PEG parser journey.
-- Austin