-
-
Notifications
You must be signed in to change notification settings - Fork 6
Home
A python PEG parser generator and processor intended to be clear, easy, practical
pijnu
was created in 2009 by Denis "Spir" Derman and then transferred to Peter Potrowl (peter17 on GitHub) in June 2011.
This document was created by Spir. So, in the following text, "I" refers to him.
changes:
~ Split the package in organized sub-packages: library, generator, doc, samples.
~ Consistent import system (still little things are unclear).
~ Added a module gen.py
to create a parser from command line.
See section on "evolution" for details.
major change:
Fixed a bug that made parsing generation 50X slower ;-)
Due to the fact that wrapper patterns recorded sub-pattern error messages instead of references to error objects. This made all error messages to be written [this is (too) slow] for nearly all match failures, while normally none of them will be output (~ 10000 failures for pijnu
's own meta-parser).
Also added statistics at parser creation.
major change:
List transformation function, called intoList
.
Now, the user does not need to cope with detail transformations (drop
, extract
, liftValue
) on sub-patterns for a separated list. Typically, the scheme is as follows.
itemList : item (SEP item)* : intoList
will not yield anymore results like:
itemList:[item [[SEP item] [SEP item]...]]
but instead directly:
itemList:[item item item...]
intoList
also works when the minimum number of items is more than 1, as is usual for operations with 2 or more operands:
sum : operand PLUS operand (PLUS operand)* : intoList
will now yield results like:
sum:[operand:1 operand:2 operand:3...]
first real public release
Introduced the user guide, also online at guide.
Fixed some unimportant details while writing the user guide.
Most issues remain, but pijnu
is globally nicely usable as is.
first upload to site Numerous issues remain: see section on "evolution" below.
What is Parsing Expression Grammar?
See also below section on "grammar & parser" for a first overview.
- Wikipedia article on PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar
- PEG & packrat parser: http://pdos.csail.mit.edu/~baford/packrat/
- Thesis on PEG & packrat parsing: http://pdos.csail.mit.edu/~baford/packrat/thesis/
- Article by Bryan Ford: http://www.brynosaurus.com/pub/lang/peg.pdf
- Disadvantages of PEG: http://www.experiencefestival.com/parsing_expression_grammar_-_disadvantages
- University course on PEG: http://iamwww.unibe.ch/~scg/Teaching/CC/11PEGs.pdf
pijnu
syntax is a custom, extended version of Parsing Expression Grammars (PEG -- see links above); which itself is a kind of mix of BNF and regular expressions.
The major difference is that PEG is a grammar to express string recognition patterns, while BNF or regexp express string generation. As a consequence, PEG is better suited for parsing tasks. A PEG grammar clearly encodes the algorithm to parse a source string, that simply needs to be rewritten into a parser coded in a programming language.
pijnu
generated parsers use a library to parse source texts. This library mainly holds pattern classes, a Node type for the resulting parse tree, and several kinds of tool functions.
The parser is produced from the grammar by a generator which, indeed, itself "meta-uses" the library. For the story, all but the first generator were themselves "meta-produced" by previous versions of the generator: pijnu
is bootstrapped.
### simple arithmetics with '+' and '*' only
formula
<definition>
# tokens
ADD : '+'
MULT : '*'
LPAREN : '('
RPAREN : ')'
digit : [0..9.]
number : digit+
# operations
mult : (grup/number) MULT (grup/mult/number)
add : (grup/mult/number) ADD (grup/add/mult/number)
grup : LPAREN (add / mult) RPAREN
formula : add / mult / number
A parsing phase produces a parse tree in which every node was yielded by a pattern. Simple leaf nodes hold the matched string snippet while branch nodes contain a sequence of child nodes. A major issue in text processing applications is that a raw parse tree is far from having a form well suited for further processing.
Using pijnu
, one can do far more that getting a parse tree. The grammar allows assigning transformation functions to patterns, that will then be applied to every generated node. Numerous in-built transformations are provided in order to easily restructure the resulting parse tree and/or modify specific nodes.
Moreover, a user can write custom functions right inside the grammar that will then be applied to directly perform further processing. This is a both very simple and highly powerful method. In most cases, one can get final results "like magic".
For instance, to compute the actual result from the above formula grammar, one needs only 2 single-line functions: one for each operation indeed. Then, the result of the parsing/processing process is the result of the expressed formula.
Another example that will generate XHTML from wiki-text styled lines (possibly nested), using a single 3-lines function:
### parse wiki-text styled lines and rewrite them into XHTML
wikInline
<toolset>
def styledSpan(node):
klass = node.typ
text = node.value
node.value = '<span class="%s">%s</span>' %(klass,text)
<definition>
# codes
ESCAPE : '~'
DISTINCT : "//" : drop
IMPORTANT : "**" : drop
styleCode : (DISTINCT / IMPORTANT)
# character expression
escChar : ESCAPE ('*' / '!' / '/' / ESCAPE) : join
validChar : [\\x20..\\xff !!/!*~]
rawText : (escChar / (!styleCode validChar))+ : join
# text kinds
distinctText : DISTINCT inlineText DISTINCT : liftValue
importantText : IMPORTANT inlineText IMPORTANT : liftValue
styledText : distinctText / importantText : styledSpan
inlineText : (styledText / rawText)+ : @ join
The column on right side assigns transformations to patterns. drop
, join
, and liftValue
are builtin. styledSpan
is a custom transformation. '@'
denotes a recursive pattern.
See the guide & tutorial from the link above for details.
As a tool, pijnu
is hopefully clear and efficient for the user.
It provides highly informative feedback -- maybe too much ;-) even -- about patterns, results and exceptions.
Custom extensions from PEG help defining legible grammars -- there may be more in the future. There are also pre-processing functions and configuration parameters that may be worthful in practical cases, but still need be fully integrated (see below section on "evolution").
Typically, a user will define the grammar, import the generator and let it write a corresponding parser. This parser comes in the form of a python module from which a parser object can be imported. The said parser object and each of its patterns can be used to match a source text partially or completely, find first or all occurrences of matches, or replace found matches. In most cases, transformation will restructure and further process the resulting parse tree.
from pijnu import generator
generator.writeParser(myGrammar)
from myGrammarParser import myGrammarParser
myGrammarParser.match(source)
New: It is now possible to directly produce a parser from the command line using the gen.py
module (later may be renamed to pijnu.py
):
python gen.py myGrammar.pijnu myParser.py
Here is a sketch of possible future evolution of pijnu
as a list of TODO, questions & issues.
Note by Peter17: This list should be verified to be up-to-date.
As indicated on this page and on each file of the software, pijnu
is distributed under the free license GPLv3.
should pijnu
be optimized for speed?
pijnu
has been designed as a clear and practical tool. Every trade-off was made according to these principles, which should allow for fast use, in human time. Conversely, machine run time speed has never been considered as a relevant goal. As a consequence, pijnu
seems to be very slow.
I personally have no skill for this; and also no motivation as of now. Comments, critical use cases, & contributions welcome.
a clear packaging and importing system
There is no packaging using __init.py__
file(s) as of now. While I still haven't figured out how they are supposed to help the developer or the user (all works fine anyway), it may be a good idea at least to comply with standards.
The whole application should be split into library, generator, samples, and doc folders. Imports should be clarified to allow importing either the generator to create parsers, or else the library rather for exploration (usually the lib is not needed, for the parsers import it transparently), or even both.
Also, there should be a way to allow the user import all names from library modules directly through import *
; which means that a lib module, maybe __init__.py
, should import all of them.
add simple & fast command line mode
python pijnu.py grammarFileName
should generate a parser in grammarTitleParser.py
python pijnu.py grammarFileName parserFileName
should generate a parser in parserFileName
To be added in pijnu.py
, or maybe __init__.py
, top-level source file. (actually in gen.py
because of import mess)
-- 5 minute work ;-) (actually was a bit more because of import mess)
a consistent testing framework
The whole package is full of ad hoc test suites. A main issue is that standard testing tools (at least the ones I know of) are totally unpractical with such an application for which both input and output typically are big and complicated texts. At the very minimum, the tests may be cleaned a bit; reformatted when needed to ensure a common interface; and integrated into one or more main test control functions, probably one for the library and one for the generator.
a collection of sample applications
Contributions welcome! ;-)
a continuation code to allow patterns be expressed on 2 or more lines
This is even more pertinent for pijnu
, due to the 3rd column for transformations. Also, not everybody preferes (as I do) splitting complex expressions into sub-patterns; even if this has obvious advantages in terms of simplicity, clarity, and naming.
Probable syntax would use '...'
, possibly repeated at start of continuation line:
pat : (a long pattern format) / (even longer here) / ...
... (continued on next line)
-- or --
pat : (a long pattern format) / (even longer here) / ...
(continued on next line)
builtin transformation for separated lists
Should yield a simple sequence of items, without the need of separate drop
, liftValue
, and extract
transformations. Should the separators be dropped before (anyway they usually are for the same are used in numerous patterns)? Or not at all (then often need be copied without drop
)? Or the transform should be flexible & adapt to both cases?
Done: intoList
extract matched items into a flat sequence, meaning a simple branch node. The minimum number of items can be 1, 2, or more. Separators can be dropped or not.
The issue is that the parse tree is messy.
itemList : item (SEP item)*
==>
itemList:[item [[SEP item] [SEP item]...]]
The only solution is a combination of drop (on SEP), liftValue (on each additional item), extract (on the sequence of additional items).
We can now get directly what we want using intoList
.
itemList : item (SEP item)* : intoList
==>
itemList:[item item item...]
intoList
will also deal with patterns for operations that allow more than 2 operands.
operation : oper OP oper (OP oper)* : intoList
==>
operation:[oper oper oper oper...]
integrate AnyChar pattern for '.' in grammar
Has been left out earlier and forgot to put it back -- mainly for I never use it. See ValidChar below to know why.
pijnu
probably still needs it if only for people used to other variants of PEG.
a variant of AnyChar to allow only valid characters
The issue is that in most cases not any char is a valid char.
The pattern class is ready. Just find a nice way to integrate it in the grammar.
The klass of valid characters is specified through a configuration parameter: actually the attribute ValidChar.CONFIG
. So ValidChar should be "officially" introduced together with configuration below.
A nice solution may be to combine it with AnyChar: when no character class is specified, the pattern acts like AnyChar, else it acts like ValidChar. Then, we don't need any special sign in pijnu
language for ValidChar.
add an "until" STOP condition to repetitions (see this topic in the guide, section practical use -> more on patterns -> LookAhead) The purpose is to avoid ugly things like:
pat : (!"HALT" [a..z A..Z])+ "HALT"
pat : (!"X"{3} ("X" / "Y" / "Z"))* "X"{3}
And have the following instead, using the sign '>'
:
pat : [a..z A..Z]+>"HALT"
pat : ("X" / "Y" / "Z")*>"X"{3}
But in practice the need for it seems to be rather rare.
config parameters inside the grammar itself
There are some configuration parameters to allow a certain (sensible?) level of customization. Typically, they are attributes of the concerned type. First write a reference of all existing ones.
- pattern, node, and exception output
- pre-process
- special patterns such as AnyChar/ValidChar ('.').
Need more?
Then have a syntax to specify them inside the grammar; possibly anywhere and repetitively, for some may change (e.g. ValidChar's PATTERN param). This would mean (& look like) setting an attribute on the proper type.
Node.SHOW_TREE = True
Some may define a pattern, so they must be parsed, indeed.
ValidChar.PATTERN = [\x20..\x7e \n\t]
allow specifying pre-process functions in grammar
To be applied on source before parsing, in order to help having a clear grammar and/or avoid context-dependant issues. Typical cases.
- normalize separators to a single form (eg all whitespace to a single space)
- transform indented structure to wrapped structure in block open & close tokens
- join or split lines according to continuation or separator tokens
Some in-built tranform functions exist already in the package, namely in the preprocess.py
module: they can be manually called.
Pre-processes would build another section of the grammar, coming between and . Like for the case of transformations, some pre-process functions would be in-built, while the definition of custom ones would come in the section.
Problem: the calls to pre-processes need to be collected into the parser object in order to be properly applied before parsing. The issue is that this cannot work when an individual pattern is accessed & used for matching, instead of the parser itself. This may lead to confusion due to seemingly inconsistent results or even 'random' parse failures.
Thus, it may be better to forget this all, simply document the builtin pre-processes and let the user call these builtin functions or custom ones 'manually'.
a func to detect recursive patterns
To avoid the need for the user to add a special code '@'
denoting such patterns. Not a major issue -- in fact, this can well be seen as a help to better understand the grammar...
There may also be a function to detect 'wrong' recursion: when a recursive pattern written on left side of a format leads to infinite recursion.
Note by Peter17: This list should be verified to be up-to-date.
post-match func Transform a branch node into a leaf which value is a collection (list? tuple?) of its previous child values. Handy for simplifying a parse tree and/or yielding custom node values.
in the meta-parser toolset The repetitionCode transformation that writes a Repetition or String expression copes both with the kind of repetition suffix (*,+,{n},{m,n}) and with the kind of base pattern (Klass --> String, other --> Repetition). Split that complexity into two transformations, by a adding repetSuffixCode transformation.
Allow directly naming at pattern creation (param).
in the meta-parser This should come as a general Pattern creation param (not for each pattern type).
in the meta-parser Instead of naming patterns at parser creation, store the name at parse time and add it as param in the Pattern init (like original format).
Extract all test suites into separate module.
Add a param to test func(s) to have shortest output when all works fine.
Have both a deep (all classes, methods, funcs) and superficial (user-level) test suites.
A parser/pattern method to run a test suite from a dict. A dict param holds {source:result} pairs that will be asserted equal. None result means expected failure. 2 levels of output. Intended to check that modifitations do not break fonctionality. Veeery handy ;-)
A parser/pattern method to build, print and return a {source:result} dict from a list of source expressions. Either to later use in testSuite -- then should be used only once when all runs fine; or simply to run a set of tests on the same pattern and check the nicely presented output.
in getPattern() Allow named pattern format: "digit : [0..9]"
in writeParser (-->makeParser) Return parser object. Then, rename writeParser to makeParser.
Simplify parser creation for the user, when grammar is in a file. Several possibilities: . as is: user must read file to pass grammar text . detect "*.pijnu" as 'grammar' param . detect a file object as 'grammar' param . add 'from_file' flag arg . create distinct makeParserFromFile func that would then call makeParser . tool func to read file content (and close it) Done: added fileText() and writeFile() funcs to python "tools" module; both imported by the generator; fileText exported.
Allow TAB everywhere in spacing inside the central format part of a pattern def!!! Maybe allow non breakable space, too?
comprises title & free text. Allow blank & comment lines before & after title.
to Node.data or Node.content? Seems value is OK.
to be simplified & cleaned up esp. remove FULL_OUTPUT take care of recursivity
Extract samples from test part of generator module, rewrite and check them. Export them into /samples. This can possibly make a superficial (user-side) test suite using testStuite.
A genTest.py for generator.py already exists. Just add another sample, with a different pattern combination, for makeParser; and a simple one for getPattern.
in Choice's new method When Choice patterns all are klasses or chars, compose a "super" klass, instead of yielding a general choice.
Change it so that Pattern repr() becomes repr of instanciation (eg "Char('a')")? Idem for Node? But in the case of a node, the actual value may be changed by transforms. But they are applied too in init, so it may not be an issue.