peter17 edited this page Jun 13, 2011 · 7 revisions

A Python PEG parser generator and processor intended to be clear, easy, and practical

pijnu was created in 2009 by Denis "Spir" Derman and then transferred to Peter Potrowl (peter17 on GitHub) in June 2011.

This document was created by Spir. So, in the following text, "I" refers to him.

release history

26 may 2009

changes:
  • Split the package into organized sub-packages: library, generator, doc, samples.
  • Consistent import system (a few details are still unclear).
  • Added a module gen.py to create a parser from the command line.
See the section on "evolution" for details.

19 may 2009

major change: Fixed a bug that made parser generation 50X slower ;-) It was due to wrapper patterns recording sub-pattern error messages instead of references to error objects. As a result, all error messages were written out [which is (too) slow] for nearly all match failures, while normally none of them is ever output (~ 10000 failures for pijnu's own meta-parser). Also added statistics at parser creation.

11 may 2009

major change: a list transformation function, called intoList. The user no longer needs to apply detail transformations (drop, extract, liftValue) to the sub-patterns of a separated list. Typically, the scheme is as follows.

itemList  :  item (SEP item)*  :  intoList

will no longer yield results like:

itemList:[item  [[SEP  item]  [SEP  item]...]]

but instead directly:

itemList:[item  item  item...]

intoList also works when the minimum number of items is more than 1, as is usual for operations with 2 or more operands:

sum  :  operand PLUS operand (PLUS operand)*  : intoList

will now yield results like:

sum:[operand:1  operand:2  operand:3...]

04 may 2009

first real public release. Introduced the user guide, also online at guide. Fixed some minor details while writing it. Most issues remain, but pijnu is globally quite usable as is.

16 may 2009

first upload to the site. Numerous issues remain: see the section on "evolution" below.

useful literature

What is Parsing Expression Grammar?

See also the section on "grammar & parser" below for a first overview.

grammar & parser

pijnu syntax is a custom, extended version of Parsing Expression Grammars (PEG -- see links above), which are themselves a kind of mix of BNF and regular expressions.

The major difference is that PEG is a grammar for expressing string recognition patterns, while BNF and regexps express string generation. As a consequence, PEG is better suited for parsing tasks. A PEG grammar directly encodes the algorithm for parsing a source string, which simply needs to be rewritten as a parser in a programming language.

pijnu generated parsers use a library to parse source texts. This library mainly holds pattern classes, a Node type for the resulting parse tree, and several kinds of tool functions.

The parser is produced from the grammar by a generator which itself "meta-uses" the library. As a side note, all but the first generator were themselves "meta-produced" by previous versions of the generator: pijnu is bootstrapped.

### simple arithmetics with '+' and '*' only
formula
<definition>
# tokens
ADD        : '+'
MULT       : '*'
LPAREN     : '('
RPAREN     : ')'
digit      : [0..9.]
number     : digit+
# operations
mult       : (grup/number) MULT (grup/mult/number)
add        : (grup/mult/number) ADD (grup/add/mult/number)
grup       : LPAREN (add / mult) RPAREN
formula    : add / mult / number

post-process & transformations

A parsing phase produces a parse tree in which every node was yielded by a pattern. Simple leaf nodes hold the matched string snippet while branch nodes contain a sequence of child nodes. A major issue in text processing applications is that a raw parse tree is far from having a form well suited for further processing.

Using pijnu, one can do far more than getting a parse tree. The grammar allows assigning transformation functions to patterns, which will then be applied to every generated node. Numerous built-in transformations are provided in order to easily restructure the resulting parse tree and/or modify specific nodes.

Moreover, a user can write custom functions right inside the grammar that will then be applied to directly perform further processing. This method is both very simple and highly powerful. In most cases, one can get final results "like magic".

For instance, to compute the actual result from the above formula grammar, one needs only two single-line functions, one per operation. The result of the parse-and-process phase is then the value of the expressed formula.
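The original page does not show those two functions; here is a minimal standalone sketch of the idea, using plain Python lists in place of pijnu nodes (the names doAdd and doMult are hypothetical, not pijnu's API):

```python
# Hypothetical one-line transforms for the 'add' and 'mult' patterns.
# Each receives the flattened operand values of its node (as after intoList).
def doAdd(operands):
    return sum(operands)

def doMult(operands):
    result = 1
    for value in operands:
        result *= value
    return result

# Applied bottom-up while processing "2*3+4": mult:[2 3] reduces to 6,
# then add:[6 4] reduces to 10.
print(doAdd([doMult([2, 3]), 4]))   # 10
```

In a real pijnu grammar these would sit in the `<toolset>` section and mutate the node's value rather than return it, but the reduction logic is the same.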

Another example generates XHTML from wiki-text styled lines (possibly nested), using a single 3-line function:

### parse wiki-text styled lines and rewrite them into XHTML
wikInline
<toolset>
def styledSpan(node):
    klass = node.typ
    text = node.value
    node.value = '<span class="%s">%s</span>' %(klass,text)

<definition>
# codes
    ESCAPE         : '~'
    DISTINCT       : "//"                                : drop
    IMPORTANT      : "**"                                : drop
    styleCode      : (DISTINCT / IMPORTANT)
# character expression
    escChar        : ESCAPE ('*' / '!' / '/' / ESCAPE)   : join
    validChar      : [\x20..\xff  !!/!*~]
    rawText        : (escChar / (!styleCode validChar))+ : join
# text kinds
    distinctText   : DISTINCT inlineText DISTINCT        : liftValue
    importantText  : IMPORTANT inlineText IMPORTANT      : liftValue
    styledText     : distinctText / importantText        : styledSpan
    inlineText     : (styledText / rawText)+             : @ join

The column on the right side assigns transformations to patterns. drop, join, and liftValue are built-in; styledSpan is a custom transformation. '@' denotes a recursive pattern.
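The styledSpan function above can be exercised on its own with a stub node class (StubNode is invented here for illustration; pijnu's real Node type has more machinery than these two attributes):

```python
# Minimal stand-in for a pijnu parse-tree node: just the two attributes
# that styledSpan reads and writes.
class StubNode:
    def __init__(self, typ, value):
        self.typ = typ      # pattern name, e.g. "importantText"
        self.value = value  # matched (and possibly transformed) text

# The custom transformation from the grammar's <toolset> section, verbatim.
def styledSpan(node):
    klass = node.typ
    text = node.value
    node.value = '<span class="%s">%s</span>' % (klass, text)

node = StubNode("importantText", "hello")
styledSpan(node)
print(node.value)   # <span class="importantText">hello</span>
```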

practical use

See the guide & tutorial from the link above for details.

As a tool, pijnu is hopefully clear and efficient for the user. It provides highly informative feedback -- maybe even too much ;-) -- about patterns, results, and exceptions. Custom extensions to PEG help define legible grammars -- there may be more in the future. There are also pre-processing functions and configuration parameters that may be valuable in practical cases, but they still need to be fully integrated (see the section on "evolution" below).

Typically, a user will define the grammar, import the generator, and let it write a corresponding parser. This parser comes in the form of a Python module from which a parser object can be imported. This parser object, and each of its patterns, can be used to match a source text partially or completely, find the first or all occurrences of matches, or replace found matches. In most cases, transformations will restructure and further process the resulting parse tree.

from pijnu import generator
generator.writeParser(myGrammar)
from myGrammarParser import myGrammarParser
myGrammarParser.match(source)

New: It is now possible to produce a parser directly from the command line using the gen.py module (which may later be renamed pijnu.py):

python gen.py myGrammar.pijnu myParser.py

evolution

Here is a sketch of possible future evolution of pijnu as a list of TODO, questions & issues.

Note by Peter17: This list should be verified to be up-to-date.

license -- DONE

As indicated on this page and in each file of the software, pijnu is distributed under the free license GPLv3.

machine run time -- LEFT OUT

should pijnu be optimized for speed?

pijnu has been designed as a clear and practical tool. Every trade-off was made according to these principles, which should allow for fast use in human time. Conversely, machine run-time speed has never been considered a relevant goal. As a consequence, pijnu seems to be very slow.

I personally have no skill for this, and no motivation as of now. Comments, critical use cases, and contributions welcome.

package & import -- DONE

a clear packaging and importing system

There is no packaging using __init__.py file(s) as of now. While I still haven't figured out how they are supposed to help the developer or the user (everything works fine anyway), it may be a good idea at least to comply with standards. The whole application should be split into library, generator, samples, and doc folders. Imports should be clarified to allow importing either the generator (to create parsers), or the library (rather for exploration; usually the lib is not needed, since parsers import it transparently), or both.

Also, there should be a way to let the user import all names from the library modules directly through import *; which means that one lib module, maybe __init__.py, should import all of them.

command line -- DONE

add simple & fast command line mode

python pijnu.py grammarFileName

should generate a parser in grammarTitleParser.py

python pijnu.py grammarFileName parserFileName

should generate a parser in parserFileName

To be added in pijnu.py, or maybe __init__.py, as a top-level source file (it actually went into gen.py because of the import mess) -- a 5-minute job ;-) (it actually took a bit longer, for the same reason).

test -- TODO

a consistent testing framework

The whole package is full of ad hoc test suites. A main issue is that standard testing tools (at least the ones I know of) are totally impractical for such an application, where both input and output are typically big and complicated texts. At the very minimum, the tests could be cleaned up a bit, reformatted when needed to ensure a common interface, and integrated into one or more main test control functions, probably one for the library and one for the generator.

samples

a collection of sample applications

Contributions welcome! ;-)

line continuation -- TODO

a continuation code to allow patterns to be expressed on 2 or more lines. This is even more relevant for pijnu, due to the 3rd column for transformations. Also, not everybody prefers (as I do) splitting complex expressions into sub-patterns, even if this has obvious advantages in terms of simplicity, clarity, and naming.

Probable syntax would use '...', possibly repeated at start of continuation line:

pat   : (a long pattern format) / (even longer here) / ...
        ... (continued on next line)

-- or --

pat   : (a long pattern format) / (even longer here) / ...
        (continued on next line)

list transform -- DONE

builtin transformation for separated lists

Should yield a simple sequence of items, without the need for separate drop, liftValue, and extract transformations. Should the separators be dropped beforehand (they usually are anyway, since the same separators are used in numerous patterns)? Or not at all (then they often need to be copied without drop)? Or should the transform be flexible and adapt to both cases?

Done: intoList extracts the matched items into a flat sequence, meaning a simple branch node. The minimum number of items can be 1, 2, or more. Separators can be dropped or not.

The issue is that the parse tree is messy.

itemList	: item (SEP item)*
==>
itemList:[item  [[SEP  item]  [SEP  item]...]]

The only solution is a combination of drop (on SEP), liftValue (on each additional item), extract (on the sequence of additional items).

We can now get directly what we want using intoList.

itemList	: item (SEP item)*			: intoList
==>
itemList:[item  item  item...]

intoList will also deal with patterns for operations that allow more than 2 operands.

operation	: oper OP oper (OP oper)*	: intoList
==>
operation:[oper  oper  oper  oper...]
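The flattening that intoList performs can be sketched in plain Python, with nested lists standing in for parse-tree nodes (into_list below is an illustration of the idea, not pijnu's actual implementation):

```python
# Flatten a separated-list parse result of shape
#   [item, [[SEP, item], [SEP, item], ...]]
# into a simple sequence [item, item, item, ...].
def into_list(node):
    head, rest = node[0], node[1]
    items = [head]
    for pair in rest:           # each pair is [SEP, item]
        items.append(pair[-1])  # keep the item, ignore the separator
    return items

tree = ["a", [["SEP", "b"], ["SEP", "c"]]]
print(into_list(tree))   # ['a', 'b', 'c']
```

The real transform also handles the case where separators were already dropped and where the minimum number of items is 2 or more, as described above.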

AnyChar -- TODO

integrate AnyChar pattern for '.' in grammar

It was left out earlier and I forgot to put it back -- mainly because I never use it (see ValidChar below to know why). pijnu probably still needs it, if only for people used to other variants of PEG.

ValidChar -- TODO

a variant of AnyChar to allow only valid characters

The issue is that in most cases not any char is a valid char.

The pattern class is ready. Just find a nice way to integrate it in the grammar. The klass of valid characters is specified through a configuration parameter: actually the attribute ValidChar.CONFIG. So ValidChar should be "officially" introduced together with configuration below.

A nice solution may be to combine it with AnyChar: when no character class is specified, the pattern acts like AnyChar, else it acts like ValidChar. Then, we don't need any special sign in pijnu language for ValidChar.

"STOP repetition" -- LEFT OUT

add an "until" STOP condition to repetitions (see this topic in the guide, section practical use -> more on patterns -> LookAhead) The purpose is to avoid ugly things like:

pat : (!"HALT" [a..z A..Z])+ "HALT"
pat : (!"X"{3} ("X" / "Y" / "Z"))* "X"{3}

And have the following instead, using the sign '>' :

pat : [a..z A..Z]+>"HALT"
pat : ("X" / "Y" / "Z")*>"X"{3}

But in practice the need for it seems to be rather rare.

config -- TODO

config parameters inside the grammar itself

There are some configuration parameters to allow a certain (sensible?) level of customization. Typically, they are attributes of the concerned type. First, write a reference list of all existing ones.

  • pattern, node, and exception output
  • pre-process
  • special patterns such as AnyChar/ValidChar ('.').

Need more?

Then, have a syntax to specify them inside the grammar; possibly anywhere and repeatedly, since some may change (e.g. ValidChar's PATTERN param). This would mean (& look like) setting an attribute on the proper type.

Node.SHOW_TREE = True

Some may define a pattern, so they must themselves be parsed.

ValidChar.PATTERN = [\x20..\x7e \n\t]

pre-process -- LEFT OUT

allow specifying pre-process functions in grammar

To be applied on the source before parsing, in order to help keep the grammar clear and/or avoid context-dependent issues. Typical cases:

  • normalize separators to a single form (eg all whitespace to a single space)
  • transform indented structure to wrapped structure in block open & close tokens
  • join or split lines according to continuation or separator tokens
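Pre-process helpers of this kind might look as follows (the function names here are invented for illustration; pijnu's own built-ins live in preprocess.py and may differ):

```python
import re

# Normalize separators: collapse any run of whitespace to a single space.
def normalize_spacing(source):
    return re.sub(r'\s+', ' ', source)

# Join lines according to a continuation token: a backslash at end of line
# glues the next line onto the current one.
def join_continuations(source):
    return source.replace('\\\n', '')

print(normalize_spacing("a \t b\n c"))   # a b c
```

As noted below, such functions can simply be called manually on the source text before handing it to the parser.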

Some built-in transform functions already exist in the package, namely in the preprocess.py module: they can be called manually.

Pre-processes would form another section of the grammar, coming between <toolset> and <definition>. As with transformations, some pre-process functions would be built-in, while the definition of custom ones would come in the <toolset> section.

Problem: the calls to pre-processes need to be collected into the parser object in order to be properly applied before parsing. The issue is that this cannot work when an individual pattern is accessed & used for matching, instead of the parser itself. This may lead to confusion due to seemingly inconsistent results or even 'random' parse failures.

Thus, it may be better to forget this all, simply document the builtin pre-processes and let the user call these builtin functions or custom ones 'manually'.

recursion -- LEFT OUT

a func to detect recursive patterns

To avoid the need for the user to add a special code '@' denoting such patterns. Not a major issue -- in fact, the code can well be seen as a help to better understand the grammar... There may also be a function to detect 'wrong' recursion: when a recursive pattern written at the left side of a format leads to infinite recursion.

Ideas and TODO

Note by Peter17: This list should be verified to be up-to-date.

collectValues -- DONE

post-match func. Transforms a branch node into a leaf whose value is a collection (list? tuple?) of its previous child values. Handy for simplifying a parse tree and/or yielding custom node values.

Repetition & String -- DONE

in the meta-parser toolset. The repetitionCode transformation that writes a Repetition or String expression copes both with the kind of repetition suffix (*, +, {n}, {m,n}) and with the kind of base pattern (Klass --> String, other --> Repetition). Split that complexity into two transformations by adding a repetSuffixCode transformation.

naming from code -- DONE

Allow directly naming at pattern creation (param).

original format storing -- DONE

in the meta-parser. This should come as a general Pattern creation param (not one for each pattern type).

naming patterns -- DONE

in the meta-parser. Instead of naming patterns at parser creation, store the name at parse time and add it as a param in the Pattern init (like the original format).

separate testing -- PROGRESS

Extract all test suites into separate module.

testing output size -- DONE for testSuite

Add a param to test func(s) to have shortest output when all works fine.

testing levels -- LATER

Have both a deep (all classes, methods, funcs) and superficial (user-level) test suites.

testSuite -- DONE

A parser/pattern method to run a test suite from a dict. A dict param holds {source:result} pairs that will be asserted equal. A None result means an expected failure. 2 levels of output. Intended to check that modifications do not break functionality. Veeery handy ;-)
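The dict-driven idea can be sketched independently of pijnu (run_test_suite and match_digits below are invented for illustration; pijnu's actual method lives on parser/pattern objects and raises its own match-failure exception):

```python
# Run a {source: expected_result} suite against a match function.
# An expected result of None means the match is expected to fail.
def run_test_suite(match, suite, verbose=False):
    for source, expected in suite.items():
        try:
            result = match(source)
        except ValueError:          # stand-in for a match-failure exception
            result = None
        assert result == expected, "source %r: got %r, expected %r" % (
            source, result, expected)
        if verbose:
            print("ok: %r -> %r" % (source, result))

# Toy "parser": accepts digit strings only.
def match_digits(source):
    if source.isdigit():
        return int(source)
    raise ValueError("no match")

run_test_suite(match_digits, {"42": 42, "x": None})
```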

testSuiteDict -- DONE

A parser/pattern method to build, print, and return a {source:result} dict from a list of source expressions. Either to use later in testSuite -- then it should be run only once, when everything works fine; or simply to run a set of tests on the same pattern and check the nicely presented output.

named pattern creation -- DONE

in getPattern(). Allow the named pattern format: "digit : [0..9]"

parser direct creation -- DONE

in writeParser (--> makeParser). Return the parser object; then rename writeParser to makeParser.

parser from file -- DONE

Simplify parser creation for the user when the grammar is in a file. Several possibilities:

  • as is: the user must read the file to pass the grammar text
  • detect "*.pijnu" as the 'grammar' param
  • detect a file object as the 'grammar' param
  • add a 'from_file' flag arg
  • create a distinct makeParserFromFile func that would then call makeParser
  • a tool func to read the file content (and close the file)

Done: added fileText() and writeFile() funcs to the python "tools" module; both imported by the generator; fileText exported.

spacing in grammar -- DONE

Allow TAB everywhere in spacing inside the central format part of a pattern def!!! Maybe allow non-breaking space, too?

better grammar header -- DONE

Comprises title & free text. Allow blank & comment lines before & after the title.

rename Node.typ to .tag -- DONE

rename Node.value -- ???

to Node.data or Node.content? Seems value is OK.

pattern output methods -- TODO

To be simplified & cleaned up: esp. remove FULL_OUTPUT; take care of recursivity.

samples from generator -- TODO

Extract samples from the test part of the generator module, rewrite and check them. Export them into /samples. This could possibly make a superficial (user-side) test suite using testSuite.

genTest expansion -- TODO

A genTest.py for generator.py already exists. Just add another sample, with a different pattern combination, for makeParser; and a simple one for getPattern.

super klass -- DONE

in Choice's new method. When a Choice's patterns are all klasses or chars, compose a "super" klass instead of yielding a general choice.

move Parser into pattern.py?

repr() -- TODO

Change it so that Pattern repr() becomes the repr of instantiation (eg "Char('a')")? Idem for Node? But in the case of a node, the actual value may be changed by transforms. But they are applied in init too, so it may not be an issue.