User's Guide
pijnu was created in 2009 by Denis "Spir" Derman and then transferred to Peter Potrowl (peter17 on GitHub) in June 2011.
This document was created by Spir. So, in the following text, "I" refers to him.
- rename "transformation" to "match action"!
- explain "magic methods" allowing to access child nodes directly in transformations (e.g. (key,val) = node instead of (key,val) = (node[0].value,node[1].value))
- add explanation of the format for Char-s and Klass-es! (escaped chars, decimal and hex codes, ranges, excluded chars and charsets, double-space as visual separator).
- clarify transformations, state actions & validations
- write the section on builtin transforms
- write a section on validation
- write the section on testing
This document is still in progress.
pijnu is a parsing and processing tool intended to be clear, easy, and fast in programmer time.
This guide & tutorial is targeted toward people with almost no knowledge of parsing, but some skill in programming Python. If you're an experienced user of parser generators or parsing frameworks, you should read this doc fast, but completely, in order to avoid overlooking pijnu specificities and get to know how it can help you best. If you're new to parsing instead, this guide may serve as a gentle introduction to this programming field, too.
If not done already, download the current version of pijnu by typing the line below (nothing to change).
git clone [email protected]:peter17/pijnu.git
The application is autonomous and there is no installation to perform. The best option is to put it somewhere your PYTHONPATH knows, so that you can import it from anywhere.
It has been tested only on a Linux Ubuntu 8.10 box with Python versions of the 2.6 series. However, it should run on any platform with a not-too-rusty Python release -- but not with Python 3 yet.
Imagine we want to parse source texts simply composed of space-separated integer or real numbers. We need to describe how a number can be written, as well as the overall structure of a document; both together define the format of valid source texts. I will call grammar this description and source an input text that should comply with the grammar.
Our aim is to get a parser, meaning a program able to
- check the validity of a source
- produce structured and informative data about the content found in the source
- process this data further so that it fits better our needs
The kind of data structure generated is called parse tree, or concrete syntax tree. You'll soon understand why ;-)
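To give a first concrete picture (this is plain Python for illustration, not pijnu code, and the tags are just examples), a parse tree for a source like "12 3.4" can be pictured as nested tagged nodes, where a branch node's value is a list of child nodes:

```python
# Toy illustration (not pijnu code): a parse tree as (tag, value) pairs,
# where a branch node's value is a list of child nodes.
tree = ("numbers", [
    ("integer", "12"),
    ("real", [("integer", "3"), ("POINT", "."), ("integer", "4")]),
])

def show(node, depth=0):
    """Render a node in an indented tree view."""
    tag, value = node
    if isinstance(value, list):          # branch node
        lines = ["  " * depth + tag + ":"]
        for child in value:
            lines.extend(show(child, depth + 1))
        return lines
    return ["  " * depth + "%s:%s" % (tag, value)]  # leaf node

print("\n".join(show(tree)))
```

Such indented tree views are exactly what pijnu's test output prints later in this guide.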
Before writing a whole grammar, we will explore the notion of pattern.
Open a console, move to the directory where you put pijnu if necessary, then launch Python.
First, let us define a pattern that should match any Latin lowercase letter:
>>> from pijnu import getPattern
>>> letter = getPattern("letter: [a..z]")
... creating pattern from
letter: [a..z]
--> Klass letter:[a..z]
>>> type(letter)
<class 'pijnu.library.pattern.Klass'>
>>> letter.charset
'abcdefghijklmnopqrstuvwxyz'
Right. We have a pattern object of the type Klass, meaning it should match any character of the specified class. Let's use it, now:
>>> letter.match('a')
letter:'a'
>>> letter.match('cba')
letter:'c'
>>> letter.findFirst("123456789uvw987654321")
letter:u
>>> letter.findFirst("123456789%%%987654321")
None
>>> letter.findAll("a123jkl456xy789z")
[letter:'a' letter:'j' letter:'k' letter:'l' letter:'x' letter:'y' letter:'z']
>>> letter.findAll("123456789")
[]
>>> letter.replace("a123jkl456xy789z",'?')
'?123???456??789?'
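For comparison only, the same character-class behavior can be reproduced with Python's built-in re module (match, search, findall and sub roughly correspond to pijnu's match, findFirst, findAll and replace):

```python
import re

# A regex character class equivalent to the 'letter' Klass pattern.
letter = re.compile(r"[a-z]")

m = letter.match("cba")                          # anchored at the start
assert m.group(0) == "c"

first = letter.search("123456789uvw987654321")   # like findFirst
assert first.group(0) == "u"

assert letter.search("123456789%%%987654321") is None

assert letter.findall("a123jkl456xy789z") == list("ajklxyz")

assert letter.sub("?", "a123jkl456xy789z") == "?123???456??789?"
```

The results mirror the pijnu session above; the point of pijnu is that the pattern notation stays readable once patterns get composed.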
The following trial should fail.
>>> letter.match('123456789')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/spir/prog/pijnu/library/pattern.py", line 181, in match
return self._check(source, 0)
File "/home/spir/prog/pijnu/library/pattern.py", line 408, in _check
raise result
pijnu.library.error.MatchFailure:
*********************************
Match failure for pattern
letter:[a..z]
in source text at location
index #0 line 1 character 1
123456789
^
Cannot find any char member of class letter:[a..z].
*********************************
Now, another kind of pattern.
>>> keyword = getPattern('keyword: "FOOZZZ"')
... creating pattern from
keyword: "FOOZZZ"
--> Word keyword:"FOOZZZ"
>>> keyword.word
'FOOZZZ'
>>> keyword.findAll("123FOOZZZ456FOOZZZ789")
[keyword:'FOOZZZ' keyword:'FOOZZZ']
>>> keyword.replace("123FOOZZZ456FOOZZZ789", " ;-) ")
'123 ;-) 456 ;-) 789'
This pattern is of the type Word: it should match only a literal, constant bit of text, id est a word. Of course, this word can hold any character.
Now, let us complicate a little bit the format: we will define a choice and a sequence.
>>> ch = getPattern('ch: [a..z] / "FOOZZZ"')
... creating pattern from
ch: [a..z] / "FOOZZZ"
--> Choice ch:[a..z] / "FOOZZZ"
>>> ch.patterns
[Klass <?>:[a..z], Word <?>:"FOOZZZ"]
>>> ch.findAll("123x456FOOZZZ789")
[<?>:'x' <?>:'FOOZZZ']
>>> seq = getPattern('seq: "FOOZZZ" [a..z]')
... creating pattern from
seq: "FOOZZZ" [a..z]
--> Sequence seq:"FOOZZZ" [a..z]
>>> seq.patterns
[Word <?>:"FOOZZZ", Klass <?>:[a..z]]
>>> seq.findAll("FOOZZZ123a456FOOZZZz789")
[seq:[<?>:'FOOZZZ' <?>:'z']]
A Choice pattern succeeds when meeting a snippet of text that matches either possibility specified by sub-patterns. Actually, there may be more than two. Its code is a slash '/'; if you're used to regular expressions, forget the sign '|' for a while.
A Sequence pattern succeeds when meeting a snippet of text that matches both sub-patterns in a row. Again, there may be more than two. Its code is a simple space ' '.
Note that sub-patterns are here anonymous, which is denoted by the '<?>' question marks. In a real grammar, we could -- and probably would -- have given them a name.
As of now, we have played with single patterns, letting them match source text snippets. They let you check source strings, extract parts of them, perform replacements. They act exactly like regular expression patterns, except the grammar is (hopefully ;-) much nicer. Still, pijnu patterns, like regexes, have limitations: mainly, they do not let you split a complex pattern into simpler ones, meaning write a grammar.
It is time now to cope with the grammar for a list of numbers. Composing it will consist in defining a set of related patterns. The whole grammar will actually be a hierarchical system of patterns.
First, a number is made out of digits in the range of 0..9. We'll note that as follows.
digit : [0..9]
An integer, here unsigned, is composed of a sequence of digits, at least one.
integer : digit+
The sign '+' denotes one or more repetition(s).
Then, a real number is like two integers separated by a fractional point.
real : integer '.' integer?
As you see, sub-patterns put one after the other express a sequence.
Here, I chose to allow reals not to have any fractional part, by using '?'.
A number is either a real or an integer.
number : real / integer
where '/' expresses a choice. If you're used to regex, forget the sign '|' for a while.
Now, a series of numbers may be expressed as:
numbers : number (' ' number)*
Unlike '+', '*' is a repetition with possibly 0 occurrence; in other words, an optional repetition, or "zero, one, or more". What is repeated here is a mini-sequence of a space separator, followed by a number.
The whole pattern is thus a way to express a space-separated list of numbers.
The full grammar then looks like this:
numbers
<definition>
SPACE : ' '
POINT : '.'
digit : [0..9]
integer : digit+
real : integer POINT integer?
number : real / integer
numbers : number (SPACE number)*
Notes
- 'numbers' on top is the title of the grammar.
- The definition part starts with <definition>. This probably means there may be other parts ;-)
- Definitions for constant patterns are good practice for clarity.
I wrote above that a regex cannot be split, and a grammar instead is a composition of patterns. Now, we could indeed write the grammar as a single pattern: just need to recursively replace each pattern name by its definition. Right? Just do it for fun...
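Doing that inlining mechanically, and translating each construct to its regex counterpart (a sketch in modern Python; pijnu classes become regex classes, '?' and '*' keep their meaning, choice '/' becomes '|'), gives roughly:

```python
import re

# digit   -> [0-9]
# integer -> [0-9]+
# real    -> [0-9]+\.[0-9]*        (integer POINT integer?)
# number  -> real|integer
# numbers -> number( number)*
numbers = re.compile(r"(?:[0-9]+\.[0-9]*|[0-9]+)(?: (?:[0-9]+\.[0-9]*|[0-9]+))*")

assert numbers.fullmatch("123 4. 5.67") is not None
assert numbers.fullmatch("123a45") is None
```

It works, but every name is gone: the single pattern tells you nothing about integers, reals or separators. That is what composition buys.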
Advantages of composition are
- simplicity
- flexibility
- naming!
- testing
- evolution
Open your favorite editor, copy & paste the grammar, give it whatever name you like and save it. I'll call the grammar numbersGrammar.pijnu.
Note: if pijnu is not in a directory your PYTHONPATH knows, you'll have to put the grammar in the same directory.
Instead of using the command-line, which is not very handy for text edition, we will imagine that the parsing task is part of a larger application we are developing -- which indeed rather matches real use.
So, open a new Python source file. We will create & use the parser from a Python script. Here is what we need to code.
- import pijnu's parser generator called makeParser
- create a parser module & get the parser object
- use it to check it works on an example source
Here is my version of this process.
# get the parser
from pijnu import makeParser
numbersGrammar = file("numbersGrammar.pijnu").read()
numbersParser = makeParser(numbersGrammar)
print numbersParser
# use it
source = "123 4. 5.67"
numbersParser.test(source)
Notes
- numbersGrammar is a text, meaning a string for Python. For such a little language, we could write the grammar directly in code.
- The parser object is called *Parser, where * is the title of the grammar: in our case we get "numbersParser".
- The parser module will have the same name as the object, plus a '.py' extension: "numbersParser.py".
Then, we test the parser against an example source. test is a method that will (try and) perform a match, and additionally:
- output the result and some more info in a legible form (feedback welcome),
- in case of failure, catch the exception, print it, and go on instead of halting,
- allow you to specify another matching method.
You should get the following output:
... parsing grammar ...
... composing code ...
... writing file ...
<'numbers' Parser object -- from file 'numbersParser.py'>
---------------------------------
pattern: numbers:number (SPACE number)*
source: '123 4. 5.67 -1'
method: matchTest
RESULT:
numbers:
integer:123
<?>:
<?>:
SPACE:
real:
integer:4
POINT:.
<?>:
SPACE:
real:
integer:5
POINT:.
integer:67
---------------------------------
Awa! Complicated? Don't worry, the result will soon look much simpler. Anyway, if you're very smart, you'll figure out that this is exactly what you asked for ;-).
For another output format, change the source or simply add the line.
print numbersParser.match("111 1. 1.11")
This will print out.
numbers:[integer:'111' <?>:[<?>:[SPACE:' ' real:[integer:'1' POINT:'.']] <?>:[SPACE:' ' real:[integer:'1' POINT:'.' integer:'11']]]]
It's exactly the same result; simply the tree view is somewhat nicer to my eyes.
Some comments on this result:
- Each node item is tagged using the pattern's name: in fact, the tag defines the type of node resulting from a match found in source.
- On the right side of a tag is the value of every node.
- The simple leaf nodes, such as integers, are the ones that actually eat characters in source. Their value is originally the snippet of text matched.
- Some nodes, called branches, are compound. They are generated by sequential or repetitive patterns. Here, number and real nodes are branches. In pijnu, their value is set to the sequence of child nodes.
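The leaf/branch duality can be sketched with a toy node class (this is illustration only, not pijnu's actual Node type; the repr format merely imitates pijnu's flat view):

```python
# Toy sketch: a node is either a leaf holding a text snippet,
# or a branch holding its child nodes as its value.
class Node:
    def __init__(self, tag, value):
        self.tag = tag
        self.value = value      # str for a leaf, list of Node for a branch

    def is_branch(self):
        return isinstance(self.value, list)

    def __repr__(self):
        if self.is_branch():
            return "%s:[%s]" % (self.tag, " ".join(map(repr, self.value)))
        return "%s:%r" % (self.tag, self.value)

real = Node("real", [Node("integer", "4"), Node("POINT", ".")])
assert real.is_branch()
assert repr(real) == "real:[integer:'4' POINT:'.']"
```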
So what are those '<?>', now? They are default tags for anonymous patterns. To understand this, just watch again the definition of numbers:
numbers : number (SPACE number)*
Here are 2 unnamed patterns: "(SPACE number)*" and "SPACE number". The first <?> stands for the repetition "(SPACE number)*"; further <?> stand for an instance of "SPACE number". Right?
Well, probably, this result does not really look like what you expected. The issue is that the constraints of parsing are different from the needs of further use. For instance, you need to specify delimiters that later are just dead weight and complicate the tree structure.
pijnu provides means to address this problem. Let us apply transformations to the grammar, then run again and watch the new result:
numbersTransform
<toolset>
def toNumber(node):
node.value = float(node.value)
<definition>
SEP : ' ' : drop
DOT : '.'
digit : [0..9]
integer : digit+
real : integer DOT integer? : join
number : real / integer : toNumber
addedNum : SEP number
numbers : number (addedNum)* : intoList
Change numbersGrammar to numbersTransformGrammar in the testing code. An execution with this new version should output:
---------------------------------
pattern: numbers:number (addedNum)*
source: '123 4. 5.67 -1'
method: matchTest
RESULT:
numbers:
integer:123.0
real:4.0
real:5.67
---------------------------------
Much nicer. Try and use this grammar with other sources if you like.
As said, the former result is usually called a parse tree or concrete syntax tree. What we get now is rather close to an abstract syntax tree, or AST, that better reflects the meaningful structure and content of the original source; and is closer to what we need for further output or process.
Now, how does this magic work?
First, if you think very hard, you will note that a sequence of items, such as our numbers, is composed of one initial item followed by a sub-sequence of additional items, each of which is preceded by a separator.
So the original result looks like this:
[item [[sep item] [sep item] [sep item]...]]
To get what we want, we need first to drop the (now useless) separators.
[item [[item] [item] [item]...]]
Now, we use a builtin transformation called intoList to get the expected result out of that:
[item item item item...]
Note: actually, we could have left the separators in, for intoList detects and throws them away anyway.
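The reshaping that an intoList-style transform performs can be sketched in plain Python (a toy illustration under my own assumptions, not pijnu's implementation):

```python
# Flatten [item, [[sep, item], [sep, item], ...]] into [item, item, ...],
# dropping separator entries along the way.
SEP = " "

def into_list(result):
    first, rest = result
    items = [first]
    for pair in rest:
        items.extend(x for x in pair if x != SEP)
    return items

raw = [1, [[" ", 2], [" ", 3], [" ", 4]]]
assert into_list(raw) == [1, 2, 3, 4]
```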
Another transformation used is join, which glues together the leaf nodes of a branch into a single string. So for a real number
real:[integer:[digit:'5'] DOT:'.' integer:[digit:'6' digit:'7']]
we get the string:
real:5.67
Which is the same as the original source, indeed; but in other cases child nodes may have been transformed before: for instance, integers may be converted to binary numbers. Integers do not need to be joined explicitly, for they actually are String-s (see section "klass & string" in part "practical use").
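A join-style transform can likewise be sketched on nested lists of snippets (toy code, not pijnu's):

```python
# Glue the leaf strings of a nested match result into one string.
def join(node):
    if isinstance(node, list):
        return "".join(join(child) for child in node)
    return node

real = [["5"], ".", ["6", "7"]]    # integer DOT integer?
assert join(real) == "5.67"
```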
Now remains the case of toNumber: in addition to built-in transformations, pijnu lets you write your own custom ones directly, in the <toolset> section of the grammar. As you can see, they are called automatically with the node itself as single parameter. Avoid confusing a node with its value: nodes contain all additional information and all processing methods you may ever need -- so I guess ;-).
While transformations are usually very simple & short functions -- typically 1 to 3 lines of code --, they perform highly valuable actions on the results: see below the section "let pijnu do the job".
Once you get the parser object, you're not obliged to use it as a whole. Instead, all its patterns are available as attributes & can match independently: this is obviously great for testing. First, let's use its top pattern -- actually the last one.
numbersTransformParser.numbers.test("123 4. 5.67")
==>
<same result>
Simply because a parser is equivalent to its top pattern. Actually, when asked for a match, a parser will simply transfer the request to its top pattern.
Now let's use integer.
numbersTransformParser.integer.test("987")
==>
---------------------------------
pattern: integer::digit+
source: '987'
method: matchTest
RESULT:
integer:987
---------------------------------
Right, but not very impressive. Let's change the method:
numbersTransformParser.integer.test("987 *** 654 \t\n 321", 'findAll')
==>
---------------------------------
pattern: integer::digit+
source: '987 *** 654 \t\n 321'
method: findAll
RESULT:
<seq>
integer:'987'
integer:'654'
integer:'321'
---------------------------------
findAll scans the whole source for matches anywhere.
What happens when a match fails? We get a pijnu custom exception of type MatchFailure that brings hopefully clear feedback to the developer:
numbersTransformParser.digit.match("*6")
==>
pijnu.library.error.MatchFailure:
*********************************
Match failure for pattern
digit:[0..9]
in source text at location
index #0 line 1 character 1
*6
^
Cannot find any char member of class digit:[0..9].
Yes, the error message is rather verbose ;-) But this one was short, when compared to the following one:
numbersTransformParser.number.match("*6")
==>
pijnu.library.error.MatchFailure:
*********************************
Match failure for pattern
number:real / integer
in source text at location
index #0 line 1 character 1
*6
^
Cannot match any pattern in choice.
(wrapped pattern errors below)
Match failure for pattern
real:integer DOT integer?
in source text at location
index #0 line 1 character 1
*6
^
Cannot match all patterns in sequence.
(wrapped pattern error below)
Match failure for pattern
integer:digit+
in source text at location
index #0 line 1 character 1
*6
^
Cannot find minimal number of characters (1)
members of class digit:[0..9].
Match failure for pattern
digit:[0..9]
in source text at location
index #0 line 1 character 1
*6
^
Cannot find any char member of class digit:[0..9].
*********************************
OK -- let's decode.
The point is that number is a complex pattern that expresses a choice (real or integer), of which one possibility (real) is itself a sequence (of integral part, dot, fractional part).
Each "wrapper" pattern (choice, option, sequence, or repetition) outputs its own accurate message, followed by sub-pattern errors:
- A choice tells why every option failed.
- An option tells its sub-pattern failed, and why.
- A sequence tells which pattern failed, and why.
- A repetition tells it could not match the minimum number of repetitions(s) required, and why.
Now, read the output again. (feedback welcome)
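The layered reports can be mimicked with ordinary Python exceptions; a toy sketch (all names invented, not pijnu's actual error machinery): each wrapper catches the inner failure and appends the wrapped pattern's report below its own message.

```python
class MatchFailure(Exception):
    pass

def match_digit(text):
    # terminal pattern: consumes one character of class [0..9]
    if not text[:1].isdigit():
        raise MatchFailure("Cannot find any char member of class digit:[0..9].")
    return text[0]

def match_integer(text):
    # repetition wrapper: needs at least one digit, reports why it failed
    try:
        return match_digit(text)
    except MatchFailure as e:
        raise MatchFailure(
            "Cannot find minimal number of characters (1)\n"
            "(wrapped pattern error below)\n" + str(e))

try:
    match_integer("*6")
    report = ""
except MatchFailure as e:
    report = str(e)

assert "minimal number" in report
assert "digit:[0..9]" in report
```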
Sometimes, the last lines will tell you exactly what you need to know. But this is not always true, especially when choices are involved. The suite of messages mirrors the logic the parser followed to try & match -- so that you can precisely figure out where and why you got an error.
How to correct, now? Indeed, the error is often hidden in pattern formats; but sometimes your test source instead is erroneous! Also, do not forget to test sources that are expected to fail, meaning negative tests that show your patterns correctly exclude wrong sources, especially border cases.
While speaking of testing, this will be an intensive activity when setting up grammars. pijnu provides several helpers to cope with this: see also the section on "practical use".
Imagine you want to test the digit pattern. It would be nice to feed the pattern with a whole load of test strings, then check all results at once.
Instead of writing e.g. a digitSeq pattern, with these bloody separators, only for that test, use the findAll method. You don't even have to give its name as a parameter to test; use it directly:
print numbersTransformParser.digit.findAll("9&8 7#6:5 \t\n 4321")
==>
[digit:'9' digit:'8' digit:'7' digit:'6' digit:'5' digit:'4' digit:'3' digit:'2' digit:'1']
Another solution for testing can be to use the special methods called *Test. The basic parse, match, findFirst methods try to match and fail with an error. The corresponding *Test methods (matchTest, parseTest, findTest) catch and output exceptions, then go on instead of stopping. This is very handy for test suites:
sources = ["123", "123a45", "123456789"]
for source in sources:
print numbersTransformParser.integer.parseTest(source)
==>
integer:123
*********************************
Parse failure for pattern
integer::digit+
Cannot match whole of source text.
Matching stopped at location
index #3 line 1 character 4
123a45
^
Partial result is
integer:'123'
*********************************
None
integer:123456789
While the second sample failed because the parser could not match the whole source, the test process goes on so that we get feedback on every sample case. This is especially appropriate because a typical parse test suite includes match trials that are intended to fail! We will come back to this later...
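The parseTest idea, catching each failure and moving on to the next sample, can be sketched in plain Python (parse_int is a stand-in for a real parse, not a pijnu API):

```python
# Catch each failure, record it, and keep going through the whole suite.
def parse_int(source):
    if not source.isdigit():
        raise ValueError("cannot match whole of source text: %r" % source)
    return int(source)

results = []
for source in ["123", "123a45", "123456789"]:
    try:
        results.append(parse_int(source))
    except ValueError:
        results.append(None)      # record the failure, do not stop

assert results == [123, None, 123456789]
```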
Another method to get feedback for test suites is testSuiteDict. Using it would produce the following result:
test_suite_dict = {
"123" : 123,
"123a" : None,
"123456789" : 123456789,
}
About test suites, pijnu provides handy methods to directly build and run them -- see section on testing in the "practical use" part.
One can easily push transformations to get the final result expected. For instance, if the grammar parsed additions, one could get the result of the addition, instead of a representation of an operation, using such a transform (provided numbers are converted to ints or floats):
def doAdd(node):
node.value = node[0].value + node[1].value
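To see doAdd at work outside pijnu, here is a toy illustration, assuming a minimal node object whose value is the sequence of its (already numeric) child nodes and which supports indexing:

```python
# Minimal stand-in for a pijnu node: indexable, with a .value attribute.
class Node:
    def __init__(self, value):
        self.value = value
    def __getitem__(self, i):
        return self.value[i]

def doAdd(node):
    # replace the branch's value by the sum of its two children's values
    node.value = node[0].value + node[1].value

addition = Node([Node(1.1), Node(2.2)])
doAdd(addition)
assert abs(addition.value - 3.3) < 1e-9
```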
Try and modify the grammar so that it is able to parse a single addition. Run it first without any transformation for the addition itself: this should output for instance:
addition:
number:1.1
operator:+
number:2.2
Or else, if you dropped the operator:
addition:
number:1.1
number:2.2
Remember you can use findAll or testSuiteDict, instead of test, to scan several additions.
Then set doAdd or a similar func to the addition pattern and run again.
Final result may look like below.
# simplistic addition grammar
addition
<toolset>
def toNumber(node):
node.value = float(node.value)
def doAdd(node):
node.value = node[0].value + node[1].value
<definition>
# constants
SEP : ' '
DOT : '.'
ADD : '+' : drop
# operands
digit : [0..9]
integer : digit+
real : integer DOT integer? : join
number : real / integer : toNumber
# addition
addition : number ADD number : doAdd
additionGrammar = file("additionGrammar.pijnu").read()
additionParser = makeParser(additionGrammar)
sources = "1+1 -1+1 123.34+1 1.23+987.65".split()
additionParser.testSuiteDict(sources)
test_suite_dict = {
'1+1' : 2.0,
'-1+1' : None,
'123.34+1' : 124.34,
'1.23+987.65' : 988.88,
}
pijnu is mainly composed of a parsing library on one side, and a parser generator on the other. Hopefully some information on how it is designed and works will help you use it better. Also, it will allow you to bring me valuable feedback ;-)
You will probably be lost somewhere in the middle of this section, especially if you're not familiar with parsing already. No problem: this will not prevent you from using pijnu at all. Come back later if you like.
The library is highly inspired by the design of pyparsing [pyparsing.wikispaces.com/ home page]. I take the opportunity to thank Paul McGuire for his help and encouragement at the starting times of this project.
The library simply holds types and functions used by generated parsers:
- an overall Pattern type; and specific pattern types for each typical kind of pattern defined by pijnu's grammar (Word, Klass, Repetition, Choice, Sequence & more...)
- a Node type for parse results (and full trees)
- a Parser type that mainly wraps a grammar's top pattern for convenience
- custom exceptions -- highly informative as you've seen already ;-)
- various & efficient testing methods, for in-progress grammars/parsers or parts of them
- a builtin toolset of transform funcs
- a builtin toolset of pre-process funcs
- builtin constant patterns for common chars, klasses & helpers (FIXME? GARBAGE?)
- configuration settings (TODO: integrate into grammar)
For the sake of information -- and fun:
As you may imagine, the generator indeed uses pijnu itself to parse user grammars written in pijnu, with a kind of meta-grammar; and applies simple transforms to relevant nodes that rewrite them into Python code. (Actually, transforms are so practical that the generator itself is ~ 15 lines long.)
Also, pijnu is 'bootstrapped', meaning that the meta-grammar itself is produced using the generator -- a famous story about eggs and hens ;-)
I tried to illustrate these meta-stories.
text.lang   [langParser.py]  ==> textTree  [process.py]   ==> <result>
                  ^
                  |
                  +--------------------------------------------------+
                                                                     |
lang.pijnu  [pijnuParser.py] ==> langTree  [generator.py] ==> <langParser>
                  ^
                  |
                  +--------------------------------------------------+
                  |                                                  |
                  v                                                  |
pijnu.pijnu [pijnuParser.py] ==> pijnuTree [generator.py] ==> <pijnuParser>
- first line: parsing & processing a source written in a user language
- second line: parser generation for a user language from a grammar in pijnu
- third line: generation of pijnu's own meta-parser from its own grammar in pijnu
Chronologically speaking, things go backwards: we need first a pijnuParser used by the generator. This will allow producing a user parser. Then only can we apply this parser to sources. If you do not fully get this at once: don't worry (the same for me ;-).
Language is a term I use here in a wide sense; actually, what is described by a grammar can well be a very simple format such as the one for a list of numbers.
Now, the grammar is composed of patterns. These patterns are written according to another grammar, describing another language called pijnu. Right? (The reason why we can use pijnu's library to parse sources written in pijnu, indeed!)
But a parser code is also composed of patterns. Yes, it's abusing the word 'pattern' to use it for distinct things ;-). Now, instead of being written in pijnu, the parser's patterns are written in Python. But they carry the same semantics. So you can well consider pijnu as a kind of specialized, parsing-oriented, programming language.
Let's have a look at numbersGrammar, again:
SPACE : ' '
POINT : '.'
digit : [0..9]
integer : digit+
real : integer POINT integer?
number : real / integer
numbers : number (SPACE number)*
We could well get rid of SPACE, and stop after digit, or integer, or number. Agreed? We would then simply have a smaller grammar, right? So that we can state that a pattern is simply a kind of mini-grammar, that implicitly holds wrapped sub-patterns. For instance, integer wraps digit; and number wraps integer and real. Conversely, a grammar is a kind of maxi-pattern, that explicitly holds patterns it depends on.
The same in code: each pattern is a mini-parser & a parser is a maxi-pattern. The reason why matching with a parser brings the same result as matching with the parser's top pattern (here the one special pattern called 'numbers').
Moreover: results mirror this duality. Originally (before any transform), a real node is a sequence:
real:[integer:'98' DOT:'.' integer:'765']
# or in tree view:
real:
   integer:98
   DOT:.
   integer:765
If we use a real number parser, or the real pattern of our numbersParser, the result above will be the whole result tree: a node is a mini-tree. Conversely, a tree is a maxi-node. pijnu is internally built that way.
We have evoked already the distinction between branch nodes (like real) and leaf nodes (like digit).
First, this reflects only the original situation: when a real node gets join-ed, for instance, it becomes indeed a terminal leaf node, meaning it now has a single value and does not hold child nodes any more.
Also note again that leaf nodes are yielded by terminal patterns that really consume characters from the source: they are the 'Word' and 'Klass' pattern types. In pijnu, terminal patterns are a bit more diverse -- see below inside the "practical use" section.
Conversely, there are wrapper patterns (please tell me whether 'wrap' & 'wrapper' are proper words for this notion -- more lexical feedback welcome, too ;-). Wrapper patterns do not actually match, but instead delegate the job to wrapped patterns. We have seen sequences, repetitions, options, and choices, which all are wrappers.
- A sequence or a repetition is sequential by nature: it can only yield a branch node -- which may well hold a single item, or even be empty.
- A choice or an option tries to match a single wrapped pattern, then returns its result if successful, be it a leaf or a branch node.
As a summary:
pattern
terminal --> leaf nodes
Word: "abc"
Klass: [a..z]
wrapper
single --> leaf or branch nodes
Option: integer?
Choice: real / integer
compound --> branch nodes
Sequence: key ':' value
Repetition: letter+
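The taxonomy above can be sketched as a handful of toy combinator classes (a simplification of what the library does; the names match the table, but the match(text, pos) API is invented for illustration, and Option is omitted for brevity):

```python
# Toy PEG combinators: match(text, pos) returns (node, new_pos) or None.
class Klass:                              # terminal -> leaf node
    def __init__(self, chars): self.chars = chars
    def match(self, text, pos):
        if pos < len(text) and text[pos] in self.chars:
            return text[pos], pos + 1
        return None

class Word:                               # terminal -> leaf node
    def __init__(self, word): self.word = word
    def match(self, text, pos):
        if text.startswith(self.word, pos):
            return self.word, pos + len(self.word)
        return None

class Sequence:                           # compound wrapper -> branch node
    def __init__(self, *patterns): self.patterns = patterns
    def match(self, text, pos):
        nodes = []
        for p in self.patterns:
            r = p.match(text, pos)
            if r is None:
                return None
            node, pos = r
            nodes.append(node)
        return nodes, pos

class Choice:                             # single wrapper -> leaf or branch
    def __init__(self, *patterns): self.patterns = patterns
    def match(self, text, pos):
        for p in self.patterns:
            r = p.match(text, pos)
            if r is not None:
                return r
        return None

class Repetition:                         # compound wrapper -> branch node
    def __init__(self, pattern, num_min=1):
        self.pattern, self.num_min = pattern, num_min
    def match(self, text, pos):
        nodes = []
        r = self.pattern.match(text, pos)
        while r is not None:
            node, pos = r
            nodes.append(node)
            r = self.pattern.match(text, pos)
        if len(nodes) < self.num_min:
            return None
        return nodes, pos

digit = Klass("0123456789")
integer = Repetition(digit)
real = Sequence(integer, Word("."), Repetition(digit, num_min=0))
number = Choice(real, integer)

assert number.match("5.67", 0) == ([["5"], ".", ["6", "7"]], 4)
assert number.match("123", 0) == (["1", "2", "3"], 3)
```

Note how the Choice result is a branch when real matched and a flat repetition when only integer did, exactly the leaf-or-branch behavior described above.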
Have you had a look at the generated numbersTransformParser.py code file? You should! It reads like that:
""" numbersTransform
<definition>
SEP : ' ' : drop
DOT : '.'
digit : [0..9]
integer : digit+ : join
real : integer DOT integer? : join
number : real / integer : toReal
addedNum : SEP number : liftNode
numbers : number (addedNum)* : extract
"""
from pijnu.library import *
numbersTransformParser = Parser()
state = numbersTransformParser.state
### title: numbersTransform ###
### <toolset>
def toReal(node):
node.value = float(node.value)
### <definition>
SEP = Char(' ', expression="' '",name='SEP')(drop)
DOT = Char('.', expression="'.'",name='DOT')
digit = Klass('0123456789', expression='[0..9]',name='digit')
integer = Repetition(digit, numMin=1,numMax=False, expression='digit+',name='integer')(join)
real = Sequence([integer, DOT, Option(integer, expression='integer?')], expression='integer DOT integer?',name='real')(join)
number = Choice([real, integer], expression='real / integer',name='number')(toReal)
addedNum = Sequence([SEP, number], expression='SEP number',name='addedNum')(liftNode)
numbers = Sequence([number, Repetition(addedNum, numMin=False,numMax=False, expression='(addedNum)*')], expression='number (addedNum)*',name='numbers')(extract)
numbersTransformParser._recordPatterns(vars())
numbersTransformParser._setTopPattern("numbers")
numbersTransformParser.grammarTitle = "numbersTransform"
numbersTransformParser.filename = "numbersTransformParser.py"
I'll let you parse the code yourself ;-)
Notes
- You've got a copy of the original grammar's definition section. (Handy for checking if ever...)
- The first code line obviously instantiates the parser object to be imported from client code. Several lines at the end specify it.
A parser is defined as follows:
def __init__(self, scope=None,
topPatternName=None,
grammarTitle=None,
filename=None):
scope is used to explore the scope of the caller in search of Pattern objects, hence it is set to vars(). The main purpose is to record patterns as parser attributes so that the user can invoke them conveniently.
- The top pattern is always the last one in the grammar. This is because patterns are objects referring to their wrapped sub-patterns. So that they must come in order. Like in Python code. Which the grammar becomes.
Let's use the command-line again:
>>> from numbersTransformParser import *
>>> print integer, type(integer)
integer:digit+ <class 'pijnu.library.pattern.String'>
>>> print number
number:real / integer
To better understand pijnu's internals and make sure that the parser is only a translation from pijnu to Python, we'll tweak the code itself. Let's change it so that it parses simple integer additions.
- get rid of spacing for the sake of grammar clarity
- forget reals
- define addition of two operands
My version is the following:
# .......
additionParser = Parser()
state = additionParser.state
### title: addition ###
### <toolset>
def doAdd(node):
(n1,n2) = (node[0].value,node[1].value)
node.value = int(n1) + int(n2)
### <definition>
# constants
SEP = Char(' ', expression="' '",name='SEP')(drop)
ADD = Char('+', name='ADD')(drop)
# operand
digit = Klass('0123456789', expression='[0..9]',name='digit')
integer = Repetition(digit, numMin=1,numMax=False, expression='digit+',name='integer')(join)
# operation
addition = Sequence([integer, ADD, integer], name="addition")#(doAdd)
additionParser._recordPatterns(vars())
additionParser._setTopPattern("addition")
additionParser.grammarTitle = "addition"
additionParser.filename = "additionParser.py"
Just write your own version, save it as additionParser.py, append the lines below, and run:
sources = "1+1 -1+1 12334+1 123+98765".split()
additionParser.testSuiteDict(sources)
==>
test_suite_dict = {
'1+1' : [integer:'1' integer:'1']
'-1+1' : None
'12334+1' : [integer:'12334' integer:'1']
'123+98765' : [integer:'123' integer:'98765']
}
Now, include a func to perform the addition (in my case: uncomment the end of the definition of the pattern addition) and run again:
==>
test_suite_dict = {
'1+1' : 2
'-1+1' : None
'12334+1' : 12335
'123+98765' : 98888
}
Go back to the command line and import -- or do the following trials by modifying the code and running each time:
>>> from additionParser import *
>>> addition.match("22+333")
addition:355
*** same as parsing with the parser object
>>> integer.match("987654321???fooooooo")
integer:'987654321'
*** we can match with lower-level patterns as well
>>> integer.match('-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/spir/prog/pijnu/pattern.py", line 119, in match
return self._check(source, 0)
File "/home/spir/prog/pijnu/pattern.py", line 236, in _check
raise result
error.Failure:
*********************************
Match failure for pattern
integer::digit+
in source text at location
index #0 line 1 character 1
-1
^
Cannot match 1 time(s) pattern: digit::[0..9].
(wrapped pattern error below)
Match failure for pattern
digit::[0..9]
in source text at location
index #0 line 1 character 1
-1
^
Cannot find char member of class digit.
*********************************
*** we forgot the sign: match failure with extensive output ;-)
>>> integer.findAll("123 +#*/ 456 plinkow 789 ")
[integer:'123' integer:'456' integer:'789']
*** findAll method
>>> integer.replace("123 +-*/ 456 plinkow 789" ,"###")
'### +-*/ ### plinkow ###'
*** replace method
>>> addition.findAll("1+2 \t 3+4 *** 9+8")
[addition:3 addition:7 addition:17]
*** yop!
In this section, we will explore further match methods, pattern types, testing & transformations.
As seen before, `pijnu` provides several matching methods, for a given pattern as well as for a whole parser.
Here they are:
- `match(source)`: try & match the pattern against the start of the source
- `parse(source)`: try & match the pattern against the whole source
- `findFirst(source)`: try & match the pattern anywhere inside the source
- `findAll(source)`: collect matches for the pattern everywhere inside the source
- `replace(source, newstring)`: replace matched occurrences of the pattern -- this uses `findAll`
You may wonder why there isn't any more sophisticated replacement method allowing further manipulation, especially one letting you specify a function, as with regexes: you do not need one at all. That's precisely what transformations do, in a much more powerful and flexible way, and what they are for. For instance, you could easily replace all decimal numbers in a text by their binary expression (just do it as an exercise...).
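For comparison only, here is a plain-regex sketch of what the decimal-to-binary exercise boils down to; in pijnu, the same effect falls out of attaching a transformation to the number pattern and then calling replace:

```python
import re

# Regex-style replacement with a function, shown only for comparison:
# in pijnu, a transformation attached to the number pattern does this.
def to_binary(match):
    return bin(int(match.group(0)))[2:]   # '10' for 2, '101' for 5, ...

print(re.sub(r"\d+", to_binary, "values: 2 5 10"))
# -> values: 10 101 1010
```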
`match`, `parse`, and `findFirst` have their `matchTest`, `parseTest`, and `findTest` testing equivalents. The only difference is that these will not stop on a failure; instead they just output the error message, so that the testing process can go on.
This is indeed intended to let the user easily write and run whole test series that won't stupidly halt on the first exception raised. You will have test cases that fail because not all issues are solved yet -- then it's good to get feedback for several examples. You will have test suites that intentionally include samples against which matching should not succeed, especially border cases.
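The behaviour can be sketched in plain Python (hypothetical names, not pijnu's actual implementation): a *Test method wraps the matching call, reports any failure, and lets the series continue:

```python
# Conceptual sketch of what a *Test method does (hypothetical names,
# not pijnu's actual implementation): report the failure and carry on.
def match_test(match_func, source):
    try:
        return match_func(source)
    except Exception as error:
        print(error)     # show the failure message...
        return None      # ...but let the test series continue

def strict_int(s):
    if not s.isdigit():
        raise ValueError("no match for %r" % s)
    return int(s)

results = [match_test(strict_int, s) for s in ["12", "-1", "7"]]
# -> [12, None, 7]: the failing sample did not halt the series
```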
Alternatively, use `test` to get a summary output as a reminder. This method is nice for test suites that remain in final code in anticipation of further maintenance, evolution, or extension:
numbersParser.integer.test("123")
==>
---------------------------------
pattern: integer:: digit+
source: '123'
method: matchTest
RESULT:
integer:
digit:1
digit:2
digit:3
---------------------------------
Remember that `test` lets you specify which method to use for matching:
numbersParser.integer.test("123 456 789", 'findAll')
==>
---------------------------------
pattern: integer:: digit+
source: '123 456 789'
method: findAll
RESULT:
<seq>
integer:[digit:'1' digit:'2' digit:'3']
integer:[digit:'4' digit:'5' digit:'6']
integer:[digit:'7' digit:'8' digit:'9']
---------------------------------
Some specific pattern kinds require additional information.
A self-recursive pattern is one that calls itself; mutually recursive patterns are ones that call each other. They indeed form a special case of grammar.
For some reason, `pijnu` needs to know about recursion. As of now, there is no automatic detection yet (not that difficult to add). A user needs to denote recursive patterns using `'@'` at the very start of the transformation column.
Here are some examples:
### self recursion
# alternative to repetition
l : 'a'
x : (l x) / l : @
# wrapping inside (optional) (nestable) parents
y : '(' y ')' / l : @
### mutual recursion
# operation using parent grouping
grp : '(' oper ')' : @
oper : (group / num) '*' (group / num) : @
# idem, with prioritized operations
mult : (group / n) '*' (group / mult / n): @
add : (mult / n) '+' (add / mult / n)
group : '(' (add / mult) ')' : @
The first example is a well-known alternative to repetitive patterns. Note that the recursive call absolutely must be on the right side; otherwise matching will fall into an infinite recursion. (Try to understand why, figuring out the matching process.) Formally, though, it would also be correct on the left side. `pijnu` may detect such cases in the future.
The following wouldn't run:
x : (x l) / l : @
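To see why, here is a plain-Python sketch (not pijnu's matcher) of what a naive recursive-descent matcher does with this left-recursive rule: matching x starts by matching x again at the very same position, so nothing is ever consumed:

```python
# Sketch of a naive recursive-descent matcher for the left-recursive rule
#   x : (x l) / l
# To match x, it first tries the alternative (x l), which starts by
# matching x again at the SAME position: no progress, no way out.
def match_x(source, pos, depth=0):
    if depth > 100:              # stand-in for the real stack overflow
        raise RecursionError("no progress: still at position %d" % pos)
    # first alternative (x l): re-enter x at the very same position
    return match_x(source, pos, depth + 1)

try:
    match_x("aaa", 0)
except RecursionError as error:
    print(error)                 # -> no progress: still at position 0
```

With the recursive call on the right side instead, each call consumes one 'a' first, so the recursion bottoms out at the end of the source.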
Also note that there may well be transformations after the `@` sign:
inlineText : (styledText / rawText)+ : @ join
A word (expressed inside double quotes) matches a literal, meaning a constant token. Most often it is used for keywords, operators, separators, delimiters, or any other kind of fixed code or tag in the language. A word can indeed hold a single character, so that strictly speaking we would not need any special character pattern type at all.
The difference is rather semantic. Actually, a character (expressed using single quotes) has to be thought of as a so-called "terminal" item of the language; say, an atom. Not only is a character a terminal, but it can well have several uses & meanings. For instance:
DASH : '-'
MINUS_SIGN : DASH
SUBSTRACT_OP: DASH
WORD_BINDING: DASH
Here DASH
is a terminal atom, while the other patterns are literal tokens. If '-'
had a single use instead, we would rather express it directly; and it should be, semantically speaking, a single-char word:
MINUS_SIGN : "-"
Classes express a choice among a set of characters. Strings express the repetition of such a class. I hope the difference between a word and a string is obvious now, because I'm unable to make it clearer ;-)
Any repetition of a class automatically yields a string pattern instead of a general repetition:
validChar : [\x20..\x7e]
line : validChar+ # ==> String
text : (line '\n')+ # ==> Repetition
Aside from machine-time performance (which is not among my concerns for now), the main advantage is that instead of yielding a branch node whose value is a sequence of individual characters, the pattern will directly bring you a leaf node holding a single string -- which is indeed what you expect in 99.999% of cases -- AFAIK.
In a hopefully near future, classes and characters will be compoundable. Clearly, this means that, provided the individual patterns below are klasses and characters,
identifier : letter / UNDERSCORE / digit / DOLLAR / POUND
will be rewritten as a single klass [a..zA..Z 1..9 _$#]
.
This is another good reason for having strings as a special case of repetition, and for having characters exist aside single-char words.
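The planned compounding can be sketched with plain Python sets (illustration only, not pijnu objects):

```python
import string

# Sketch of the planned class compounding (plain Python sets standing in
# for pijnu klasses): the choice of klasses and chars above collapses
# into one combined charset.
letter = set(string.ascii_letters)          # [a..zA..Z]
digit = set("123456789")                    # [1..9], as in the example
extras = set("_$#")                         # UNDERSCORE, DOLLAR, POUND

identifier_chars = letter | digit | extras  # the single merged klass
print('_' in identifier_chars, '0' in identifier_chars)   # -> True False
```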
In standard PEG, there is a very special class `'.'` inherited from regex syntax. It reads "any char". While heavily used in regexes, which can only express single patterns, I do not find it useful in more complex grammars. The issue is that usually not just any char is a valid char. Most often we need instead to explicitly define the set of valid characters, so that `'.'` should match only those.
pijnu
used to have for '.' an AnyChar
pattern type that had a special klass
attribute, in fact a config parameter. It will probably be reintroduced when configuration can be defined by the user inside the grammar.
Still, there is another limitation: in many cases, we have several classes of valid characters depending on the context, so that this `AnyChar.klass` param needs to be properly updated.
There are sign repetitions and numbered repetitions.
pat*
pat+
pat{3}
pat{3..7}
Note: actually, as of now, they are all matched by a single pattern type:
Repetition(pattern, numMin, numMax)
For `'*'`, both `numMin` and `numMax` are `False`, while for `'+'`, `numMin` is 1. For a single number, `numMin=numMax` indeed.
Possibly in the future, there may be distinct pattern types (and as a consequence distinct matching methods) for each kind of repetition.
Also, `{m..}` (with `numMax=False`) and `{..n}` (with `numMin=False`) may be introduced if ever the need arises.
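The common semantics of all repetition kinds can be sketched as one function (illustration only, not pijnu's Repetition class):

```python
# Sketch of a single repetition matcher covering *, +, {n} and {m..n}:
# numMin/numMax set to False mean "no bound", as described above.
def match_repetition(predicate, source, pos, numMin=False, numMax=False):
    count, start = 0, pos
    while pos < len(source) and predicate(source[pos]):
        pos += 1
        count += 1
        if numMax is not False and count == numMax:
            break
    if numMin is not False and count < numMin:
        return None                      # not enough matches: failure
    return source[start:pos]             # the matched slice

print(match_repetition(str.isdigit, "123ab", 0))                     # '*'
print(match_repetition(str.isdigit, "abc", 0, numMin=1))             # '+'
print(match_repetition(str.isdigit, "12345", 0, numMin=3, numMax=3)) # {3}
```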
... are the pillars of any parsing language ;-) A sequence pattern expresses a simple ordered composition (= put side by side). There may be in the future a kind of Set pattern type for unordered composition.
In a PEG-like grammar, a choice is prioritized. This means that, given a pattern "a / b", when a succeeds, b will not even be tried. This maps well to the algorithmic nature of a syntax that expresses recognition patterns, and is for this reason perfectly adapted to parsing (remember we saw that the grammar and the parser carry the same meaning). BNF and regex are instead generative syntaxes. In a regex pattern such as "a|b", both a and b are tried and, by default, the longest match wins! Be careful of this major difference if you are used to regexes or BNF.
Priority requires particular attention when matches for one pattern may start like matches for the other pattern. In this case, the longer matches will be masked by the shorter ones. For instance:
number : integer / real
is wrong because integer will always match, so that real will never be tested, and the higher-level pattern (e.g. operation) will fail on any '.'
inside a real number. Just swap the choice:
number : real / integer
See also a common gotcha below.
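The masking effect is easy to reproduce with a plain-Python sketch of ordered choice (regexes stand in here for the integer and real patterns):

```python
import re

# Sketch of PEG ordered choice: the first alternative that succeeds
# wins, so the shorter 'integer' masks 'real' unless 'real' comes first.
def match_integer(s):
    m = re.match(r"\d+", s)
    return m.group(0) if m else None

def match_real(s):
    m = re.match(r"\d+\.\d*", s)
    return m.group(0) if m else None

def choice(alternatives, s):
    for alt in alternatives:
        result = alt(s)
        if result is not None:
            return result          # later alternatives are never tried
    return None

print(choice([match_integer, match_real], "12.5"))   # -> 12 (real masked!)
print(choice([match_real, match_integer], "12.5"))   # -> 12.5 (right order)
```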
- `'&'` expresses a so-called "positive lookahead", named more shortly "Next" in `pijnu`.
- `'!'` expresses a so-called "negative lookahead", named "NextNot" in `pijnu`.
Their common sense is to test whether a specific pattern is present, or not, next to the current position in source:
NEWLINE : '\n'
ENDOFLINE : &NEWLINE
NOTENDOFLINE: !NEWLINE
- They do not consume any character, id est the position does not change.
- They do not return any value, meaning that the node's value is in fact nil -- and will be dropped in a higher level node.
- The only relevant thing is to know whether they matched or not.
I personally almost never use lookaheads, and I'm waiting for people to point me to common cases where they are necessary.
Still, there is a case where we absolutely need them, which is a kind of gotcha. Imagine you have a repetition (maybe a string) followed by another pattern in sequence. In some naughty cases, the next pattern may be eaten by the repetitive one. Aaargh! The only solution is to include a NextNot expression for the other pattern inside the repetition. Examples:
pat : [a..z A..Z]+ "STOP"
# Will never match! Write instead:
pat : (!"STOP" [a..z A..Z])+ "STOP"
pat : ("X" / "Y" / "Z")* "X"{3}
# Will never match! Write instead:
pat : (!"X"{3} ("X" / "Y" / "Z"))* "X"{3}
Ugly, no? For this reason, pijnu
included for a while an "Until" or "StopRepetition" pattern kind that implemented the proper match semantics and could be written as:
pat : [a..z A..Z]+>"STOP"
Read: Match letters until you step on "STOP".
pat : ("X" / "Y" / "Z")*>"X"{3}
Read: Match X Y or Z until you step on 3 X-es.
Looks nice? It's been removed for the sake of simplicity, just because the need for it seems to be rare. It may come back later.
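The gotcha itself is easy to reproduce in a plain-Python sketch (illustration only): a greedy letter repetition happily eats "STOP", unless a NextNot-style guard checks the position first:

```python
# Sketch of the gotcha: a greedy letter repetition consumes "STOP"
# itself, so the trailing literal can never match. The guarded version
# checks, before each character, that "STOP" does not start here --
# exactly what the !"STOP" lookahead does.
def match_letters_then_stop(source, guarded):
    pos = 0
    while pos < len(source) and source[pos].isalpha():
        if guarded and source.startswith("STOP", pos):
            break                      # the NextNot guard: stop repeating
        pos += 1
    return source.startswith("STOP", pos)  # now try the "STOP" literal

print(match_letters_then_stop("abcSTOP", guarded=False))  # -> False
print(match_letters_then_stop("abcSTOP", guarded=True))   # -> True
```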
How to use pijnu
transforms so that you do not even need to process the results further?
Just for information: Transformations are easy and powerful because they apply on specific node types, at any depth level in the tree. Each transformation needs only cope with a single node, according to what its parent expects, and can assume to receive data from child nodes in the most adequate format. Node transformations proceed from the simplest terminal bits up to the whole tree step-by-step. For this reason we could call this recursive ascent processor ;-), as opposed to recursive descent parser. In comparison, transforming unprocessed compound nodes is always trouble.
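The recursive ascent idea can be sketched on a hand-built tree (plain tuples standing in for pijnu nodes):

```python
# Sketch of "recursive ascent" processing: each transformation handles
# one node kind and can trust its children to already carry processed
# values. Tuples stand in for pijnu nodes: (kind, child, child, ...).
def transform(node):
    if isinstance(node, str):          # leaf: a digit character
        return int(node)
    kind, children = node[0], [transform(c) for c in node[1:]]
    if kind == "integer":              # children are already ints
        value = 0
        for d in children:
            value = value * 10 + d
        return value
    if kind == "addition":             # children are already integers
        return sum(children)

tree = ("addition", ("integer", "1", "2"), ("integer", "3"))
print(transform(tree))                 # -> 15
```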
TODO
We want a parser for wiki-like inline text. Actually, this would be part of a larger grammar. Our aim is not only to parse it, in fact, but to get the x/html equivalent directly!
A simple version's specification can be:
- Valid characters are ASCII 'black' characters, plus space.
- There can be raw or 'styled' text.
- Styled text comes in two kinds (see grammar).
- Every styled section maps to an x/html 'span' section
The format of an x/html span is: <span class="style_name">...content text...</span>
These sections would themselves map to style rendering definitions, typically in CSS.
How easy it is: one single 3-line func. (Actually, it could be a single, still legible, line ;-)
wikInlineGrammar = r"""
wikInline
<toolset>
def unescape(node):
node.value = node[1].value
def styledSpan(node):
klass = node.typ
text = node.value
node.value = '<span class="%s">%s</span>' %(klass,text)
<definition>
# codes
ESCAPE : '~'
DISTINCT : "//" : drop
IMPORTANT : "**" : drop
styleCode : (DISTINCT / IMPORTANT)
# character expression
escChar : ESCAPE ('*' / '!' / '/' / ESCAPE) : unescape
validChar : [\\x20..\\xff !!/!*~]
rawText : (escChar / (!styleCode validChar))+ : join
# text kinds
distinctText : DISTINCT inlineText DISTINCT : liftValue
importantText : IMPORTANT inlineText IMPORTANT : liftValue
styledText : distinctText / importantText : styledSpan
inlineText : (styledText / rawText)+ : @ join
"""
# parser object
makeParser(wikInlineGrammar)
from wikInlineParser import wikInlineParser
# test
text = "~~ ~*__~*~*__**~***__**~*~***__**X****X**"
wikInlineParser.test(text)
text = "aaa//bbb//!!ccc!!**ddd**//eee!!fff**ggg**hhh!!iii//jjj"
wikInlineParser.test(text)
text = "1//2!!3//!!4" ### should fail
wikInlineParser.test(text,"parseTest")
I'll let you explore the grammar, as well as the results of runs, and do your own trials.
Notes
- You can structure the grammar using blank lines, comments and indentation.
- There are few tests, but they try to catch possible errors on 'naughty' cases ;-)
- Coding chars `(/*!)` need to be escaped, but there are alternative schemes.
- The 'unescape' func is not really needed (how else can we do it?).
As another example, let us parse and process `pijnu`'s own format for character classes. The goal is to get as a result a fully expanded and 'excluded' charset of valid characters.
To simplify the problem just a little bit, we will do this only for the same set of 'inline' characters as in the previous wiki grammar. (Mainly to avoid codes such as `'\t'` -- but you can add this case as an exercise ;-) (You may also do it for all Unicode chars, to be ready for the next version of `pijnu` ;-)
klassesGrammar = r"""
klasses
<toolset>
def charFromEsc(node):
node.value = node[0].value
def charFromDec(node):
ord = int(node.value)
node.value = chr(ord)
def charFromHex(node):
ord = int(node.value, 16)
node.value = chr(ord)
def charsetFromRanj(node):
(n1,n2) = (ord(node[0].value),ord(node[1].value))
chars = [chr(n) for n in range(n1, n2+1)]
node.value = ''.join(chars)
def excludedCharset(node):
(inCharset,exCharset) = (node[0].value,node[1].value)
chars = [c for c in inCharset if c not in exCharset]
node.value = ''.join(chars)
<definition>
# coding
LBRACKET : '[' : drop
RBRACKET : '\]' : drop
SEP : " " : drop
EXCLUSION : "!!" : drop
RANGE : ".." : drop
ESC : "\\" : drop
DEC : "\\" : drop
HEX : "\\x" : drop
DECNUM : [0..9]
HEXNUM : [0..9 a..eA..E]
# character expression
safeChar : !EXCLUSION [\x20..\x7e !!\]] : liftNode
escChar : ESC '\]' : charFromEsc
decChar : DEC DECNUM{3} : join charFromDec
hexChar : HEX HEXNUM{2} : join charFromHex
char : hexChar / decChar / escChar / safeChar
# klass format
ranj : char RANGE char : charsetFromRanj
klassItem : SEP / ranj / char
charset : klassItem+ : join
exclCharset : charset EXCLUSION charset : excludedCharset
klass : LBRACKET (exclCharset / charset) RBRACKET : liftValue
"""
makeParser(klassesGrammar)
from klassesParser import klassesParser
text = r"""[] [abc!'"\]] [a..e] [abcxyz \097..\101] [\x61..\x65 1..9]"""
print klassesParser.findAll(text)
text = r"""[!!] [abc!!] [a!!b!!c] [abc!!b] [a..c 1..9 !'\]" !!\'b 2..8]"""
print klassesParser.findAll(text)
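When a transform misbehaves, it often helps to exercise its core logic outside the parser. The range expansion performed by charsetFromRanj, for instance, can be checked standalone:

```python
# The range-expansion logic used by charsetFromRanj, checked standalone:
# expand the two bounding characters into every character in between.
def expand_range(lo, hi):
    return ''.join(chr(n) for n in range(ord(lo), ord(hi) + 1))

print(expand_range('a', 'e'))               # -> abcde
print(expand_range(chr(0x61), chr(0x65)))   # \x61..\x65 -> abcde
```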
Now, imagine range expansion does not bring the expected result. There is an issue with the pattern, the transformation func, or both!
A really practical way to debug is simply to add output inside the transform. Basically, `print node` at the start and end of the func will usually give you all the info you need.
An alternative is to use the built-in `debugOutput` transform. This one does not change the node; instead it outputs it in a rather verbose manner (*). Try it on one or two patterns.
(*) It uses the Node.WHOLE_INFO
config parameter. TODO: let users specify config inside the grammar. For now, you can set them in code -- but they're not yet documented ;-)
testDict = numbersTransformParser.number.testSuiteDict(sources)
numbersTransformParser.number.testSuite(testDict)
==>
test_suite_dict = {
'1' : 1.0
'-1' : None
'1.' : 1.0
'12.345' : 12.345000000000001
'-1.1' : None
}
*** test suite passed ***
As shown above, you can use the dict as an argument to testSuite. Actually, testSuiteDict should be run once everything works fine; then testSuite will check that modifications do not alter functionality.
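The regression idea can be sketched in plain Python, with a hypothetical parse function standing in for the parser:

```python
# Sketch of the testSuite idea with a hypothetical stand-in for the
# parser: record the expected results once, then assert they still hold.
def parse_number(s):
    if s.startswith('-'):
        return None              # stands in for the match failure above
    try:
        return float(s)
    except ValueError:
        return None

expected = {'1': 1.0, '-1': None, '1.': 1.0, '12.345': 12.345}
for source, value in expected.items():
    assert parse_number(source) == value   # fails loudly if behavior drifts
print("*** test suite passed ***")
```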