Simple PEG parser builder written on Python.
There are many parser generators for python but each of 'em has IMO some disadvantages: e.g. not suitable for stream processing, or does not allow to track parsing and understand where it was failed, or does not support mode w/o skipping whitespaces, or has ugly rules syntax, or does not allow to attach semantic action in-place...
"Parsed" is inspired by "arpeggio" (http://arpeggio.googlecode.com/). This parser generator is interesting and there was the chance I'll start to use it but... maybe I am too lazy to study the code and modify it to correspond my needs or, most probably, it is interesting to write own parser... so I wrote One More parser generator.
To create parser and use predefined rules one need to import parsed package.
from parsed import *
Parser generator function accepts no parameters and is decorated by @rule, it should return a rule, e.g.:
@rule
def abc(): return char('abc') > value
As expected result of calling a generator function is a parser. Parser gets its name from generator function name. This is why lambdas etc. are not used. To correctly parse iterable source one should wrap it with the source() function.
parser = abc()
src = 'a'
position, result = parser.parse(src)
if result != nomatch:
print "MATCH", result, "@", position
else:
print "NO MATCH"
-
against single char
#single char, ignore by default @rule def is_A(): return char('A')
-
against char from iterable
#matches against LF or CR, on match by default returns #matched character @rule def vspace(): return char('\r\n')
-
against predicate
import string #boolean predicate to test a char for some condition def __is_punct(c): return c in string.punctuation #any char matching predicate, return a value by default @rule def is_punctuation(): return char(__is_punct)
Matching sequence of rules, using operator '+':
#hash symbol('#'), followed by any 'abcABC'
@rule
def hashed_abc(): return char('#') + 'abcABC'
Short circuiting 'OR':
#hashed_abc from the example above OR any char from 'abc' set
@rule
def hashed_abc_or_abc(): return hashed_abc | 'abc'
#any character except 'a', does not consume input
@rule
def not_a(): return ~char('a')
#character 'a' repeated 1 or more times
@rule
def one_or_more_a(): return char('a')[1:]
#character 'a' repeated 0 or more times
@rule
def zero_or_more_a(): return char('a')[0:]
#character 'a' or its absence
@rule
def maybe_a(): return char('a')[0:1]
Lookahead can be expressed or negating (adding prefix '-' operator) a rule or by appending rule to sequence using bitwise AND ('&') operator.
#character 'a', matches only if followed by any character from 'abc'
#set, does not consume input
@rule
def a_before_abc(): return char('a') + -char('abc')
#the same as above, in this case lookahead match is always
#excluded from parsing results
@rule
def also_a_before_abc(): return char('a') & char('abc')
#extract a list of characters from double quoted string
#consisting from 'abc' characters
@rule
def dquoted_abc(): return '"' + char('abc')[1:] + '"' > first
#compose 2 characters in string like "{CHAR1}&{CHAR2}" if first is
#from 'abc' set and second - from 'def'. lambda should be enclosed in
#braces because it has lowest precendence
@rule
def abc_def(): return char('abc') + 'def' > (lambda x: x[0] + '&' + x[1])
#lookahead result can be also included in parsing results if
#lookahead declared using prefix '-' but not binary '&'
@rule
def a_before_abc(): return char('a') + (-char('abc') > value)
All examples above can be run and tested by running examples/readmy.py
TODO
nomatch #means rule is not matched
empty #end of input (eof) or ignored result
TODO
list2str #string from list
first, second #first/second element of list
nth(N) #returns action extracting Nth list element
ignore #return empty constant
TODO
-
is_trace: if it is set to True, parsing process will be traced into stderr
-
is_remember: if True --- use memoization
-
use_unicode
-
is_stat: gather matching statistics (now only hits and miss count for each rule)
-
add option to ignore whitespaces
-
add Abstract Syntax Tree visualization
-
parser optimization
-
more unit tests to avoid regressions
-
?..