diff --git a/FromClause.lhs b/FromClause.lhs index 048bde4..3951f41 100644 --- a/FromClause.lhs +++ b/FromClause.lhs @@ -10,8 +10,9 @@ TODO: qualify or add explicit imports > --import Text.Groom (groom) > --import qualified Text.Parsec as P -> import qualified Text.Parsec.String as P -> import Text.Parsec (try,optional, option,choice) +> import Text.Parsec.String (Parser) +> import Text.Parsec.String.Parsec (try) +> import Text.Parsec.String.Combinator (optional, option,choice) > import Control.Applicative ((<$>),(*>),(<*>), (<$)) > --import Control.Monad (void,guard) > --import Debug.Trace @@ -106,7 +107,7 @@ because of all the keywords. TODO: try and simplify this code some more. -> from :: P.Parser [TableRef] +> from :: Parser [TableRef] > from = option [] (try (keyword_ "from") *> commaSep1 tref) > where > tref = choice [try (JoinQueryExpr <$> parens queryExpr) @@ -149,5 +150,5 @@ TODO: try and simplify this code some more. Here is another small helper parser. Having the arguments in this order makes it easy to chain using >>=. -> optionSuffix :: (a -> P.Parser a) -> a -> P.Parser a +> optionSuffix :: (a -> Parser a) -> a -> Parser a > optionSuffix p a = option a (p a) diff --git a/ParseFile.lhs b/ParseFile.lhs index 184161d..2894ea3 100644 --- a/ParseFile.lhs +++ b/ParseFile.lhs @@ -6,7 +6,7 @@ parser given below. > import System.Environment > import Text.Parsec -> import Text.Parsec.String as P +> import Text.Parsec.String > import Control.Monad > main :: IO () @@ -18,7 +18,7 @@ parser given below. This is the parser which you can replace with your own code: -> myParser :: P.Parser () +> myParser :: Parser () > myParser = void $ string "correct" Here is an example of running this program: diff --git a/ParseString.lhs b/ParseString.lhs index 6b325b1..fd42ce1 100644 --- a/ParseString.lhs +++ b/ParseString.lhs @@ -6,7 +6,7 @@ below. > import System.Environment > import Text.Parsec -> import Text.Parsec.String as P +> import Text.Parsec.String > import Control.Monad > main :: IO () @@ -18,7 +18,7 @@ below. This is the parser which you can replace with your own code: -> myParser :: P.Parser () +> myParser :: Parser () > myParser = void $ string "correct" Here is an example of running this program: diff --git a/ParsecExtras.lhs b/ParsecExtras.lhs index eede556..5ca0105 100644 --- a/ParsecExtras.lhs +++ b/ParsecExtras.lhs @@ -21,12 +21,14 @@ means you have to replace it with something more flexible later on. > --import qualified Text.Parsec as P -> import Text.Parsec (oneOf, many1, letter, char, digit, string) -> import qualified Text.Parsec.String as P (Parser) +> import Text.Parsec.String (Parser) +> import Text.Parsec.String.Combinator (many1) +> import Text.Parsec.String.Char (letter, char, digit, string, oneOf) + > import Control.Applicative ((<$>), (<*>), (<*), (<|>), many) > import Control.Monad (void) -> import Control.Monad.Identity (Identity) -> import qualified Text.Parsec.Expr as E + +> import qualified Text.Parsec.String.Expr as E = Text.Parsec.Expr @@ -127,13 +129,13 @@ Here is the abstract syntax type: Here is the new expression parser: -> simpleExpr2 :: P.Parser SimpleExpr2 +> simpleExpr2 :: Parser SimpleExpr2 > simpleExpr2 = E.buildExpressionParser table term -> term :: P.Parser SimpleExpr2 +> term :: Parser SimpleExpr2 > term = var2 <|> num2 -> table :: [[E.Operator String () Identity SimpleExpr2]] +> table :: [[E.Operator SimpleExpr2]] > table = [[prefix "-", prefix "+"] > ,[binary "^" E.AssocLeft] > ,[binary "*" E.AssocLeft @@ -158,13 +160,14 @@ Here is the new expression parser: TODO: expand and explain the bits. -> num2 :: P.Parser SimpleExpr2 +> num2 :: Parser SimpleExpr2 > num2 = Num2 <$> integer -> var2 :: P.Parser SimpleExpr2 +> var2 :: Parser SimpleExpr2 > var2 = Var2 <$> identifier -TODO: write lots of parsing examples, including parse failures with ambiguity. +TODO: write lots of parsing examples, including parse failures with +ambiguity. issue: double prefix op. @@ -181,13 +184,13 @@ parsers instead of writing them by hand. TODO: examples -> whiteSpace :: P.Parser () +> whiteSpace :: Parser () > whiteSpace = void $ many $ oneOf " \n\t" -> integer :: P.Parser Integer +> integer :: Parser Integer > integer = read <$> many1 digit <* whiteSpace -> identifier :: P.Parser String +> identifier :: Parser String > identifier = (:) <$> firstChar <*> many nonFirstChar <* whiteSpace > where > firstChar = letter <|> char '_' diff --git a/ParsingIntroduction.lhs b/ParsingIntroduction.lhs index 96c7635..650200d 100644 --- a/ParsingIntroduction.lhs +++ b/ParsingIntroduction.lhs @@ -1,5 +1,5 @@ -This is an introduction to pasering with Haskell and Parsec. +This is an introduction to parsing with Haskell and Parsec. Prerequisites: you should know some basic Haskell and have the Haskell Platform installed (or GHC + Parsec installed in some other way). @@ -7,18 +7,38 @@ Platform installed (or GHC + Parsec installed in some other way). This tutorial was written using GHC 7.6.3 and Parsec 3.1.3, which are the versions which come with the Haskell Platform 2013.2.0.0. -TODO: Here is some other information on Parsec and Haskell: -links, tutorials on fp, section in rwh, lyah?, old parsec docs, -parsec docs on hackage, other parser combinator libs (uu, trifecta?) +This file is a Literate Haskell file, available here: + + +I recommend you download all the lhs files, and follow along in your +favourite editor, and use ghci to experiment. To download, use: + +``` +git clone https://github.com/JakeWheat/intro_to_parsing.git +``` + +TODO: experiment: create wrapper modules for all the interesting stuff +in parsec, which fixes the parser types to use +Text.Parsec.String.Parser. This should make things much simpler for +beginners, especially reading the code and dealing with error +messages. -This file is a Literate Haskell file, available here: TODO +modules: +parse, ParseError from somewhere +Text.Parsec.Char +Text.Parsec.Combinator -I recommend you download this lhs file, and follow along in your -favourite editor, and use ghci to experiment. +> import Text.Parsec (ParseError) +> import Text.Parsec.String (Parser) + +> import Text.Parsec.String.Parsec (try, parse) + +> import Text.Parsec.String.Char (anyChar, oneOf, char, digit +> ,string, letter, satisfy) + +> import Text.Parsec.String.Combinator (eof,manyTill, many1, choice +> ,anyToken,chainl1, between) -> import qualified Text.Parsec as P -> import Text.Parsec (anyChar, eof, manyTill, oneOf, many1, letter, char, choice, try, satisfy, digit, string, anyToken,chainl1,between) -> import qualified Text.Parsec.String as P (Parser) > import Control.Applicative ((<$>), (<*>), (<$), (<*), (*>), (<|>), many) > import Control.Monad (void, ap) > import Data.Char (isLetter, isDigit) @@ -27,29 +47,16 @@ favourite editor, and use ghci to experiment. The first parser: -> oneChar :: P.Parser Char +> oneChar :: Parser Char > oneChar = anyChar +This tutorial is going to start very slowly. + Whenever we write a parser which parses to a value of type a, we give -it the return type of P.Parser a. In this case, we parse a character -so the return type is P.Parser Char. The P.Parser type is in the +it the return type of `Parser a`. In this case, we parse a character +so the return type is `Parser Char`. The `Parser` type is in the module Text.Parsec.String. We will cover this in more detail later. -You should always use a type signature with these parsers. Because the -Parsec code is really generalized, without the type GHC will refuse to -compile this code. Try commenting out the type signature above and -loading into ghci to see the error message. - -You can get this code to compile without a type signature by using the -NoMonomorphismRestriction language pragma. You can also see the type -signature that GHC will choose for this function by commenting the -type signature and using -Wall and -XNoMonomorphismRestriction -together. It's up to you whether you prefer to always write type -signatures when you are developing parsing code, or use the -NoMonomorphismRestriction pragma. If you make a mistake, when using -type signatures you usually get much more easier to understand -compiler error messages. - Let's use this parser. Change to the directory where you saved this .lhs file, and run ghci. Then type in ':l ParsingIntroduction.lhs'. You can run the parser using a wrapper, enter the following at the ghci @@ -57,7 +64,7 @@ prompt: 'regularParse oneChar "a"'. Here is a transcript: -~~~~ +``` $ ghci GHCi, version 7.6.3: http://www.haskell.org/ghc/ :? for help Loading package ghc-prim ... linking ... done. @@ -68,31 +75,37 @@ Prelude> :l ParsingIntroduction.lhs Ok, modules loaded: Main. *Main> regularParse oneChar "a" Right 'a' -~~~~ +``` Here are some examples of running this parser on various input: -~~~~ +``` *Main> regularParse oneChar "a" Right 'a' + *Main> regularParse oneChar "b" Right 'b' + *Main> regularParse oneChar "0" Right '0' + *Main> regularParse oneChar " " Right ' ' + *Main> regularParse oneChar "\n" Right '\n' + *Main> regularParse oneChar "aa" Right 'a' + *Main> regularParse oneChar "" Left (line 1, column 1): unexpected end of input + *Main> regularParse oneChar " a" Right ' ' -~~~~ +``` -TODO: show in table form as well. You can see that if there are no characters, we get an error. Otherwise, it takes the first character and returns it, and throws @@ -102,58 +115,88 @@ regularParse will come later. Here are two alternatives to regularParse you can also use for experimenting for the time being: -~~~~ +``` *Main> regularParse oneChar "a" Right 'a' + *Main> parseWithEof oneChar "a" Right 'a' + *Main> parseWithLeftOver oneChar "a" Right ('a',"") + *Main> *Main> regularParse oneChar "" Left (line 1, column 1): unexpected end of input + *Main> parseWithEof oneChar "" Left (line 1, column 1): unexpected end of input + *Main> parseWithLeftOver oneChar "" Left (line 1, column 1): unexpected end of input + *Main> regularParse oneChar "aa" Right 'a' + *Main> parseWithEof oneChar "aa" Left (line 1, column 2): unexpected 'a' expecting end of input + *Main> parseWithLeftOver oneChar "aa" Right ('a',"a") -~~~~ +``` -TODO: put in table as well -more examples to make the behaviour more clear. +TODO: more examples to make the behaviour more clear. You can use these functions and ghci to experiment. Try running all the parsers in ghci on various input strings as you work through the -document to get a good feel for all the different features. You can -also write the parsers inline in the function call, for example: +document to get a good feel for all the different features. Tip: you +can also write the parsers inline in the function call, for example: -~~~~ +``` *Main> regularParse (many1 digit) "1" Right "1" + *Main> regularParse (many1 digit) "122" Right "122" -~~~~ +``` This can be used to quickly try out new ad hoc parsers. += Type signatures + +TODO: update with new wrapper types which mean all this nonsense can +be delayed to much later. + +I think you should always use type signatures with Parsec. Because the +Parsec code is really generalized, without the type GHC will refuse to +compile this code. Try commenting out the type signature above and +loading into ghci to see the error message. + +There is an alternative: you can get this code to compile without a +type signature by using the NoMonomorphismRestriction language +pragma. You can also see the type signature that GHC will choose for +this function by commenting the type signature and using -Wall and +-XNoMonomorphismRestriction together. + +It's up to you whether you prefer to always write type signatures when +you are developing parsing code, or use the NoMonomorphismRestriction +pragma. Even if you can use NoMonomorphismRestriction, when using +explicit type signatures you usually get much simpler compiler error +messages. + = Text.Parsec.Char Let's go through some of the functions in Text.Parsec.Char module from the Parsec package. The haddock is avaialble here: -(http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html). +. -~~~~ +``` satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char -~~~~ +``` This is the main primitive function in Parsec. This looks at the next character from the current input, and if the function (Char -> Bool) @@ -161,28 +204,31 @@ returns true for this character, it 'pops' it from the input and returns it. The current position in the input string is tracked behind the scenes. -Using our simplified parser type - P.Parser - the satisfy function's +Using our simplified parser type - `Parser` - the satisfy function's type can be written like this: -~~~~ -satisfy :: (Char -> Bool) -> P.Parser Char -~~~~ +``` +satisfy :: (Char -> Bool) -> Parser Char +``` TODO: a few examples Tip: if you look at the docs on hackage -(http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html), +, you can view the source. Most of the functions in Text.Parsec.Char are -straightforward. You can see that the satisfy function is a little -more primitive than the other functions from the source. I'm going to -skip the explanation for the implementation of satisfy for now. +straightforward. I recommend you look at the source for all of these +functions. + +You can see in the source that the satisfy function is a little more +primitive than the other functions. We're going to skip the explanation +for the implementation of satisfy for now. Here is the parser we used above in the oneChar parser, with a simplified type: -~~~~ +``` anyChar :: Parser Char -~~~~ +``` If you look at the source via the haddock link above, you can see it uses satisfy. @@ -192,19 +238,19 @@ validation functions. The char parses a specific character which you supply: -~~~~ +``` char :: Char -> Parser Char -~~~~ +``` TODO: examples These parsers all parse single hardcoded characters -~~~~ +``` space :: Parser Char newline :: Parser Char tab :: Parser Char -~~~~ +``` They all return a Char. You should be able to guess what Char they return, you can double check your intuition using ghci. @@ -212,7 +258,7 @@ return, you can double check your intuition using ghci. These parser all parse one character from a hardcoded set of characters: -~~~~ +``` upper :: Parser Char lower :: Parser Char alphaNum :: Parser Char @@ -220,18 +266,19 @@ letter :: Parser Char digit :: Parser Char hexDigit :: Parser Char octDigit :: Parser Char -~~~~ +``` In these cases, the return value is less redundant. -TODO: examples, with wrong chars, multiple chars as well as matches +TODO: a few examples, with wrong chars, multiple chars as well as +matches oneOf and noneOf parse any of the characters in the given list -~~~~ +``` oneOf :: [Char] -> Parser Char noneOf :: [Char] -> Parser Char -~~~~ +``` TODO: examples @@ -241,9 +288,9 @@ string matches a complete string, one character at a time. We will skip the explanation of the implementation for now. TODO: implement string directly in terms of satisfy, and quickcheck it. -~~~~ +``` string :: String -> Parser String -~~~~ +``` TODO: examples @@ -251,9 +298,9 @@ TODO: examples here is another slight extension, which uses a combinator (skipMany) if you look at the source. We will cover this combinator shortly. -~~~~ +``` spaces :: Parser () -~~~~ +``` TODO: examples @@ -262,14 +309,13 @@ TODO: examples Here are two exes which you can use to parse either a string or a file to help you experiment. -TODO: make these into links + -ParseString.lhs + -ParseFile.lhs - -You can experiment using ghci, or with a string on the command line, -or by putting the text to parse into a file and parsing that. +Now you can easily experiment using ghci, or with a string on the +command line, or by putting the text to parse into a file and parsing +that. TODO: transcript of using these exes. @@ -283,6 +329,9 @@ Let's create a very simple expression language: > | Parens SimpleExpr > deriving (Eq,Show) +It's a bit simple and almost useless at the moment, but we will expand +on this a lot in later tutorials. + Here are some examples: > simpleExprExamples :: [(String,SimpleExpr)] @@ -297,7 +346,8 @@ TODO: some more complex examples Let's write a simple parser for these, and introduce a few things along the way. -We can start by writing a parser for each ctor in turn. +We will write a parser for each constructor separately, then look at +how we can write a parser for all of them together. === Num @@ -305,7 +355,7 @@ To parse a number, we need to parse one or more digits, and then read the resulting string. We can use the combinator 'many1' to help with this. We will also use do notation. -> num :: P.Parser SimpleExpr +> num :: Parser SimpleExpr > num = do > n <- many1 digit > return (Num (read n)) @@ -318,9 +368,9 @@ integer using read, and wrap it in a Num constructor. The many1 function's type looks like this: -~~~~ -many1 :: P.Parser a -> P.Parser [a] -~~~~ +``` +many1 :: Parser a -> Parser [a] +``` It applies the parser given one or more times, returning the result. @@ -333,7 +383,7 @@ for a common choice: identifiers must start with a letter or underscore, and then they can be followed by zero or more letters, underscores or digits in any combination. -> var :: P.Parser SimpleExpr +> var :: Parser SimpleExpr > var = do > fc <- firstChar > rest <- many nonFirstChar @@ -347,7 +397,7 @@ letter or underscore, and nonFirstChar = which parses a digit, letter or underscore. This time, we use the 'many' function instead of 'many1': TODO - demonstrate why using examples only. -> add :: P.Parser SimpleExpr +> add :: Parser SimpleExpr > add = do > e0 <- num > void $ char '+' @@ -364,34 +414,34 @@ but supresses a warning which you should get (since you are using -Wall, right?) and I think it is also good style to explicitly say that the result is being ignored. -Tip: to use -Wall in ghci, enter the following at the prompt: +To use -Wall in ghci, enter the following at the prompt: -~~~~ +``` *Main> :set -Wall -~~~~ +``` Try it out, then replace the line -~~~~ +``` void $ char '+' -~~~~ +``` with -~~~~ +``` char '+' -~~~~ +``` And check you see the warning. Another way of avoiding the warning is to write this: -~~~~ +``` _ <- char '+' -~~~~ +``` === parens -> parens :: P.Parser SimpleExpr +> parens :: Parser SimpleExpr > parens = do > void $ char '(' > e <- num @@ -404,9 +454,9 @@ parser again instead. Now we will tackle the whitespace issue. == whitespace and lexeme parsing -Here is a parser which will skip 0 or more whitespace characters. +Here is a parser which will skip zero or more whitespace characters. -> whiteSpace :: P.Parser () +> whiteSpace :: Parser () > whiteSpace = void $ many $ oneOf " \n\t" In the original parsec documentation, one of the concepts mentioned is @@ -417,7 +467,7 @@ whitespace exactly once wherever it needs to be skipped. To complete the lexeme style, we should also always skip leading whitespace at the top level only. -> parseWithWhitespace :: P.Parser a -> String -> Either P.ParseError a +> parseWithWhitespace :: Parser a -> String -> Either ParseError a > parseWithWhitespace p = parseWithEof wrapper > where > wrapper = do @@ -426,53 +476,89 @@ top level only. the wrapper function can also use (>>) to make it a bit shorter: -~~~~ +``` wrapper = whiteSpace >> p -~~~~ +``` Here is a shorter version of this function using (>>): -> parseWithWhitespace' :: P.Parser a -> String -> Either P.ParseError a +> parseWithWhitespace' :: Parser a -> String -> Either ParseError a > parseWithWhitespace' p = parseWithEof (whiteSpace >> p) Here is the num parser rewritten in the lexeme style: -> num' :: P.Parser SimpleExpr -> num' = do +> lexeme :: Parser a -> Parser a +> lexeme p = do +> x <- p +> whiteSpace +> return x + +TODO: review the placement of the function 'lexeme' in all the code +below. Maybe something could be said about all the places that it can +be put. + +> num' :: Parser SimpleExpr +> num' = lexeme $ do > n <- many1 digit -> whiteSpace > return (Num (read n)) -TODO: examples parseWithEof num "1", " 1", "1 ", " 1 ", then -parserWithWhitespace num' +Here it is in action: + +``` +*Main Data.List> parseWithEof num "1" +Right (Num 1) + +*Main Data.List> parseWithEof num " 1" +Left (line 1, column 1): +unexpected " " +expecting digit + +*Main Data.List> parseWithEof num "1 " +Left (line 1, column 2): +unexpected ' ' +expecting digit or end of input + +*Main Data.List> parseWithEof num " 1 " +Left (line 1, column 1): +unexpected " " +expecting digit + +*Main Data.List> parseWithWhitespace num' "1" +Right (Num 1) + +*Main Data.List> parseWithWhitespace num' " 1" +Right (Num 1) + +*Main Data.List> parseWithWhitespace num' "1 " +Right (Num 1) + +*Main Data.Lst> parseWithWhitespace num' " 1 " +Right (Num 1) +``` Here are the other functions in lexeme style. -> var' :: P.Parser SimpleExpr -> var' = do +> var' :: Parser SimpleExpr +> var' = lexeme $ do > fl <- firstChar > rest <- many nonFirstChar -> whiteSpace > return (Var (fl:rest)) > where > firstChar = satisfy (\a -> isLetter a || a == '_') > nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_') -> add' :: P.Parser SimpleExpr +> add' :: Parser SimpleExpr > add' = do > e0 <- num' -> void $ char '+' -> whiteSpace +> void $ lexeme $ char '+' > e1 <- num' > return (Add e0 e1) -> parens' :: P.Parser SimpleExpr +> parens' :: Parser SimpleExpr > parens' = do -> void $ char '(' -> whiteSpace +> void $ lexeme $ char '(' > e <- num' -> void $ char ')' -> whiteSpace +> void $ lexeme $ char ')' > return (Parens e) In this style, you have to be slightly careful to make sure you call @@ -481,7 +567,7 @@ whitespace at the right points. Let's try and implement the simpleExpr parser. We can use a function called 'choice': -> numOrVar :: P.Parser SimpleExpr +> numOrVar :: Parser SimpleExpr > numOrVar = choice [num', var'] It tries each parser one at a time, finishing with the first one that @@ -489,7 +575,7 @@ succeeds. There are some more details about this later on. Here is another way to write the numOrVar parser: -> numOrVar' :: P.Parser SimpleExpr +> numOrVar' :: Parser SimpleExpr > numOrVar' = num' <|> var' In general, you can write 'choice [p0, p1, p2, ...]' as 'p0 <|> p1 <|> @@ -499,7 +585,7 @@ TODO: a bunch of examples Here is the first version of the simpleExpr parser: -> simpleExpr :: P.Parser SimpleExpr +> simpleExpr :: Parser SimpleExpr > simpleExpr = choice [num', var', add', parens'] TODO: a bunch of examples @@ -508,7 +594,7 @@ It works well for some of the parsers, but fails with add': the num' always partially succeeds first, then fails, so the add' is never tried. We can rearrange this parser like this: -> simpleExpr' :: P.Parser SimpleExpr +> simpleExpr' :: Parser SimpleExpr > simpleExpr' = choice [add', num', var', parens'] TODO: examples again @@ -523,7 +609,7 @@ and two char match/no match. Here is one way to fix it: -> simpleExpr'' :: P.Parser SimpleExpr +> simpleExpr'' :: Parser SimpleExpr > simpleExpr'' = choice [try add', num', var', parens'] The try function implements backtracking. When this is used in a @@ -536,7 +622,7 @@ will fail immediately. The same happens with <|>, we can implement the simpleExpr parser like this also: -> simpleExpr''' :: P.Parser SimpleExpr +> simpleExpr''' :: Parser SimpleExpr > simpleExpr''' = try add' <|> num' <|> var' <|> parens' TODO: show the examples all working @@ -544,22 +630,19 @@ TODO: show the examples all working Now we can make 'parens' and 'add' use a general simple expression parser. Parens is simple: -> parens'' :: P.Parser SimpleExpr +> parens'' :: Parser SimpleExpr > parens'' = do -> void $ char '(' -> whiteSpace +> void $ lexeme $ char '(' > e <- simpleExpr''' -> void $ char ')' -> whiteSpace +> void $ lexeme $ char ')' > return (Parens e) There is a problem implementing 'add' in the same way: -> add'' :: P.Parser SimpleExpr +> add'' :: Parser SimpleExpr > add'' = do > e0 <- simpleExpr''' -> void $ char '+' -> whiteSpace +> void $ lexeme $ char '+' > e1 <- simpleExpr''' > return (Add e0 e1) @@ -568,26 +651,25 @@ again. Let's look at another problem: -~~~~ +``` *Main> parseWithWhitespace simpleExpr''' " 1 + 1 + 1" Left (line 1, column 8): unexpected '+' expecting end of input -~~~~ +``` Our parser will only parse one operator, and not a chain of them. Here is one way to solve it: -> simpleExpr4 :: P.Parser SimpleExpr +> simpleExpr4 :: Parser SimpleExpr > simpleExpr4 = do > e <- term > maybeAddSuffix e > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = do -> void $ char '+' -> whiteSpace +> void $ lexeme $ char '+' > e1 <- term > maybeAddSuffix (Add e0 e1) > term = num' <|> var' <|> parens' @@ -603,32 +685,28 @@ can find online, etc.. Here is the all the parser code written out again for clarity. -> numD :: P.Parser SimpleExpr -> numD = do +> numD :: Parser SimpleExpr +> numD = lexeme $ do > n <- many1 digit -> whiteSpace > return $ Num $ read n -> varD :: P.Parser SimpleExpr -> varD = do +> varD :: Parser SimpleExpr +> varD = lexeme $ do > fl <- firstChar > rest <- many nonFirstChar -> whiteSpace > return $ Var (fl:rest) > where > firstChar = satisfy (\a -> isLetter a || a == '_') > nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_') -> parensD :: P.Parser SimpleExpr +> parensD :: Parser SimpleExpr > parensD = do -> void $ char '(' -> whiteSpace +> void $ lexeme $ char '(' > e <- simpleExprD -> void $ char ')' -> whiteSpace +> void $ lexeme $ char ')' > return $ Parens e -> simpleExprD :: P.Parser SimpleExpr +> simpleExprD :: Parser SimpleExpr > simpleExprD = do > e <- term > maybeAddSuffix e @@ -637,8 +715,7 @@ Here is the all the parser code written out again for clarity. > choice [addSuffix e > ,return e] > addSuffix e0 = do -> void $ char '+' -> whiteSpace +> void $ lexeme $ char '+' > e1 <- term > maybeAddSuffix (Add e0 e1) > term = numD <|> varD <|> parensD @@ -646,23 +723,40 @@ Here is the all the parser code written out again for clarity. == Testing with the examples +TODO: write a little manual tester that accepts a parser and a list of +examples, and checks they all parse correctly. + == Testing with quickcheck +Let's see if we can check with quickcheck. It's a bit tricky testing +parsers in this way, but one way to do something useful is to generate +random asts, convert them to concrete syntax, parse them, and check +the result. We can write a simple 'pretty printer' to convert an ast +to concrete syntax. + === a pretty printer +TODO: a really simple pretty printer just pasting strings together, no +layout. + === the quick check code +TODO: write a quickcheck property and arbitary instance and show +running it at the ghci prompt + = tidying things up Now we can go back over the code already written, and make it much -more concise, whilst also trying to make it more readable. +more concise, whilst also trying to make it more readable. We are +going to use the typeclass Applicative and some functions from this +typeclass. Here is the basic pattern behind a lot of this. Here is a function which takes a constructor and two parsers for the two arguments for -the constructor. It parses the two arguements, then applies the +the constructor. It parses the two arguments, then applies the constructor to the results: -> myParser1 :: (a -> b -> c) -> P.Parser a -> P.Parser b -> P.Parser c +> myParser1 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c > myParser1 ctor pa pb = do > a <- pa > b <- pb @@ -672,75 +766,75 @@ TODO: concrete example This can be rewritten without the do syntactic sugar like this: -> myParser2 :: (a -> b -> c) -> P.Parser a -> P.Parser b -> P.Parser c +> myParser2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c > myParser2 ctor pa pb = > pa >>= \a -> pb >>= \b -> return $ ctor a b And this can be rewritten like this: -> myParser3 :: (a -> b -> c) -> P.Parser a -> P.Parser b -> P.Parser c +> myParser3 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c > myParser3 ctor pa pb = ctor `fmap` pa `ap` pb (This uses functions from Applicative instead of Monad). We replace -the use of >>= with 'fmap' and 'ap'. This isn't always possible, but +the use of `>>=` with `fmap` and `ap`. This isn't always possible, but it often is. -Here is the version using the operator versions (fmap changed to <$>, -and ap changed to <*>). These two operators are just alternative +Here is the version using the operator versions (`fmap` changed to `<$>`, +and `ap` changed to `<*>`). These two operators are just alternative spellings of fmap and ap. -> myParser4 :: (a -> b -> c) -> P.Parser a -> P.Parser b -> P.Parser c +> myParser4 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c > myParser4 ctor pa pb = ctor <$> pa <*> pb This style takes less typing, and is often much simpler to write and -read when you are familiar with it. +read. This pattern 'scales', you can use: -~~~~ +``` Ctor <$> pa -~~~~ +``` for a single argument constructor. This might also be familiar to you as -~~~~ +``` fmap Ctor pa -~~~~ +``` or -~~~~ +``` Ctor `fmap` pa -~~~~ +``` All of which mean the same thing, just slightly different spellings. This can also be written using Monad operators: -~~~~ +``` pa >>= liftM ctor -~~~~ +``` or -~~~~ +``` liftM ctor =<< pa -~~~~ +``` -(liftM is in Control.Monad) +(`liftM` is in Control.Monad) -These effectively mean the same thing as the previous versions with -fmap and <$>. +These `liftM` versions effectively mean the same thing as the previous +versions with `fmap` and `<$>`. You can use -~~~~ +``` Ctor <$> pa <*> pb <*> pc -~~~~ +``` -for three args, and so on. So you use <$> between the pure Ctor and -the first arg, then <*> between each subsequent arg. +for three args, and so on. So you use `<$>` between the pure +constructor and the first arg, then `<*>` between each subsequent arg. Let's go over the simple expression parsers and try to rewrite them using this style. We will see a few other new functions. I will take a @@ -750,65 +844,68 @@ lot of time over the changes to the source. Here was the old parser. -> numA0 :: P.Parser SimpleExpr -> numA0 = do +> numA0 :: Parser SimpleExpr +> numA0 = lexeme $ do > n <- many1 digit -> whiteSpace > return $ Num $ read n Let's rewrite it in steps. The first step is to move the 'read' to the -first line using fmap (fmap is spelt <$> here). +first line using `<$>`. Maybe it is too obvious to state explicitly, +but I will do it anyway: you can use any pure function with `<$>` and +not just constructors. -> numA1 :: P.Parser SimpleExpr -> numA1 = do +> numA1 :: Parser SimpleExpr +> numA1 = lexeme $ do > n <- read <$> many1 digit -> whiteSpace > return $ Num n Now we can move the Num constructor: -> numA2 :: P.Parser SimpleExpr -> numA2 = do +> numA2 :: Parser SimpleExpr +> numA2 = lexeme $ do > n <- Num . read <$> many1 digit -> whiteSpace > return n -How can we get rid of the whitespace. Here is an additional operator (<*), which can be used: +How can we get rid of the whitespace. Here is an additional operator +`(<*)`, which can be used: -~~~~ +``` a <* b -~~~~ +``` is equivalent to -~~~~ +``` do x <- a void b return x -~~~~ +``` Here it is in use in the function. -> numA3 :: P.Parser SimpleExpr -> numA3 = do -> n <- Num . read <$> many1 digit <* whiteSpace +> numA3 :: Parser SimpleExpr +> numA3 = lexeme $ do +> n <- Num . read <$> many1 digit > return n Now we can apply the usual monad rewrite laws: -> numA4 :: P.Parser SimpleExpr -> numA4 = Num . read <$> many1 digit <* whiteSpace +> numA4 :: Parser SimpleExpr +> numA4 = lexeme (Num . read <$> many1 digit) In more industrial parser code, I would usually write some 'tokenization' parsers separately like this: -> integerA5 :: P.Parser Integer -> integerA5 = read <$> many1 digit <* whiteSpace +> integerA5 :: Parser Integer +> integerA5 = lexeme (read <$> many1 digit) + +TODO: should this read 'read <$> lexeme (many1 digit)\'? Can this +option possibly matter? Then the num expression parser looks like this: -> numA5 :: P.Parser SimpleExpr +> numA5 :: Parser SimpleExpr > numA5 = Num <$> integerA5 and we get a integer parser which we can reuse if we need to parse an @@ -818,26 +915,24 @@ integer in another context. here is the old var parser: -~~~~ -varD :: P.Parser SimpleExpr -varD = do +``` +varD :: Parser SimpleExpr +varD = lexeme $ do fl <- firstChar rest <- many nonFirstChar - whiteSpace return $ Var (fl:rest) where firstChar = satisfy (\a -> isLetter a || a == '_') nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_') -~~~~ +``` The first thing we can do is to make the firstChar and nonFirstChar a little easier to read: -> varA0 :: P.Parser SimpleExpr -> varA0 = do +> varA0 :: Parser SimpleExpr +> varA0 = lexeme $ do > fl <- firstChar > rest <- many nonFirstChar -> whiteSpace > return $ Var (fl:rest) > where > firstChar = letter <|> char '_' @@ -845,21 +940,21 @@ little easier to read: Now we can lift the (:) using the Applicative operators. -> varA1 :: P.Parser SimpleExpr +> varA1 :: Parser SimpleExpr > varA1 = do > i <- iden > return $ Var i > where -> iden = (:) <$> firstChar <*> many nonFirstChar <* whiteSpace +> iden = lexeme ((:) <$> firstChar <*> many nonFirstChar) > firstChar = letter <|> char '_' > nonFirstChar = digit <|> firstChar And tidy it up using <$> with the Var constructor: -> varA2 :: P.Parser SimpleExpr +> varA2 :: Parser SimpleExpr > varA2 = Var <$> iden > where -> iden = (:) <$> firstChar <*> many nonFirstChar <* whiteSpace +> iden = lexeme ((:) <$> firstChar <*> many nonFirstChar) > firstChar = letter <|> char '_' > nonFirstChar = digit <|> firstChar @@ -869,100 +964,86 @@ It looks almost like a grammar description now. Here is the starting point: -~~~~ -parensD :: P.Parser SimpleExpr +``` +parensD :: Parser SimpleExpr parensD = do - void $ char '(' - whiteSpace + void $ lexeme $ char '(' e <- simpleExprD - void $ char ')' - whiteSpace + void $ lexeme $ char ')' return $ Parens e -~~~~ - -Here you can see that there is a (*>) which works in the opposite -direction to (<*), and that these functions - (<*), (*>) - can be -chained. The precendence of these operators means that we have to use -some extra parentheses (!) here: - -> parensA0 :: P.Parser SimpleExpr -> parensA0 = Parens <$> (char '(' *> whiteSpace *> simpleExprD) -> <* char ')' <* whiteSpace - -Hopefully, learning more about parsing and implementing parsers will -give you a more subtle understanding of when we have to put -parentheses in Haskell code to make it parse correctly. Later, I will -show you how the different precedence of (*>) and (>>) - which -essentially do the same thing - means that in many situations using -(>>) instead of (*>) means you can leave out some parentheses. I'm not -sure if this is clever or not. HLint tells you something different to -received wisdom: HLint says always get rid of redundant parentheses, -and received wisdom says always put them in so you know exactly how -some code parses without having to know the operator precendences... -TODO: rewrite this text it is a mess +``` + +Here you can see that there is a `(*>)` which works in the opposite +direction to `(<*)`. The precendence of these operators means that we +have to use some extra parentheses (!) here: + +> parensA0 :: Parser SimpleExpr +> parensA0 = +> Parens <$> (lexeme (char '(') *> simpleExprD <* lexeme (char ')')) + +TODO: lost the chained <*. Put something below about this so there is +a concrete example == simple expr Here is the old version -~~~~ -simpleExprD :: P.Parser SimpleExpr +``` +simpleExprD :: Parser SimpleExpr simpleExprD = do e <- term maybeAddSuffix e where maybeAddSuffix e = addSuffix e <|> return e addSuffix e0 = do - void $ char '+' - whiteSpace + void $ lexeme $ char '+' e1 <- term maybeAddSuffix (Add e0 e1) term = numD <|> varD <|> parensD -~~~~ +``` Start with the function body, convert to point free style. -> simpleExprA0 :: P.Parser SimpleExpr +> simpleExprA0 :: Parser SimpleExpr > simpleExprA0 = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = do -> void $ char '+' -> whiteSpace +> void $ lexeme $ char '+' > e1 <- term > maybeAddSuffix (Add e0 e1) > term = numD <|> varD <|> parensD Now rewrite the main part of the addSuffix function: -> simpleExprA1 :: P.Parser SimpleExpr +> simpleExprA1 :: Parser SimpleExpr > simpleExprA1 = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = do -> e1 <- char '+' *> whiteSpace *> term +> e1 <- lexeme (char '+') *> term > maybeAddSuffix (Add e0 e1) > term = numD <|> varD <|> parensD now combine the Add ctor call into one line -> simpleExprA2 :: P.Parser SimpleExpr +> simpleExprA2 :: Parser SimpleExpr > simpleExprA2 = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = do -> ae <- Add e0 <$> (char '+' *> whiteSpace *> term) +> ae <- Add e0 <$> (lexeme (char '+') *> term) > maybeAddSuffix ae > term = numD <|> varD <|> parensD now simplify the addSuffix function to make it point free. -> simpleExprA3 :: P.Parser SimpleExpr +> simpleExprA3 :: Parser SimpleExpr > simpleExprA3 = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = -> Add e0 <$> (char '+' *> whiteSpace *> term) +> Add e0 <$> (lexeme (char '+') *> term) > >>= maybeAddSuffix > term = numD <|> varD <|> parensD @@ -973,63 +1054,62 @@ down to match this change. Here is the finished job for all the simple expression code. -> numA' :: P.Parser SimpleExpr -> numA' = Num . read <$> many1 digit <* whiteSpace +> numA' :: Parser SimpleExpr +> numA' = lexeme (Num . read <$> many1 digit) -> varA' :: P.Parser SimpleExpr -> varA' = Var <$> ((:) <$> firstChar <*> many nonFirstChar <* whiteSpace) +> varA' :: Parser SimpleExpr +> varA' = Var <$> lexeme ((:) <$> firstChar <*> many nonFirstChar) > where > firstChar = letter <|> char '_' > nonFirstChar = digit <|> firstChar -> parensA' :: P.Parser SimpleExpr -> parensA' = Parens <$> (char '(' *> whiteSpace *> -> simpleExprA' -> <* char ')' <* whiteSpace) +> parensA' :: Parser SimpleExpr +> parensA' = +> Parens <$> (lexeme (char '(') *> simpleExprA' <* lexeme (char ')')) -> simpleExprA' :: P.Parser SimpleExpr +> simpleExprA' :: Parser SimpleExpr > simpleExprA' = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e > addSuffix e0 = -> (Add e0 <$> (char '+' *> whiteSpace *> term)) >>= maybeAddSuffix +> (Add e0 <$> (lexeme (char '+') *> term)) >>= maybeAddSuffix > term = numA' <|> varA' <|> parensA' Here a version with the separate token parsers. The token parsers: -> identifier :: P.Parser String -> identifier = (:) <$> firstChar <*> many nonFirstChar <* whiteSpace +> identifier :: Parser String +> identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar) > where > firstChar = letter <|> char '_' > nonFirstChar = digit <|> firstChar -> integer :: P.Parser Integer -> integer = read <$> many1 digit <* whiteSpace +> integer :: Parser Integer +> integer = read <$> lexeme (many1 digit) -> symbol :: Char -> P.Parser () -> symbol c = void (char c <* whiteSpace) +> symbol :: Char -> Parser () +> symbol c = void $ lexeme $ char c Here is another little helper function. It barely pays its way in this short example, but even though it is only used once, I think it is worth it to make the code clearer. -> betweenParens :: P.Parser a -> P.Parser a +> betweenParens :: Parser a -> Parser a > betweenParens p = symbol '(' *> p <* symbol ')' The expression parsers: -> numA :: P.Parser SimpleExpr +> numA :: Parser SimpleExpr > numA = Num <$> integer -> varA :: P.Parser SimpleExpr +> varA :: Parser SimpleExpr > varA = Var <$> identifier -> parensA :: P.Parser SimpleExpr +> parensA :: Parser SimpleExpr > parensA = Parens <$> betweenParens simpleExprA -> simpleExprA :: P.Parser SimpleExpr +> simpleExprA :: Parser SimpleExpr > simpleExprA = term >>= maybeAddSuffix > where > maybeAddSuffix e = addSuffix e <|> return e @@ -1043,7 +1123,7 @@ think makes the code a bit simpler to follow. Here is a version of simpleExprA using the chainl1 function from Text.Parsec.Combinator: -> simpleExprAC :: P.Parser SimpleExpr +> simpleExprAC :: Parser SimpleExpr > simpleExprAC = chainl1 term op > where > op = symbol '+' *> return Add @@ -1051,9 +1131,9 @@ Text.Parsec.Combinator: The chainl1 simplified type is like this: -~~~~ +``` Parser a -> Parser (a -> a -> a) -> Parser a -~~~~ +``` You pass it the function to parse a single element, then a parser which parses the operator concrete syntax, and returns a function @@ -1073,9 +1153,9 @@ TODO: maybe write a diagram sequence to illustrate instead. The 'op' function here parses a plus symbol, then returns the Add ctor, whose type signature is what we want: -~~~~ +``` Add :: SimpleExpr -> SimpleExpr -> SimpleExpr -~~~~ +``` If you look at the source of chainl1, it is more or less the code above in simpleExprA factored out as a separate function, but written @@ -1087,23 +1167,24 @@ Now we will go through the Text.Parsec.Combinator module. I have mostly just written the original type signature, the simplified type signature, and reproduced the haddock documentation comment for each parser. Some have additional comments, and each has some example -usage. +usage. You should look at the source for these functions and try to +understand how they are implemented. TODO: examples for each, remove the 'Source', replace the type signatures with simplified ones. -~~~~ +``` choice :: [Parser a] -> Parser a -~~~~ +``` Haddock: choice ps tries to apply the parsers in the list ps in order, until one of them succeeds. Returns the value of the succeeding parser. We've already seen this one in use. -~~~~ +``` count :: Int -> Parser a -> Parser [a] -~~~~ +``` Haddock: count n p parses n occurrences of p. If n is smaller or equal to zero, the parser equals to return []. Returns a list of n values @@ -1111,20 +1192,20 @@ returned by p. TODO: examples -~~~~ +``` between :: Parser open -> Parser close -> Parser a -> Parser a -~~~~ +``` Haddock: between open close p parses open, followed by p and close. Returns the value returned by p. -~~~~ +``` braces = between (symbol "{") (symbol "}") -~~~~ +``` We can replace the betweenParens using this: -> betweenParens' :: P.Parser a -> P.Parser a +> betweenParens' :: Parser a -> Parser a > betweenParens' p = between (symbol '(') (symbol ')') p It hardly seems worth it, but if you are already familiar with @@ -1133,125 +1214,120 @@ when you are in a rush and coming back to code you wrote 3 years ago, for instance. -~~~~ +``` option :: a -> Parser a -> Parser a -~~~~ +``` Haddock: option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p. -The example given is written in a different style to this document: - -~~~~ +``` priority = option 0 (do{ d <- digit ; return (digitToInt d) }) -~~~~ - -Exercise: write the digitToInt function and rewrite this example using -<$>. +``` -~~~~ +``` optionMaybe :: Parser a -> Parser (Maybe a) -~~~~ +``` optionMaybe p tries to apply parser p. If p fails without consuming input, it returns Nothing, otherwise it returns Just the value returned by p. -~~~~ +``` optional :: Parser a -> Parser () -~~~~ +``` optional p tries to apply parser p. It will parse p or nothing. It only fails if p fails after consuming input. It discards the result of p. -~~~~ +``` skipMany1 :: Parser a -> Parser () -~~~~ +``` skipMany1 p applies the parser p one or more times, skipping its result. -~~~~ +``` many1 :: Parser a -> Parser [a] -~~~~ +``` many1 p applies the parser p one or more times. Returns a list of the returned values of p. -~~~~ +``` word = many1 letter -~~~~ +``` -~~~~ +``` sepBy :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` sepBy p sep parses zero or more occurrences of p, separated by sep. Returns a list of values returned by p. -~~~~ +``` commaSep p = p `sepBy` (symbol ",") -~~~~ +``` -~~~~ +``` sepBy1 :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` sepBy1 p sep parses one or more occurrences of p, separated by sep. Returns a list of values returned by p. -~~~~ +``` endBy :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` endBy p sep parses zero or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p. -~~~~ +``` cStatements = cStatement `endBy` semi -~~~~ +``` -~~~~ +``` endBy1 :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` endBy1 p sep parses one or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p. -~~~~ +``` sepEndBy :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` sepEndBy p sep parses zero or more occurrences of p, separated and optionally ended by sep, ie. haskell style statements. Returns a list of values returned by p. -~~~~ +``` haskellStatements = haskellStatement `sepEndBy` semi -~~~~ +``` -~~~~ +``` sepEndBy1 :: Parser a -> Parser sep -> Parser [a] -~~~~ +``` sepEndBy1 p sep parses one or more occurrences of p, separated and optionally ended by sep. Returns a list of values returned by p. -~~~~ +``` chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a -~~~~ +``` chainl p op x parser zero or more occurrences of p, separated by op. Returns a value obtained by a left associative application of all functions returned by op to the values returned by p. If there are zero occurrences of p, the value x is returned. -~~~~ +``` chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a -~~~~ +``` chainl1 p op x parser one or more occurrences of p, separated by op Returns a value obtained by a left associative application of all @@ -1261,40 +1337,40 @@ occurs in expression grammars. TODO: reimplement the Add suffix parser using chainl1 -~~~~ +``` chainr :: Parser a -> Parser (a -> a -> a) -> a -> Parser a -~~~~ +``` chainr p op x parser zero or more occurrences of p, separated by op Returns a value obtained by a right associative application of all functions returned by op to the values returned by p. If there are no occurrences of p, the value x is returned. -~~~~ +``` chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a -~~~~ +``` chainr1 p op x parser one or more occurrences of |p|, separated by op Returns a value obtained by a right associative application of all functions returned by op to the values returned by p. -~~~~ +``` eof :: Parser () -~~~~ +``` This parser only succeeds at the end of the input. This is not a primitive parser but it is defined using notFollowedBy. -~~~~ +``` eof = notFollowedBy anyToken "end of input" -~~~~ +``` The () operator is used for error messages. We will come back to error messages after writing the basic SQL parser. -~~~~ +``` notFollowedBy :: Show a => Parser a -> Parser () -~~~~ +``` notFollowedBy p only succeeds when parser p fails. This parser does not consume any input. This parser can be used to implement the @@ -1304,41 +1380,41 @@ legal identifier character, in which case the keyword is actually an identifier (for example lets). We can program this behaviour as follows: -~~~~ +``` keywordLet = try (do{ string "let" ; notFollowedBy alphaNum }) -~~~~ +``` -~~~~ +``` manyTill :: Parser a -> Parser end -> Parser [a] -~~~~ +``` manyTill p end applies parser p zero or more times until parser end succeeds. Returns the list of values returned by p. This parser can be used to scan comments: -~~~~ +``` simpleComment = do{ string "")) } -~~~~ +``` Note the overlapping parsers anyChar and string "-->", and therefore the use of the try combinator. -~~~~ +``` lookAhead :: Parser a -> Parser a -~~~~ +``` lookAhead p parses p without consuming any input. If p fails and consumes some input, so does lookAhead. Combine with try if this is undesirable. -~~~~ +``` anyToken :: Parser Char -~~~~ +``` The parser anyToken accepts any kind of token. It is for example used to implement eof. Returns the accepted token. @@ -1354,7 +1430,7 @@ parse a keyword and return a no argument constructor: > data Something = Type1 | Type2 | Type3 -> something :: P.Parser Something +> something :: Parser Something > something = choice [Type1 <$ string "type1" > ,Type2 <$ string "type2" > ,Type3 <$ string "type3"] @@ -1373,8 +1449,8 @@ The basic parse function: this is a pretty simple wrapper. The parse function from parsec just adds a filename to use in parse errors, which is set as the empty string here. -> regularParse :: P.Parser a -> String -> Either P.ParseError a -> regularParse p = P.parse p "" +> regularParse :: Parser a -> String -> Either ParseError a +> regularParse p = parse p "" 'parse' is a basic function in the family of functions for running parsers in Parsec. You can compose the parser functions in the Parser @@ -1388,14 +1464,14 @@ Identity a'). Have a look in the Text.Parsec.Prim module for these This function will run the parser, but additionally fail if it doesn't consume all the input. -> parseWithEof :: P.Parser a -> String -> Either P.ParseError a -> parseWithEof p = P.parse (p <* eof) "" +> parseWithEof :: Parser a -> String -> Either ParseError a +> parseWithEof p = parse (p <* eof) "" This function will apply the parser, then also return any left over input which wasn't parsed. -> parseWithLeftOver :: P.Parser a -> String -> Either P.ParseError (a,String) -> parseWithLeftOver p = P.parse ((,) <$> p <*> leftOver) "" +> parseWithLeftOver :: Parser a -> String -> Either ParseError (a,String) +> parseWithLeftOver p = parse ((,) <$> p <*> leftOver) "" > where leftOver = manyTill anyToken eof TODO: what happens when you use 'many anyToken <* eof' variations @@ -1404,26 +1480,26 @@ instead? Here is the main testing function used, which includes ignoring prefix whitespace. -> parseWithWSEof :: P.Parser a -> String -> Either P.ParseError a +> parseWithWSEof :: Parser a -> String -> Either ParseError a > parseWithWSEof p = parseWithEof (whiteSpace *> p) -= P.Parser += Parser -The definition of P.Parser and a partial explanation of the full type +The definition of Parser and a partial explanation of the full type signature. -~~~~ +``` type Parser = Parsec String () -~~~~ +``` This means that a function returning Parser a parses from a String with () as the initial state. The Parsec type is defined like this: -~~~~ +``` type Parsec s u = ParsecT s u Identity -~~~~ +``` ParsecT is a monad transformer, I think it is the primitive one in the Parsec library, and the 'Parsec' type is a type alias which sets the @@ -1436,9 +1512,9 @@ underlying monad m and return type a. The full types that you see like this: -~~~~ +``` satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char -~~~~ +``` refer to the same things (stream type s, user state type u, underlying monad m). @@ -1448,9 +1524,15 @@ the user state type (this effectively means no user state, since () only has one value), and the underlying monad is Identity: we are using no other underlying monad, so 'Parser a' expands to: -~~~~ +``` ParsecT String () Identity a -~~~~ +``` I.e. the source is String, the user state is (), and the underlying monad is Identity. + += Other information + +TODO: Here is some other information on Parsec and Haskell: +links, tutorials on fp, section in rwh, lyah?, old parsec docs, +parsec docs on hackage, other parser combinator libs (uu, trifecta?) diff --git a/QueryExpressions.lhs b/QueryExpressions.lhs index c9c4a97..1f2bb59 100644 --- a/QueryExpressions.lhs +++ b/QueryExpressions.lhs @@ -10,7 +10,7 @@ TODO: qualify or add explicit imports > --import Text.Groom (groom) > --import qualified Text.Parsec as P -> import qualified Text.Parsec.String as P +> import Text.Parsec.String (Parser) > import Text.Parsec (try,optionMaybe, optional, sepBy1,option) > import Control.Applicative ((<$>),(*>),(<*>)) > --import Control.Monad (void,guard) @@ -115,7 +115,7 @@ select [value expr] > parseSingleSelectItemTestData = > [("select 1", makeSelect {qeSelectList = [(NumberLiteral 1,Nothing)]})] -> singleSelectItem :: P.Parser QueryExpr +> singleSelectItem :: Parser QueryExpr > singleSelectItem = do > keyword_ "select" > e <- valueExpr @@ -124,14 +124,14 @@ select [value expr] Here is an example where rewriting to use applicative can make the parser code much less clear: -> singleSelectItemApplicative :: P.Parser QueryExpr +> singleSelectItemApplicative :: Parser QueryExpr > singleSelectItemApplicative = > (\sl -> makeSelect {qeSelectList = sl}) > <$> (keyword_ "select" *> (((:[]) . (,Nothing)) <$> valueExpr)) Using a helper function can make this version more readable: -> singleSelectItemApplicative' :: P.Parser QueryExpr +> singleSelectItemApplicative' :: Parser QueryExpr > singleSelectItemApplicative' = > ms <$> (keyword_ "select" *> valueExpr) > where @@ -152,12 +152,12 @@ select 1+2, 3+4; > ,(Identifier "b",Nothing)]}) > ,("select 1+2,3+4" > ,makeSelect {qeSelectList = -> [(Op "+" [NumberLiteral 1,NumberLiteral 2],Nothing) -> ,(Op "+" [NumberLiteral 3,NumberLiteral 4],Nothing)]}) +> [(BinaryOp (NumberLiteral 1) "+" (NumberLiteral 2),Nothing) +> ,(BinaryOp (NumberLiteral 3) "+" (NumberLiteral 4),Nothing)]}) > ] -> selectMultipleItems :: P.Parser QueryExpr +> selectMultipleItems :: Parser QueryExpr > selectMultipleItems = do > keyword_ "select" > es <- commaSep1 valueExpr @@ -165,7 +165,7 @@ select 1+2, 3+4; Here is another helper parser which will be used a lot. -> commaSep1 :: P.Parser a -> P.Parser [a] +> commaSep1 :: Parser a -> Parser [a] > commaSep1 = (`sepBy1` symbol_ ",") == aliases @@ -194,20 +194,20 @@ blacklist identifier parser Finally, here is the select list parser and the helper for select items: -> selectItem :: P.Parser (ValueExpr, Maybe String) +> selectItem :: Parser (ValueExpr, Maybe String) > selectItem = (,) <$> valueExpr <*> optionMaybe (try alias) > where alias = optional (try (keyword_ "as")) *> identifierString TODO: note about optional in parsec and in applicative -> selectList :: P.Parser [(ValueExpr, Maybe String)] +> selectList :: Parser [(ValueExpr, Maybe String)] > selectList = try (keyword_ "select") *> commaSep1 selectItem = simplified from clause -> from :: P.Parser [TableRef] +> from :: Parser [TableRef] > from = option [] (try (keyword_ "from") *> (mkFrom <$> identifierString)) > where mkFrom f = [SimpleTableRef f] @@ -226,10 +226,10 @@ The where, group by, having, and order by parsers are simple. > [("select a from t where a = 5" > ,makeSelect {qeSelectList = [(Identifier "a",Nothing)] > ,qeFrom = [SimpleTableRef "t"] -> ,qeWhere = Just $ Op "=" [Identifier "a", NumberLiteral 5]}) +> ,qeWhere = Just $ BinaryOp (Identifier "a") "=" (NumberLiteral 5)}) > ] -> swhere :: P.Parser (Maybe ValueExpr) +> swhere :: Parser (Maybe ValueExpr) > swhere = optionMaybe (try (keyword_ "where") *> valueExpr) = group by @@ -251,7 +251,7 @@ The where, group by, having, and order by parsers are simple. > }) > ] -> sgroupBy :: P.Parser [ValueExpr] +> sgroupBy :: Parser [ValueExpr] > sgroupBy = option [] (try (keyword_ "group") > *> keyword_ "by" > *> commaSep1 valueExpr) @@ -265,11 +265,11 @@ The where, group by, having, and order by parsers are simple. > ,(App "sum" [Identifier "b"],Nothing)] > ,qeFrom = [SimpleTableRef "t"] > ,qeGroupBy = [Identifier "a"] -> ,qeHaving = Just $ Op ">" [App "sum" [Identifier "b"], NumberLiteral 5] +> ,qeHaving = Just $ BinaryOp (App "sum" [Identifier "b"]) ">" (NumberLiteral 5) > }) > ] -> having :: P.Parser (Maybe ValueExpr) +> having :: Parser (Maybe ValueExpr) > having = optionMaybe (try (keyword_ "having") *> valueExpr) = order by @@ -286,14 +286,14 @@ The where, group by, having, and order by parsers are simple. > ,qeFrom = [SimpleTableRef "t"] > ,qeOrderBy = o} -> orderBy :: P.Parser [ValueExpr] +> orderBy :: Parser [ValueExpr] > orderBy = option [] (try (keyword_ "order") > *> keyword_ "by" > *> commaSep1 valueExpr) = putting together the query expression parser -> queryExpr :: P.Parser QueryExpr +> queryExpr :: Parser QueryExpr > queryExpr = > Select > <$> selectList diff --git a/Text/Parsec/String/Char.hs b/Text/Parsec/String/Char.hs new file mode 100644 index 0000000..450e9e9 --- /dev/null +++ b/Text/Parsec/String/Char.hs @@ -0,0 +1,64 @@ + +module Text.Parsec.String.Char where + +{- + +Wrappers for the Text.Parsec.Char module with the types fixed to +'Text.Parsec.String.Parser a', i.e. the stream is String, no user +state, Identity monad. + +-} + +import qualified Text.Parsec.Char as C +import Text.Parsec.String (Parser) + +spaces :: Parser () +spaces = C.spaces + +space :: Parser Char +space = C.space + +newline :: Parser Char +newline = C.newline + +tab :: Parser Char +tab = C.tab + +upper :: Parser Char +upper = C.upper + +lower :: Parser Char +lower = C.upper + +alphaNum :: Parser Char +alphaNum = C.upper + +letter :: Parser Char +letter = C.letter + +digit :: Parser Char +digit = C.digit + +hexDigit :: Parser Char +hexDigit = C.hexDigit + +octDigit :: Parser Char +octDigit = C.octDigit + +char :: Char -> Parser Char +char = C.char + +string :: String -> Parser String +string = C.string + +anyChar :: Parser Char +anyChar = C.anyChar + +oneOf :: [Char] -> Parser Char +oneOf = C.oneOf + +noneOf :: [Char] -> Parser Char +noneOf = C.noneOf + +satisfy :: (Char -> Bool) -> Parser Char +satisfy = C.satisfy diff --git a/Text/Parsec/String/Combinator.hs b/Text/Parsec/String/Combinator.hs new file mode 100644 index 0000000..373c5c6 --- /dev/null +++ b/Text/Parsec/String/Combinator.hs @@ -0,0 +1,84 @@ + +module Text.Parsec.String.Combinator where + +{- + +Wrappers for the Text.Parsec.Combinator module with the types fixed to +'Text.Parsec.String.Parser a', i.e. the stream is String, no user +state, Identity monad. + +-} + +import qualified Text.Parsec.Combinator as C +import Text.Parsec.String (Parser) + +choice :: [Parser a] -> Parser a +choice = C.choice + + +count :: Int -> Parser a -> Parser [a] +count = C.count + +between :: Parser open -> Parser close -> Parser a -> Parser a +between = C.between + + +option :: a -> Parser a -> Parser a +option = C.option + +optionMaybe :: Parser a -> Parser (Maybe a) +optionMaybe = C.optionMaybe + +optional :: Parser a -> Parser () +optional = C.optional + +skipMany1 :: Parser a -> Parser () +skipMany1 = C.skipMany1 + +many1 :: Parser a -> Parser [a] +many1 = C.many1 + +sepBy :: Parser a -> Parser sep -> Parser [a] +sepBy = C.sepBy + +sepBy1 :: Parser a -> Parser sep -> Parser [a] +sepBy1 = C.sepBy1 + +endBy :: Parser a -> Parser sep -> Parser [a] +endBy = C.endBy + +endBy1 :: Parser a -> Parser sep -> Parser [a] +endBy1 = C.endBy1 + +sepEndBy :: Parser a -> Parser sep -> Parser [a] +sepEndBy = C.sepEndBy + +sepEndBy1 :: Parser a -> Parser sep -> Parser [a] +sepEndBy1 = C.sepEndBy1 + +chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a +chainl = C.chainl + +chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a +chainl1 = C.chainl1 + +chainr :: Parser a -> Parser (a -> a -> a) -> a -> Parser a +chainr = C.chainr + +chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a +chainr1 = C.chainr1 + +eof :: Parser () +eof = C.eof + +notFollowedBy :: Show a => Parser a -> Parser () +notFollowedBy = C.notFollowedBy + +manyTill :: Parser a -> Parser end -> Parser [a] +manyTill = C.manyTill + +lookAhead :: Parser a -> Parser a +lookAhead = C.lookAhead + +anyToken :: Parser Char +anyToken = C.anyToken diff --git a/Text/Parsec/String/Expr.hs b/Text/Parsec/String/Expr.hs new file mode 100644 index 0000000..3ebc68b --- /dev/null +++ b/Text/Parsec/String/Expr.hs @@ -0,0 +1,28 @@ + +module Text.Parsec.String.Expr (buildExpressionParser + ,Operator(..) + ,OperatorTable + ,E.Assoc(..) + )where + +{- + +Wrappers for the Text.Parsec.Expr module with simplified types. + +-} + +import Text.Parsec.String (Parser) +import qualified Text.Parsec.Expr as E + +data Operator a = Infix (Parser (a -> a -> a)) E.Assoc + | Prefix (Parser (a -> a)) + | Postfix (Parser (a -> a)) + +type OperatorTable a = [[Operator a]] + +buildExpressionParser :: OperatorTable a -> Parser a -> Parser a +buildExpressionParser t = E.buildExpressionParser (map (map f) t) + where + f (Infix p a) = E.Infix p a + f (Prefix p) = E.Prefix p + f (Postfix p) = E.Postfix p diff --git a/Text/Parsec/String/Parsec.hs b/Text/Parsec/String/Parsec.hs new file mode 100644 index 0000000..43ed85a --- /dev/null +++ b/Text/Parsec/String/Parsec.hs @@ -0,0 +1,18 @@ + +module Text.Parsec.String.Parsec where + +{- + +wrapper for some Text.Parsec functions which use a simplified type + +-} + +import Text.Parsec.String (Parser) +import qualified Text.Parsec as P + + +try :: Parser a -> Parser a +try = P.try + +parse :: Parser a -> P.SourceName -> String -> Either P.ParseError a +parse = P.parse diff --git a/ValueExpressions.lhs b/ValueExpressions.lhs index 05ddaf8..ac6f8ce 100644 --- a/ValueExpressions.lhs +++ b/ValueExpressions.lhs @@ -8,15 +8,16 @@ to the simple expressions used in the last tutorial. > module ValueExpressions where > import Text.Groom (groom) -> import qualified Text.Parsec as P -> import qualified Text.Parsec.String as P -> import Text.Parsec (oneOf,digit,many1,string,manyTill,anyChar,eof,choice,char,letter,between,sepBy,try,optionMaybe,alphaNum) +> import Text.Parsec.String (Parser) +> import Text.Parsec.String.Char (oneOf, digit, string, anyChar, char, letter, alphaNum) +> import Text.Parsec.String.Combinator (many1, manyTill, eof, choice, between, sepBy, optionMaybe) +> import Text.Parsec.String.Parsec(try,parse) + > import Control.Applicative (many, (<*),(<$>), (*>), (<|>),(<$),(<*>)) > import Control.Monad (void,guard) > --import Debug.Trace > import Data.List (intercalate) -> import qualified Text.Parsec.Expr as E -> import Control.Monad.Identity (Identity) +> import qualified Text.Parsec.String.Expr as E Our value expressions will support literals, identifiers, asterisk, some simple operators, case expression and parentheses. Here is the @@ -170,7 +171,7 @@ parsers for each of these variants. Here is the whitespace parser which skips comments also -> whiteSpace :: P.Parser () +> whiteSpace :: Parser () > whiteSpace = > choice [simpleWhiteSpace *> whiteSpace > ,lineComment *> whiteSpace @@ -203,40 +204,40 @@ It doesn't support non integral number literals. We already saw how to write these parsers: -> integer :: P.Parser Integer +> integer :: Parser Integer > integer = read <$> many1 digit <* whiteSpace -> integerLiteral :: P.Parser ValueExpr +> integerLiteral :: Parser ValueExpr > integerLiteral = NumberLiteral <$> integer String literals: -> stringLiteral :: P.Parser ValueExpr +> stringLiteral :: Parser ValueExpr > stringLiteral = StringLiteral <$> (symbol_ "'" *> manyTill anyChar (symbol_ "'")) Here is the symbol parser. I've created a wrapper which uses void which can be used to avoid writing void in lots of places. -> symbol :: String -> P.Parser String +> symbol :: String -> Parser String > symbol s = string s <* whiteSpace -> symbol_ :: String -> P.Parser () +> symbol_ :: String -> Parser () > symbol_ s = void $ symbol s TODO: suffix issues with symbol parser Here is the parser which can parse either kind of literal: -> literal :: P.Parser ValueExpr +> literal :: Parser ValueExpr > literal = integerLiteral <|> stringLiteral Here is a small helper function to check the examples above. I've put the test data and the parser as parameters so we can reuse it later. -> checkParse :: (Eq a, Show a) => P.Parser a -> [(String,a)] -> IO () +> checkParse :: (Eq a, Show a) => Parser a -> [(String,a)] -> IO () > checkParse parser testData = do > let -- create a wrapper function which uses the parser function -> parseit = P.parse (whiteSpace *> parser <* eof) "" +> parseit = parse (whiteSpace *> parser <* eof) "" > -- parse all the input strings > parsed = map (parseit . fst) testData > triples = zip testData parsed @@ -290,7 +291,7 @@ Some examples: Here a parser for the identifier token -> identifierString' :: P.Parser String +> identifierString' :: Parser String > identifierString' = > (:) <$> letterOrUnderscore > <*> many letterDigitOrUnderscore <* whiteSpace @@ -300,7 +301,7 @@ Here a parser for the identifier token and a parser for identifier expressions -> identifier' :: P.Parser ValueExpr +> identifier' :: Parser ValueExpr > identifier' = Identifier <$> identifierString We can check these parsers at the ghci prompt using 'checkParse @@ -314,7 +315,7 @@ There will be an issue with this parser which will be covered later. > parseDottedIdentifierTestData = > [("t.a", DIdentifier "t" "a")] -> dottedIdentifier :: P.Parser ValueExpr +> dottedIdentifier :: Parser ValueExpr > dottedIdentifier = DIdentifier <$> identifierString > <*> (symbol_ "." *> identifierString) @@ -325,7 +326,7 @@ There will be an issue with this parser which will be covered later. > [("*", Star) > ,("t.*", DStar "t")] -> star :: P.Parser ValueExpr +> star :: Parser ValueExpr > star = choice [Star <$ symbol_ "*" > ,DStar <$> (identifierString <* symbol_ "." <* symbol_ "*")] @@ -341,15 +342,15 @@ function application: f(), f(a), f(a,b), etc. The valueExpr parser will appear later. -> app :: P.Parser ValueExpr +> app :: Parser ValueExpr > app = App <$> identifierString <*> parens (commaSep valueExpr) There are two new helper parsers: -> parens :: P.Parser a -> P.Parser a +> parens :: Parser a -> Parser a > parens = between (symbol_ "(") (symbol_ ")") -> commaSep :: P.Parser a -> P.Parser [a] +> commaSep :: Parser a -> Parser [a] > commaSep = (`sepBy` symbol_ ",") == case @@ -376,7 +377,7 @@ Here are the examples/tests for case. Here is the parser: -> scase :: P.Parser ValueExpr +> scase :: Parser ValueExpr > scase = > Case <$> (try (keyword_ "case") *> optionMaybe (try valueExpr)) > <*> many1 swhen @@ -391,10 +392,10 @@ same issue as the symbol parser regarding valid suffix characters. It suffers from more issues since e.g. keyword 'select' will parse this string 'selectx', which is even more wrong. -> keyword :: String -> P.Parser String +> keyword :: String -> Parser String > keyword s = string s <* whiteSpace -> keyword_ :: String -> P.Parser () +> keyword_ :: String -> Parser () > keyword_ s = keyword s *> return () In fact, it's the same as the symbol parser. We can use this for now @@ -404,7 +405,7 @@ which uses them. TODO: put in the identifier with blacklist here: the when issue. -> identifierString :: P.Parser String +> identifierString :: Parser String > identifierString = do > s <- (:) <$> letterOrUnderscore > <*> many letterDigitOrUnderscore <* whiteSpace @@ -420,14 +421,14 @@ TODO: put in the identifier with blacklist here: the when issue. TODO: talk about what must be in the blacklist, and what doesn't need to be. This should be later. -> identifier :: P.Parser ValueExpr +> identifier :: Parser ValueExpr > identifier = Identifier <$> identifierString TODO: follow up on try, error messages. = parens -> sparens :: P.Parser ValueExpr +> sparens :: Parser ValueExpr > sparens = Parens <$> parens valueExpr = operators @@ -435,7 +436,7 @@ TODO: follow up on try, error messages. Here is our operator table. I followed the precedences given here: http://www.postgresql.org/docs/9.3/static/sql-syntax-lexical.html#SQL-PRECEDENCE -> table :: [[E.Operator String () Identity ValueExpr]] +> table :: [[E.Operator ValueExpr]] > table = [[binary "*" E.AssocLeft > ,binary "/" E.AssocLeft] > ,[binary "+" E.AssocLeft @@ -472,7 +473,7 @@ expression parser tutorial? Here is the value expr parser: -> valueExpr :: P.Parser ValueExpr +> valueExpr :: Parser ValueExpr > valueExpr = E.buildExpressionParser table term > where > term = choice [literal diff --git a/plan b/plan index e2f228e..549f55b 100644 --- a/plan +++ b/plan @@ -1,11 +1,27 @@ TODO: +replace ~~~~ with ``` which seems to be more standard and works on +fpcomplete site as well as with pandoc + +make request for fpcomplete tutorials to support > for lhs files +instead of for blockquotes. Otherwise, will have to write a filter +which wraps > sections with ```haskell before pasting into fpcomplete. + +investigate how embedded code can be made runnable within the tutorial +on fpcomplete + +add ` for embedded code + review and remove explicit explanations where feasible, replace with more examples -try and create a script to render, something like: -pandoc --from=markdown+lhs --to=html ParsingIntroduction.lhs -o ParsingIntroduction.lhs.html -c main.css --toc -for all the files. Investigate how it will look when uploaded to fp +break up parsing introduction into several docs, it's huge + +maybe have a sequence of docs for expression parsing, in the middle +will be the simple sql value expression parser + +maybe split parsecextras file into three also? + documents which need finishing off: @@ -44,12 +60,18 @@ error messages: explore what error messages it produces and look at how the parsers can be rearranged for better error messages +more on parsec internals? Maybe implementing a simple parser +combinator library, then understand the implementation of parsec and +choices + refactored project: create a new branch in simple-sql-parser, document how the files are created, laid out and the cabal and extra stuff created. Maybe talk about the website also. links to stuff about creating packages +annotated run through of the ANSI SQL grammar + extending value expressions extend these to add all the bits from ansi sql 2003 (+cherry pick