Scanning in a Nutshell
If you dug this up on your own, maybe you already know this. If you're curious, here is how and why:
Scanning means recovering the "words" from a text, identifying their categories and denotations, and filtering out the noise.
Normally the structure of words is given by a set of regular expressions with some rule for disambiguation. Normally that rule is "longest-match prevails", but there are exceptions. Regular expressions are simple to translate to non-deterministic finite-state machines (FSMs). A "subset construction" builds a deterministic FSM (DFSM) from a non-deterministic one: it's an up-front cost paid to speed up the scanning process. Ordinarily a next phase computes the smallest possible equivalent DFSM (this is called "minimization"), first by merging indistinguishable states, and then by finding equivalence classes among the letters of the original alphabet (which means there is less to store). Finally, we convert the DFSM into a compact representation which remains acceptably quick to query.
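The subset construction mentioned above can be sketched briefly. This is a minimal illustration, not the project's actual code: the NFA representation (`moves`, `epsilon` as plain dictionaries of integer states) is an assumption chosen for clarity.

```python
def epsilon_closure(states, epsilon):
    """All NFA states reachable from `states` by epsilon moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        for t in epsilon.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start, moves, epsilon, alphabet):
    """Build a DFSM from an NFA. Each deterministic state is a
    (frozen) set of NFA states; returns (transitions, state_numbering)."""
    initial = epsilon_closure({start}, epsilon)
    numbering = {initial: 0}
    transitions = {}
    work = [initial]
    while work:
        subset = work.pop()
        for symbol in alphabet:
            target = epsilon_closure(
                {t for s in subset for t in moves.get(s, {}).get(symbol, ())},
                epsilon)
            if not target:
                continue  # dead end on this symbol
            if target not in numbering:
                numbering[target] = len(numbering)
                work.append(target)
            transitions[numbering[subset], symbol] = numbering[target]
    return transitions, numbering
```

Minimization would then run over the resulting transition table; it changes which DFSM you keep, not how the construction works.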
A DFSM-based scanner matches the text against the FSM: each time the FSM is in an "accepting" state, the scanner updates its idea of the longest match (and which rule matched specifically). Whenever the FSM is unable to accept the next letter, the match runs from the starting position to the most recent acceptance, and scanning starts over at the next letter after the match (which may involve backtracking).
When the scanner emits a match, it must route to the correct response for the particular matched rule. That response is typically written in a general-purpose programming language.
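In a host language like Python, that routing is naturally a table of callables indexed by rule number. The rule numbers and actions here are made up for illustration:

```python
# Hypothetical actions: rule 0 matched digits, rule 1 matched a name,
# rule 2 matched whitespace (noise, so its action yields nothing).
ACTIONS = {
    0: lambda lexeme: ('NUMBER', int(lexeme)),
    1: lambda lexeme: ('NAME', lexeme),
    2: lambda lexeme: None,
}

def drive(matches):
    """Route each (rule_id, lexeme) match to its action; drop the noise."""
    for rule_id, lexeme in matches:
        token = ACTIONS[rule_id](lexeme)
        if token is not None:
            yield token
```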
A start condition is just a pair of start-state indexes: one special for the "begin-of-line" condition, and one for everywhere else. The scanner tests for begin-of-line before scanning each token.
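Concretely, that test amounts to looking one character backward. A sketch, with a made-up condition table (the state numbers are arbitrary):

```python
# Hypothetical start conditions: each names a pair of DFSM start states,
# (begin-of-line start, everywhere-else start).
CONDITIONS = {
    'INITIAL': (17, 3),
}

def start_state(condition, text, pos):
    """Pick the start state for the token beginning at `pos`."""
    at_bol, elsewhere = CONDITIONS[condition]
    if pos == 0 or text[pos - 1] == '\n':
        return at_bol
    return elsewhere
```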
Rules are annotated with the size of the trailing context (if constant) or perhaps the leading part, if it's constant and the trailing context is variable. It would be possible to support variable-both at the cost of considerable extra complexity; I don't think it's worth the effort right now.
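Given those annotations, trimming the match back to the true end of the lexeme is simple arithmetic. The `Rule` fields below are illustrative names, not the project's actual ones:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    trailing_length: Optional[int] = None  # constant trailing context, if any
    stem_length: Optional[int] = None      # constant stem, variable trail

def token_end(rule, match_start, match_end):
    """Where the reported lexeme ends, given the full match extent."""
    if rule.trailing_length is not None:
        return match_end - rule.trailing_length
    if rule.stem_length is not None:
        return match_start + rule.stem_length
    return match_end                       # no trailing context at all
```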
Rule priorities are handled during the subset-construction: each non-deterministic state knows the priority of the rule it leads to, and the minimum eligible priority of successor-states thus becomes part of the "label" associated with finding deterministic states. Within the same priority, the earliest-defined rule prevails: rules are numbered according to the order in which they were defined, so when two rules of equal priority both match, the decision among them is trivial.
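The tie-breaking itself boils down to one comparison. A sketch, assuming (as above, and hypothetically) that higher numbers mean higher priority and each accepting NFA state carries a (priority, rule_number) pair:

```python
def winning_rule(accepting_states, rule_info):
    """Pick the rule a deterministic state should report.
    `rule_info` maps accepting NFA state -> (priority, rule_number)."""
    candidates = [rule_info[s] for s in accepting_states if s in rule_info]
    if not candidates:
        return None  # this deterministic state accepts nothing
    # Highest priority wins; among equals, the earliest-defined rule.
    return min(candidates, key=lambda pr: (-pr[0], pr[1]))[1]
```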