-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support named capture groups #112
base: master
Are you sure you want to change the base?
Conversation
Can anybody help add reviewers? (Also approve running workflows) |
PTAL @alainfrisch @Drup @pmetzger |
src/syntax/ppx_sedlex.ml
Outdated
with Not_found -> | ||
err p.ppat_loc (Printf.sprintf "unbound regexp %s" x) | ||
end | ||
| Ppat_alias (p, ({ txt = x } as x_loc)) when allow_alias -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give a better error for Ppat_alias _ when not allow_alias
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment still stand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has now been addressed.
src/syntax/ppx_sedlex.ml
Outdated
@@ -489,9 +667,13 @@ let mapper = | |||
let l, regexps' = sub#structure_with_regexps (!previous @ l) in | |||
let parts = List.map partition (get_partitions ()) in | |||
let tables = List.map table (get_tables ()) in | |||
let funcs = | |||
let loc = default_loc in | |||
[%str let __sedlex_tl = function _ :: tl -> tl | _ -> assert false] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put this in the lib, similar to what we do for Sedlexing.__private__next_int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not relevant anymore
src/syntax/ppx_sedlex.ml
Outdated
let brs = | ||
Array.of_list | ||
(List.map | ||
(fun ((r, s), e) -> ((r, List.of_seq (StrLocSet.to_seq s)), e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StrLocSet.elements
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really comment on the implementation but, for the API, is there a benefit in exporting offset
/length
vs simply binding the resulting match string?
Could you give more details about the current implementation ? Why is it done this way ? Can't we update positions as we traverse the automation, instead of doing a second pass based on some trace ? Why is the trace processed in reverse order ? |
There is a lot of room for optimization. Offset tracking could be shared between bindings. For example, in Some offset can be computer statically. In the regexp above, start of a is 0, end of b is the end of the lexeme. Have you considered such optimizations ? |
Returns an unexpected result:
|
I think I would be easier if your trace was maintaining pairs of pos, instead of pos + len. |
@hhugo Thanks for your considerate suggestions! Let me do it one by one and start from merging the alias offsets.
I'm not sure if I'm able to support optimizations in such static cases. I'll think about it. Another thing yet to do is to stop tracking the path if the matching branches have no alias at all. |
src/syntax/ppx_sedlex.ml
Outdated
(fun action offsets -> | ||
try | ||
let i = Hashtbl.find cases2offset offsets in | ||
Hashtbl.replace cases2offset offsets i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Hashtbl.replace
doesn't seem to be doing anything.
src/syntax/sedlex.mli
Outdated
prev_state : int; | ||
prev_node : int; | ||
char_set : Sedlex_cset.t; | ||
actions : (string * bool) list; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe declare a separate type for action ?
type action = Start of string | End of string
?
src/syntax/ppx_sedlex.ml
Outdated
| [] -> e | ||
| aliases -> | ||
let loc = default_loc in | ||
let _, action_offsets = Option.get offsets.(i) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of matching on aliases
above, we could directly match on offsets.(i)
let gen_aliases lexbuf offsets i e aliases =
match offsets.(i) with
| None -> e
| Some (_, action_offsets) ->
let loc = default_loc in
pexp_let ~loc Nonrecursive
[
value_binding ~loc
~pat:[%pat? __sedlex_offsets]
~expr:
(appfun (trace_fun i) [evar ~loc lexbuf; [%expr __sedlex_path]]);
]
@@ pexp_let ~loc Nonrecursive
(List.map
(fun { txt = alias; loc } ->
let start = Hashtbl.find action_offsets (alias, true) in
let stop = Hashtbl.find action_offsets (alias, false) in
value_binding ~loc ~pat:(pvar ~loc alias)
~expr:
[%expr
__sedlex_offsets.([%e eint ~loc start]),
__sedlex_offsets.([%e eint ~loc stop])
- __sedlex_offsets.([%e eint ~loc start])])
aliases)
@@ e
with Not_found -> Hashtbl.add action2cases action [i]) | ||
actions) | ||
cases; | ||
let counter = ref 0 in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counter could be replace by Hastbl.length alias2offset
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore me
src/syntax/ppx_sedlex.ml
Outdated
let offset_array = | ||
value_binding ~loc | ||
~pat:[%pat? __sedlex_offsets] | ||
~expr:(pexp_array ~loc (List.init offsets_num (fun _ -> [%expr 0]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use an invalid position here (e.g. '-1`) as it would make it more obvious if we ever encounter a bug where position are not updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done
src/syntax/sedlex.ml
Outdated
let node_j = Hashtbl.find nodes_idx from_node in | ||
let rec dfs cset actions to_node = | ||
let node_i = Hashtbl.find nodes_idx to_node in | ||
try ignore (Hashtbl.find cases (i, node_i, j, cset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hashtbl.mem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done
src/syntax/sedlex.ml
Outdated
|
||
let compile_traces states (start, final) = | ||
let counter = ref 0 in | ||
let nodes_idx = Hashtbl.create 31 in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are nodes_idx
different from node.id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I forgot there's already an id in node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
let i = Hashtbl.find nodes_idx node in | ||
let actions = append_action actions node.action in | ||
let cases = | ||
if node.trans <> [] || node == final then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining these conditions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add it later.
let r1, s1 = aux allow_alias p1 in | ||
let r2, s2 = aux allow_alias p2 in | ||
if not (StrLocSet.equal s1 s2) then begin | ||
let x = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about providing the full list of problematic bindings ?
(e.g StrLocSet.(diff (union s1 s2) (inter s1 s2))
)
(fun (r1, s1) p -> | ||
let r2, s2 = aux allow_alias p in | ||
if not (StrLocSet.disjoint s1 s2) then begin | ||
let x = StrLocSet.choose (StrLocSet.inter s1 s2) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we could return the full list of duplicated bindings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to report one location at a time, just like the OCaml compiler behaves. (One name corresponds to one location)
src/syntax/sedlex.ml
Outdated
(fun to_state i -> | ||
List.iter | ||
(fun from_node -> | ||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this try-with necessary ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It caused some problem when I removed nodes_idx
. Now it's fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on the issue and explain what are the irrelevant nodes ?
[ | ||
pint ~loc curr_state; | ||
pint ~loc curr_node; | ||
pint ~loc prev_state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is prev_state
ever needed to choose the action ?
On a trivial example
let rec token buf =
match%sedlex buf with
| (Star "a" as a), (Plus "b" as b), "c" ->
print_alias buf "a" a;
print_alias buf "b" b;
token buf
| eof -> print_endline "EOF"
| _ -> failwith "Unexpected character"
curr_state
and curr_node
are enough for choosing the action.
The following
let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
(match (__sedlex_curr_state, __sedlex_curr_node,
__sedlex_prev_state)
with
| (2, 6, 0) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
| (0, 6, 2) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
| (3, 6, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest
| (4, 1, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 2 __sedlex_rest
| (3, 2, 0) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest)
| (3, 6, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest
| (2, 6, 2) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
| (3, 2, 2) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest)
| (3, 2, 3) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest)
| (3, 6, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest
| (0, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
| (0, 6, 0) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
| (2, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
| (0, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
| (2, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
| _ -> assert false) in
can be rewritten into
let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
(match (__sedlex_curr_state, __sedlex_curr_node) with
| (2, 6) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest)
| (0, 6) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest)
| (3, 6) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
__sedlex_rest
| (4, 1) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 2
__sedlex_rest
| (3, 2) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
__sedlex_rest)
| (0, 10) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest
| (2, 10) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest
| _ -> assert false) in
if prev_state is sometime useful, we could include it inside a guard only when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prev_state
is still useful. It's a good idea to make it a guard.
Let me make more tests.
FYI, if one really wants to support capture groups correctly: https://re2c.org/#papers has several must-read papers. |
@pmetzger Thanks for the link! |
let r2, s2 = aux allow_alias p2 in | ||
if not (StrLocSet.equal s1 s2) then begin | ||
let x = | ||
try StrLocSet.choose (StrLocSet.diff s1 s2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid the exception based selection. and use
StrLocSet.choose StrLocSet.(diff (union s1 s2) (inter s1 s2))
I've added this in hhugo@0284bfd |
Hey there! Do we want to try & come to a consensus on wether this could be merged? |
@toots I think there's at least one code review comment above that hasn't yet been addressed? Generally I think we should merge though? |
I currently can't close/resolve threads. There are still a bunch unaddressed ones. |
This PR tries to support named capture groups. Closes #5, closes #102.
Usages
In the example
Variable
digits
is defined as a(pos, len)
pair, which can be passed as arguments toSedlexing.sub_lexeme
to retrieve the group captured bydigits
.The
as
syntax is only allowed inmatch%sedlex ... with ...
constructs. You cannot provide an alias inlet ... = [%sedlex.regexp? ...]
. And it is not allowed inside constructors (i.e.Star
,Plus
,Rep
,Opt
,Compl
,Sub
,Intersect
,Chars
).Implementation
We are trying to restore the NFA node transitions path with the information of the DFA state transitions path.
Explanation
And finally,
is translated to
Note
When there are multiple available paths, any of them might be chosen. As in
One of the possible combinations of
a
andb
will be returned.Others
There are some tests (
test/number_lexer.ml
,test/misc.ml
) currently. More to be added.