Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support named capture groups #112

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Conversation

dyzsr
Copy link

@dyzsr dyzsr commented Sep 19, 2022

This PR tries to support named capture groups. Closes #5, closes #102.

Usages

In the example

(Plus '0'..'9') as digits

Variable digits is defined as a (pos, len) pair, which can be passed as arguments to Sedlexing.sub_lexeme to retrieve the group captured by digits.

The as syntax is only allowed in match%sedlex ... with ... constructs. You cannot provide an alias in let ... = [%sedlex.regexp? ...]. And it is not allowed inside constructors (i.e. Star, Plus, Rep, Opt, Compl, Sub, Intersect, Chars).

Plus ("ab" as a)  (* this is invalid *)

Implementation

We are trying to restore the NFA node transitions path with the information of the DFA state transitions path.

Explanation

And finally,

let sub (a, b) = Sedlexing.Latin1.sub_lexeme buf a b in
match%sedlex buf with
  | (sign as s), prefix_2, (num_2 as n) ->
      Printf.printf "Bin %s%s\n" (sub s) (sub n);
      token buf
  ...

is translated to

let sub (a, b) = Sedlexing.Latin1.sub_lexeme buf a b in
match __sedlex_result with
  | 0 ->
      let __sedlex_aliases_starts, __sedlex_aliases_stops =
        __sedlex_trace_0 buf __sedlex_path
      in
      let n = (__sedlex_aliases_starts.(0), __sedlex_aliases_stops.(0) - __sedlex_aliases_starts.(0))
      and s = (__sedlex_aliases_starts.(1), __sedlex_aliases_stops.(1) - __sedlex_aliases_starts.(1)) in
      Printf.printf "Bin %s%s\n" (sub s) (sub n);
      token buf
  ...

Note

When there are multiple available paths, any of them might be chosen. As in

(Plus "ab" as a), (Plus "ab" as b)

"abababab" => a:"ababab", b:"ab"

One of the possible combinations of a and b will be returned.

Others

There are some tests (test/number_lexer.ml, test/misc.ml) currently. More to be added.

@dyzsr dyzsr marked this pull request as ready for review September 20, 2022 15:25
@dyzsr dyzsr requested review from hhugo and removed request for Drup, pmetzger and alainfrisch September 20, 2022 15:38
@dyzsr
Copy link
Author

dyzsr commented Sep 20, 2022

Can anybody help add reviewers? (Also approve running workflows)

src/syntax/ppx_sedlex.ml Outdated Show resolved Hide resolved
@dyzsr
Copy link
Author

dyzsr commented Sep 21, 2022

PTAL @alainfrisch @Drup @pmetzger

with Not_found ->
err p.ppat_loc (Printf.sprintf "unbound regexp %s" x)
end
| Ppat_alias (p, ({ txt = x } as x_loc)) when allow_alias ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give a better error for Ppat_alias _ when not allow_alias ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment still stand

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been addressed.

@@ -489,9 +667,13 @@ let mapper =
let l, regexps' = sub#structure_with_regexps (!previous @ l) in
let parts = List.map partition (get_partitions ()) in
let tables = List.map table (get_tables ()) in
let funcs =
let loc = default_loc in
[%str let __sedlex_tl = function _ :: tl -> tl | _ -> assert false]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put this in the lib, similar to what we do for Sedlexing.__private__next_int

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not relevant anymore

let brs =
Array.of_list
(List.map
(fun ((r, s), e) -> ((r, List.of_seq (StrLocSet.to_seq s)), e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StrLocSet.elements ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been applied

Copy link
Member

@toots toots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really comment on the implementation but, for the API, is there a benefit in exporting offset/length vs simply binding the resulting match string?

@hhugo
Copy link
Contributor

hhugo commented Sep 21, 2022

Could you give more details about the current implementation ? Why is it done this way ? Can't we update positions as we traverse the automation, instead of doing a second pass based on some trace ? Why is the trace processed in reverse order ?

@hhugo
Copy link
Contributor

hhugo commented Sep 21, 2022

There is a lot of room for optimization.

Offset tracking could be shared between bindings.

For example, in(Plus 'a' as a), (Plus 'b' as b), end of binding a equals start of binding b.

Some offset can be computer statically. In the regexp above, start of a is 0, end of b is the end of the lexeme.

Have you considered such optimizations ?

@hhugo
Copy link
Contributor

hhugo commented Sep 22, 2022

let rec token buf =
  match%sedlex buf with
    | (Plus "a" as a), (Plus "b" as b), "c" ->
        ignore a;
        ignore b;
        Printf.printf "Op %s - %s\n"
          (Sedlexing.Latin1.sub_lexeme buf (fst a) (snd a))
          (Sedlexing.Latin1.sub_lexeme buf (fst b) (snd b));
        token buf
    | eof -> print_endline "EOF"
    | _ -> failwith "Unexpected character"

let () =
  let lexbuf = Sedlexing.Latin1.from_string "aaaabbbc" in
  token lexbuf

Returns an unexpected result:

Op aaaab - bb                       
EOF

@hhugo
Copy link
Contributor

hhugo commented Sep 22, 2022

I think I would be easier if your trace was maintaining pairs of pos, instead of pos + len.

@dyzsr
Copy link
Author

dyzsr commented Sep 26, 2022

@hhugo Thanks for your considerate suggestions! Let me do it one by one and start from merging the alias offsets.

ocamllex has some optimization:

rule token = parse
  | ("a"+ as a) ("b" + as b) "d" { a,b }

only stores 2 integers

rule token = parse
  | (['-' '+'] as sign) (['0'-'9']+ as a) { a,b }

Doesn't store anything.

I'm not sure if I'm able to support optimizations in such static cases. I'll think about it.

Another thing yet to do is to stop tracking the path if the matching branches have no alias at all.

(fun action offsets ->
try
let i = Hashtbl.find cases2offset offsets in
Hashtbl.replace cases2offset offsets i;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Hashtbl.replace doesn't seem to be doing anything.

prev_state : int;
prev_node : int;
char_set : Sedlex_cset.t;
actions : (string * bool) list;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe declare a separate type for action ?
type action = Start of string | End of string ?

| [] -> e
| aliases ->
let loc = default_loc in
let _, action_offsets = Option.get offsets.(i) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of matching on aliases above, we could directly match on offsets.(i)

let gen_aliases lexbuf offsets i e aliases =
  match offsets.(i) with
  | None -> e
  | Some (_, action_offsets) ->
     let loc = default_loc in
     pexp_let ~loc Nonrecursive
        [
          value_binding ~loc
            ~pat:[%pat? __sedlex_offsets]
            ~expr:
              (appfun (trace_fun i) [evar ~loc lexbuf; [%expr __sedlex_path]]);
        ]
      @@ pexp_let ~loc Nonrecursive
           (List.map
             (fun { txt = alias; loc } ->
                let start = Hashtbl.find action_offsets (alias, true) in
                let stop = Hashtbl.find action_offsets (alias, false) in
               value_binding ~loc ~pat:(pvar ~loc alias)
                  ~expr:
                    [%expr
                      __sedlex_offsets.([%e eint ~loc start]),
                        __sedlex_offsets.([%e eint ~loc stop])
                        - __sedlex_offsets.([%e eint ~loc start])])
              aliases)
      @@ e

with Not_found -> Hashtbl.add action2cases action [i])
actions)
cases;
let counter = ref 0 in
Copy link
Contributor

@hhugo hhugo Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter could be replace by Hastbl.length alias2offset I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore me

let offset_array =
value_binding ~loc
~pat:[%pat? __sedlex_offsets]
~expr:(pexp_array ~loc (List.init offsets_num (fun _ -> [%expr 0])))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use an invalid position here (e.g. '-1`) as it would make it more obvious if we ever encounter a bug where position are not updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done

let node_j = Hashtbl.find nodes_idx from_node in
let rec dfs cset actions to_node =
let node_i = Hashtbl.find nodes_idx to_node in
try ignore (Hashtbl.find cases (i, node_i, j, cset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashtbl.mem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done


let compile_traces states (start, final) =
let counter = ref 0 in
let nodes_idx = Hashtbl.create 31 in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are nodes_idx different from node.id ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I forgot there's already an id in node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

let i = Hashtbl.find nodes_idx node in
let actions = append_action actions node.action in
let cases =
if node.trans <> [] || node == final then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining these conditions ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add it later.

let r1, s1 = aux allow_alias p1 in
let r2, s2 = aux allow_alias p2 in
if not (StrLocSet.equal s1 s2) then begin
let x =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about providing the full list of problematic bindings ?
(e.g StrLocSet.(diff (union s1 s2) (inter s1 s2)))

(fun (r1, s1) p ->
let r2, s2 = aux allow_alias p in
if not (StrLocSet.disjoint s1 s2) then begin
let x = StrLocSet.choose (StrLocSet.inter s1 s2) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we could return the full list of duplicated bindings

Copy link
Author

@dyzsr dyzsr Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to report one location at a time, just like the OCaml compiler behaves. (One name corresponds to one location)

(fun to_state i ->
List.iter
(fun from_node ->
try
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this try-with necessary ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It caused some problem when I removed nodes_idx. Now it's fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on the issue and explain what are the irrelevant nodes ?

[
pint ~loc curr_state;
pint ~loc curr_node;
pint ~loc prev_state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is prev_state ever needed to choose the action ?
On a trivial example

let rec token buf =
  match%sedlex buf with
    | (Star "a" as a), (Plus "b" as b), "c" ->
        print_alias buf "a" a;
        print_alias buf "b" b;
        token buf
    | eof -> print_endline "EOF"
    | _ -> failwith "Unexpected character"

curr_state and curr_node are enough for choosing the action.
The following

          let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
          (match (__sedlex_curr_state, __sedlex_curr_node,
                   __sedlex_prev_state)
           with
           | (2, 6, 0) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
           | (0, 6, 2) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
           | (3, 6, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest
           | (4, 1, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 2 __sedlex_rest
           | (3, 2, 0) ->
               (__sedlex_offsets.(1) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest)
           | (3, 6, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest
           | (2, 6, 2) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
           | (3, 2, 2) ->
               (__sedlex_offsets.(1) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest)
           | (3, 2, 3) ->
               (__sedlex_offsets.(1) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest)
           | (3, 6, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest
           | (0, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
           | (0, 6, 0) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
           | (2, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
           | (0, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
           | (2, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
           | _ -> assert false) in

can be rewritten into

          let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
          (match (__sedlex_curr_state, __sedlex_curr_node) with
           | (2, 6) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
                  __sedlex_rest)
           | (0, 6) ->
               (__sedlex_offsets.(2) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
                  __sedlex_rest)
           | (3, 6) ->
               __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
                 __sedlex_rest
           | (4, 1) ->
               __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 2
                 __sedlex_rest
           | (3, 2) ->
               (__sedlex_offsets.(1) <- __sedlex_pos;
                __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
                  __sedlex_rest)
           | (0, 10) ->
               __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
                 __sedlex_rest
           | (2, 10) ->
               __sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
                 __sedlex_rest
           | _ -> assert false) in

if prev_state is sometime useful, we could include it inside a guard only when needed.

Copy link
Author

@dyzsr dyzsr Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prev_state is still useful. It's a good idea to make it a guard.

Let me make more tests.

@pmetzger
Copy link
Member

FYI, if one really wants to support capture groups correctly: https://re2c.org/#papers has several must-read papers.

@dyzsr
Copy link
Author

dyzsr commented Sep 29, 2022

@pmetzger Thanks for the link!

let r2, s2 = aux allow_alias p2 in
if not (StrLocSet.equal s1 s2) then begin
let x =
try StrLocSet.choose (StrLocSet.diff s1 s2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid the exception based selection. and use
StrLocSet.choose StrLocSet.(diff (union s1 s2) (inter s1 s2))

@hhugo
Copy link
Contributor

hhugo commented Oct 4, 2022

I think it would be nice to commit the post-processed version of your test (like you did before) with the corresponding dune stanza so that the file is kept up to date automatically. I think this really helps when reviewing changes.

I've added this in hhugo@0284bfd

@toots
Copy link
Member

toots commented Oct 22, 2022

Hey there! Do we want to try & come to a consensus on wether this could be merged?

@pmetzger
Copy link
Member

@toots I think there's at least one code review comment above that hasn't yet been addressed? Generally I think we should merge though?

@hhugo
Copy link
Contributor

hhugo commented Oct 24, 2022

I currently can't close/resolve threads. There are still a bunch unaddressed ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support of as Named capture groups
4 participants