New, optional flex-based tokenizer #218

Merged: 6 commits merged into src-d:master from tokenizer-flex-cgo on Apr 17, 2019
Conversation

@bzz (Contributor) commented Apr 8, 2019

This change partially addresses #193.

It follows option 3 from #193 (comment)

  • a new native tokenizer implementation, hidden behind -tags flex (a sketch of how the build-tag split could look follows this list)
  • the existing native C code from the Linguist project, generated by Flex from tokenizer.l, is added under internal/tokenizer/flex/lex.linguist_yy.{h,c}
  • tokenize_c.go is a manual transliteration to Go of the parser from linguist.c
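As mentioned above, here is a minimal sketch of how the build-tag split could look. The file names, the import path, and the tokenizeGo helper are assumptions made for illustration; only TokenizeFlex is a name actually introduced in this PR.

```go
// tokenize_flex.go (hypothetical file): compiled only with `-tags flex`.
// +build flex

package tokenizer

// Assumed import path for the new cgo-backed package under internal/tokenizer/flex.
import "gopkg.in/src-d/enry.v1/internal/tokenizer/flex"

// Tokenize delegates to the Flex-generated cgo tokenizer when the tag is set.
func Tokenize(content []byte) []string {
	return flex.TokenizeFlex(content)
}
```

And the counterpart that keeps the current pure-Go tokenizer as the default, so normal builds and releases are unaffected:

```go
// tokenize_default.go (hypothetical file): compiled when the flex tag is absent.
// +build !flex

package tokenizer

// Tokenize keeps the existing pure-Go implementation as the default.
func Tokenize(content []byte) []string {
	return tokenizeGo(content) // hypothetical name for the existing regex-based tokenizer
}
```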

The benchmark before:

$ go test -bench=. -run=Benchmark. ./internal/tokenizer
BenchmarkTokenizer_BaselineCopy-4   	 5000000	       304 ns/op	    1536 B/op	       1 allocs/op
BenchmarkTokenizer-4                	    2000	    672592 ns/op	   58205 B/op	     435 allocs/op

And after:

$ go test -bench=. -run=Benchmark. -tags flex ./internal/tokenizer
BenchmarkTokenizer_BaselineCopy-4   	 5000000	       300 ns/op	    1536 B/op	       1 allocs/op
BenchmarkTokenizer-4                	   20000	     69302 ns/op	    5064 B/op	     150 allocs/op

So the flex-based tokenizer is roughly 10x faster.

But because it uses cgo and complicates the build, I suggest merging it while keeping it behind the build tag.

Right now this is only useful for debugging #194, and it is ready to be merged without affecting the releases.

But if we agree on its value for downstream users for performance reasons, some extra work will be needed (in a subsequent PR: documentation plus updated classifier assets based on this tokenizer that do not overwrite the default ones) before allowing usage beyond the internal dev debugging workflow. This will be tracked under #193.

@bzz force-pushed the tokenizer-flex-cgo branch from 3072e82 to 8c1d1a6 on April 8, 2019 15:55
@bzz requested review from creachadair and dennwc on April 8, 2019 16:02
@bzz self-assigned this Apr 8, 2019
@bzz force-pushed the tokenizer-flex-cgo branch from 8c1d1a6 to 682f026 on April 8, 2019 16:05
Resolved review threads on: internal/tokenizer/tokenize_test.go, internal/tokenizer/flex/tokenize_c.go (three threads), internal/tokenizer/tokenize_c.go
@bzz added this to the v2.0.0 milestone Apr 14, 2019
@bzz changed the title from "New, flex-based tokenizer option proposal" to "New, optional flex-based tokenizer" Apr 14, 2019
bzz added 3 commits April 14, 2019 21:38
@bzz force-pushed the tokenizer-flex-cgo branch from 682f026 to 7929933 on April 14, 2019 19:38
@bzz (Contributor, Author) commented Apr 14, 2019

@creachadair all feedback addressed, ready for another round.

@creachadair (Contributor) left a review:

A couple of minor and optional suggestions; otherwise this looks good.

Resolved review threads on: internal/tokenizer/tokenize.go, internal/tokenizer/tokenize_c.go
C.linguist_yylex_init_extra(&extra, &scanner)
buf = C.linguist_yy_scan_bytes((*C.char)(cs), _len, scanner)

ary := []string{}
A reviewer (Contributor) commented with a suggested change:

Suggested change:
-    ary := []string{}
+    var ary []string

(source)

@bzz (Contributor, Author) replied Apr 16, 2019:

This would sometimes result in returning nil instead of an empty slice, and although I'm aware of https://github.com/golang/go/wiki/CodeReviewComments#declaring-empty-slices, that would be inconsistent with how the current implementation of the tokenizer API works.

That's why I decided to do it this way and keep them the same.

The reviewer (Contributor) replied:

I'm not sure if it matters from the API point of view - it's pretty rare to test for slice nil-ness.
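For context, a small self-contained snippet (not from this PR) showing the only observable difference between the two declarations being discussed:

```go
package main

import "fmt"

func main() {
	var a []string  // nil slice: length 0, compares equal to nil
	b := []string{} // empty but non-nil slice: length 0, not equal to nil

	fmt.Println(a == nil, len(a)) // true 0
	fmt.Println(b == nil, len(b)) // false 0

	// append and range behave identically for both,
	// which is why callers rarely depend on nil-ness.
	fmt.Println(append(a, "x"), append(b, "x")) // [x] [x]
}
```

One place the difference does surface is encoding/json, where a nil slice marshals to null and an empty slice to [], which is the kind of API-surface consistency the author is preserving.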

Resolved review thread on: internal/tokenizer/tokenize_test.go
// TokenizeFlex implements tokenizer by calling Flex generated code from linguist in C
// This is a transliteration from C https://github.com/github/linguist/blob/master/ext/linguist/linguist.c#L12
func TokenizeFlex(content []byte) []string {
	var buf C.YY_BUFFER_STATE
A reviewer (Contributor) commented:

Merge into a single var block

@bzz (Contributor, Author) replied:

I know what you mean, but I would prefer to keep it as close to the original as possible, since this is an almost line-by-line transliteration from C.

The reviewer (Contributor) replied:

The key word is "almost" :) Unless it's machine-generated, I think it makes sense to adhere to Go conventions where possible. It's not a big change anyway. Still, it's optional, so feel free to ignore.
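For reference, the style difference under discussion. Only buf and its type appear in the excerpt above; the extra and scanner declarations and their C types are guesses based on the linguist.c original, and the snippet only compiles inside the cgo file that imports "C":

```go
// Line-by-line transliteration style, mirroring the C declarations:
var buf C.YY_BUFFER_STATE
var extra C.struct_tokenizer_extra // assumed type
var scanner C.yyscan_t             // assumed type

// The Go-idiomatic form the reviewer suggests, grouped into one var block:
var (
	buf     C.YY_BUFFER_STATE
	extra   C.struct_tokenizer_extra
	scanner C.yyscan_t
)
```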

import "unsafe"

const maxTokenLen = 32 // bytes

A reviewer (Contributor) commented:

I would also propose including a go:generate directive that updates the C tokenizer, if that makes sense.

@bzz (Contributor, Author) replied Apr 16, 2019:

I may not have communicated it well, but just to make it clear: the only code that is "generated" in this PR is the C lex.linguist_yy.{h,c}, which is vendored inside Linguist, and we are using an exact copy of it.

In the long run, if this tokenizer turns out to be useful beyond classifier-accuracy debugging, it does make sense to add a script that generates the C code from the flex grammar, but I also think it's fine to do that only after #193 is done (so it adds value, not just a maintenance burden).

The reviewer (Contributor) replied:

Sorry, I didn't mean regenerating it. go:generate could also be used to wget the latest source, for example. Not critical, just a thought.
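A minimal sketch of what that could look like, assuming the vendored sources live under internal/tokenizer/flex and that upstream Linguist publishes the generated files at the paths shown (the URLs and file placement are assumptions, not part of this PR):

```go
// Hypothetical directives placed in internal/tokenizer/flex/tokenize_c.go; running
// `go generate ./internal/tokenizer/flex` would refresh the vendored C sources.
//go:generate wget -q -O lex.linguist_yy.c https://raw.githubusercontent.com/github/linguist/master/ext/linguist/lex.linguist_yy.c
//go:generate wget -q -O lex.linguist_yy.h https://raw.githubusercontent.com/github/linguist/master/ext/linguist/lex.linguist_yy.h
```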

@bzz (Contributor, Author) commented Apr 16, 2019

@dennwc all feedback addressed, ready for another round.

bzz added 3 commits April 16, 2019 19:38
@bzz force-pushed the tokenizer-flex-cgo branch from cbb4c68 to 7e136ba on April 16, 2019 17:38
@bzz merged commit b6daf5c into src-d:master Apr 17, 2019
@bzz deleted the tokenizer-flex-cgo branch April 17, 2019 11:38