New, optional flex-based tokenizer #218

Merged: 6 commits merged into src-d:master from tokenizer-flex-cgo on Apr 17, 2019
Conversation

@bzz (Contributor) commented Apr 8, 2019

This change partially addresses #193.

It follows option 3 from #193 (comment)

  • a new native tokenizer implementation, hidden behind -tags flex (a sketch of how the build-tag split could look follows this list)
  • the existing native C code from the Linguist project, generated by Flex from tokenizer.l, is added under internal/tokenizer/flex/lex.linguist_yy.{h,c}
  • tokenize_c.go is a manual transliteration to Go of the parser from linguist.c
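As mentioned above, here is a minimal sketch of how the build-tag split could look. The file names, the import path, and the tokenizeGo helper are assumptions made for illustration; only TokenizeFlex is a name actually introduced in this PR.

```go
// tokenize_flex.go (hypothetical file): compiled only with `-tags flex`.
// +build flex

package tokenizer

// Assumed import path for the new cgo-backed package under internal/tokenizer/flex.
import "gopkg.in/src-d/enry.v1/internal/tokenizer/flex"

// Tokenize delegates to the Flex-generated cgo tokenizer when the tag is set.
func Tokenize(content []byte) []string {
	return flex.TokenizeFlex(content)
}
```

And the counterpart that keeps the current pure-Go tokenizer as the default, so normal builds and releases are unaffected:

```go
// tokenize_default.go (hypothetical file): compiled when the flex tag is absent.
// +build !flex

package tokenizer

// Tokenize keeps the existing pure-Go implementation as the default.
func Tokenize(content []byte) []string {
	return tokenizeGo(content) // hypothetical name for the existing regex-based tokenizer
}
```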

The benchmark before:

$ go test -bench=. -run=Benchmark. ./internal/tokenizer
BenchmarkTokenizer_BaselineCopy-4   	 5000000	       304 ns/op	    1536 B/op	       1 allocs/op
BenchmarkTokenizer-4                	    2000	    672592 ns/op	   58205 B/op	     435 allocs/op

And after:

$ go test -bench=. -run=Benchmark. -tags flex ./internal/tokenizer
BenchmarkTokenizer_BaselineCopy-4   	 5000000	       300 ns/op	    1536 B/op	       1 allocs/op
BenchmarkTokenizer-4                	   20000	     69302 ns/op	    5064 B/op	     150 allocs/op

So the flex-based tokenizer is roughly 10x faster.

But because it uses cgo and complicates the build, I suggest merging it while keeping it behind the build tag.

Right now this is only useful for debugging #194, and it is ready to be merged without affecting the releases.

But if we agree on its value for downstream users for performance reasons, some extra work will be needed (in a subsequent PR: documentation plus updated classifier assets based on this tokenizer that do not overwrite the default ones) before allowing usage beyond the internal dev debugging workflow. This will be tracked under #193.

@bzz force-pushed the tokenizer-flex-cgo branch from 3072e82 to 8c1d1a6 on April 8, 2019 15:55
@bzz requested review from creachadair and dennwc on April 8, 2019 16:02
@bzz self-assigned this Apr 8, 2019
@bzz force-pushed the tokenizer-flex-cgo branch from 8c1d1a6 to 682f026 on April 8, 2019 16:05
Resolved review threads on: internal/tokenizer/tokenize_test.go, internal/tokenizer/flex/tokenize_c.go (three threads), internal/tokenizer/tokenize_c.go
@bzz added this to the v2.0.0 milestone Apr 14, 2019
@bzz changed the title from "New, flex-based tokenizer option proposal" to "New, optional flex-based tokenizer" Apr 14, 2019
bzz added 3 commits April 14, 2019 21:38
@bzz force-pushed the tokenizer-flex-cgo branch from 682f026 to 7929933 on April 14, 2019 19:38
@bzz (Contributor, Author) commented Apr 14, 2019

@creachadair all feedback addressed, ready for another round.

@creachadair (Contributor) left a review:

A couple of minor and optional suggestions; otherwise this looks good.

Resolved review threads on: internal/tokenizer/tokenize.go, internal/tokenizer/tokenize_c.go
C.linguist_yylex_init_extra(&extra, &scanner)
buf = C.linguist_yy_scan_bytes((*C.char)(cs), _len, scanner)

ary := []string{}
A reviewer (Contributor) commented with a suggested change:

Suggested change:
-    ary := []string{}
+    var ary []string

(source)

@bzz (Contributor, Author) replied Apr 16, 2019:

This would sometimes result in returning nil instead of an empty slice, and although I'm aware of https://github.com/golang/go/wiki/CodeReviewComments#declaring-empty-slices, that would be inconsistent with how the current implementation of the tokenizer API works.

That's why I decided to do it this way and keep them the same.

The reviewer (Contributor) replied:

I'm not sure if it matters from the API point of view - it's pretty rare to test for slice nil-ness.
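For context, a small self-contained snippet (not from this PR) showing the only observable difference between the two declarations being discussed:

```go
package main

import "fmt"

func main() {
	var a []string  // nil slice: length 0, compares equal to nil
	b := []string{} // empty but non-nil slice: length 0, not equal to nil

	fmt.Println(a == nil, len(a)) // true 0
	fmt.Println(b == nil, len(b)) // false 0

	// append and range behave identically for both,
	// which is why callers rarely depend on nil-ness.
	fmt.Println(append(a, "x"), append(b, "x")) // [x] [x]
}
```

One place the difference does surface is encoding/json, where a nil slice marshals to null and an empty slice to [], which is the kind of API-surface consistency the author is preserving.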

Resolved review thread on: internal/tokenizer/tokenize_test.go
// TokenizeFlex implements tokenizer by calling Flex generated code from linguist in C
// This is a transliteration from C https://github.com/github/linguist/blob/master/ext/linguist/linguist.c#L12
func TokenizeFlex(content []byte) []string {
	var buf C.YY_BUFFER_STATE
A reviewer (Contributor) commented:

Merge into a single var block

@bzz (Contributor, Author) replied:

I know what you mean, but I would prefer to keep it as close to the original as possible, since this is an almost line-by-line transliteration from C.

The reviewer (Contributor) replied:

The key word is "almost" :) Unless it's machine-generated, I think it makes sense to adhere to Go conventions where possible. It's not a big change anyway. Still, it's optional, so feel free to ignore.
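For reference, the style difference under discussion. Only buf and its type appear in the excerpt above; the extra and scanner declarations and their C types are guesses based on the linguist.c original, and the snippet only compiles inside the cgo file that imports "C":

```go
// Line-by-line transliteration style, mirroring the C declarations:
var buf C.YY_BUFFER_STATE
var extra C.struct_tokenizer_extra // assumed type
var scanner C.yyscan_t             // assumed type

// The Go-idiomatic form the reviewer suggests, grouped into one var block:
var (
	buf     C.YY_BUFFER_STATE
	extra   C.struct_tokenizer_extra
	scanner C.yyscan_t
)
```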

import "unsafe"

const maxTokenLen = 32 // bytes

A reviewer (Contributor) commented:

I would also propose including a go:generate directive that updates the C tokenizer, if that makes sense.

@bzz (Contributor, Author) replied Apr 16, 2019:

I may not have communicated it well, but just to make it clear: the only code that is "generated" in this PR is the C lex.linguist_yy.{h,c}, which is vendored inside Linguist, and we are using an exact copy of it.

In the long run, if this tokenizer turns out to be useful beyond classifier-accuracy debugging, it does make sense to add a script that generates the C code from the flex grammar, but I also think it's fine to do that only after #193 is done (so it adds value, not just a maintenance burden).

The reviewer (Contributor) replied:

Sorry, I didn't mean regenerating it. go:generate could also be used to wget the latest source, for example. Not critical, just a thought.
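A minimal sketch of what that could look like, assuming the vendored sources live under internal/tokenizer/flex and that upstream Linguist publishes the generated files at the paths shown (the URLs and file placement are assumptions, not part of this PR):

```go
// Hypothetical directives placed in internal/tokenizer/flex/tokenize_c.go; running
// `go generate ./internal/tokenizer/flex` would refresh the vendored C sources.
//go:generate wget -q -O lex.linguist_yy.c https://raw.githubusercontent.com/github/linguist/master/ext/linguist/lex.linguist_yy.c
//go:generate wget -q -O lex.linguist_yy.h https://raw.githubusercontent.com/github/linguist/master/ext/linguist/lex.linguist_yy.h
```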

@bzz (Contributor, Author) commented Apr 16, 2019

@dennwc all feedback addressed, ready for another round.

bzz added 3 commits April 16, 2019 19:38
@bzz force-pushed the tokenizer-flex-cgo branch from cbb4c68 to 7e136ba on April 16, 2019 17:38
@bzz merged commit b6daf5c into src-d:master Apr 17, 2019
@bzz deleted the tokenizer-flex-cgo branch April 17, 2019 11:38