Sync to linguist 7.2.0: heuristics.yml support #189

Merged
merged 27 commits into from
Feb 14, 2019
Changes from 20 commits
9c8053f
gen: small refactoring of generator.Documentation
bzz Dec 28, 2018
ac13a24
generator: godoc update
bzz Dec 28, 2018
d5b665b
gen: initial version of new content heuristics
bzz Jan 9, 2019
6fd4849
gen: skip&report unsupported regexp syntax
bzz Jan 4, 2019
bbc27d5
sync \w Github Linguist v7.1.3
bzz Jan 8, 2019
5505ed2
gen: adjust Ruby regexp syntax to RE2
bzz Jan 9, 2019
3051773
gen: same alias/language name lookup as in linguist
bzz Jan 9, 2019
dceb95a
gen: fix regexp Or syntax
bzz Jan 9, 2019
df7844e
test: add test for content heuristics edge cases
bzz Jan 9, 2019
6472714
gen: fix aliases case
bzz Jan 10, 2019
43d1c6d
doc: update to currenet state
bzz Jan 28, 2019
33bb5a6
test: update to expect names + .gold generation
bzz Jan 28, 2019
2049da9
test: update *.gold results
bzz Jan 28, 2019
63f3661
Apply suggestions from code review
creachadair Jan 28, 2019
5fbadc8
Address review feedback
bzz Jan 28, 2019
65b9545
test: fix nasty 'go test' exit code 0 on failing tests :/
bzz Jan 29, 2019
ef9311e
tests: document edge case and fix the tests
bzz Jan 29, 2019
fb61eaa
gen: add gold test restuls generation + docs sync better
bzz Jan 29, 2019
c57bc4a
gen: add missing GoDoc
bzz Jan 29, 2019
c4f3dbe
cleanup, addressing code review feedback
bzz Jan 29, 2019
97ab29a
heuristics: refactoring, extracting rule package
bzz Feb 5, 2019
ec54891
review: remove indirection + expose Heuristic
bzz Feb 13, 2019
191aa8c
gen: update generated code
bzz Feb 13, 2019
6d601d6
review: do not export pattens
bzz Feb 13, 2019
bff3a15
review: nuke regexp dependency
bzz Feb 13, 2019
3872cbd
rule: small godoc cleanup
bzz Feb 14, 2019
fc48fc4
sync \w linguist v7.2.0
bzz Feb 14, 2019
61 changes: 61 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,61 @@
# source{d} Contributing Guidelines

source{d} projects accept contributions via GitHub pull requests.
This document outlines some of the
conventions on development workflow, commit message formatting, contact points,
and other resources to make it easier to get your contribution accepted.

## Certificate of Origin

By contributing to this project, you agree to the [Developer Certificate of
Origin (DCO)](DCO). This document was created by the Linux Kernel community and is a
simple statement that you, as a contributor, have the legal right to make the
contribution.

To show your agreement with the DCO, include the following line at the end of the commit message,
using your real name: `Signed-off-by: John Doe <[email protected]>`.

This can be done easily using the [`-s`](https://github.com/git/git/blob/b2c150d3aa82f6583b9aadfecc5f8fa1c74aca09/Documentation/git-commit.txt#L154-L161) flag of `git commit`.

If you have already pushed a few commits without `Signed-off-by`, you can still add it afterwards. We wrote a guide that can help: [fix-DCO.md](https://github.com/src-d/guide/blob/master/developer-community/fix-DCO.md).

## Support Channels

The official support channels, for both users and contributors, are:

- GitHub issues: each repository has its own list of issues.
- Slack: join the [source{d} Slack](https://join.slack.com/t/sourced-community/shared_invite/enQtMjc4Njk5MzEyNzM2LTFjNzY4NjEwZGEwMzRiNTM4MzRlMzQ4MmIzZjkwZmZlM2NjODUxZmJjNDI1OTcxNDAyMmZlNmFjODZlNTg0YWM) community.

*Before opening a new issue or submitting a new pull request, it's helpful to
search the project - it's likely that another user has already reported the
issue you're facing, or it's a known issue that we're already aware of.*


## How to Contribute

Pull Requests (PRs) are the only way to contribute code to source{d} projects.
For a PR to be accepted, it needs to meet the following requirements:

- The contribution must be clearly explained in natural language and provide a minimal working example that reproduces it.
- All PRs must be written idiomatically:
  - for Go: formatted according to [gofmt](https://golang.org/cmd/gofmt/), and without any warnings from [go lint](https://github.com/golang/lint) or [go vet](https://golang.org/cmd/vet/)
  - for other languages, similar constraints apply.
- They should in general include tests, and those tests must pass.
  - If the PR is a bug fix, it has to include a new unit test that fails before the patch is merged.
  - If the PR is a new feature, it has to come with a suite of unit tests that test the new functionality.
- In any case, all PRs have to pass the personal evaluation of at least one of the [maintainers](MAINTAINERS) of the project.


### Format of the commit message

Every commit message should describe what was changed, under which context and, if applicable, the GitHub issue it relates to:

```
plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623
```

The format can be described more formally as follows:

```
<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]
```
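
The format above can be checked mechanically. The following is a minimal sketch, not part of the guidelines themselves: the pattern, the `Commit` type and `parseSubject` are hypothetical names, and the sketch assumes that both the subpackage and the trailing `Fixes #<issue-number>` are optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// commitSubject is a hypothetical pattern for
// "<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]".
// Assumption: the subpackage and the "Fixes" suffix may both be absent.
var commitSubject = regexp.MustCompile(
	`^([\w./-]+): (?:([\w./-]+), )?(.+?)(?: Fixes #(\d+))?$`)

// Commit holds the parts extracted from a commit subject line.
type Commit struct {
	Package    string
	Subpackage string
	Change     string
	Issue      string
}

// parseSubject splits a subject line into its parts and reports
// whether the line follows the assumed format.
func parseSubject(s string) (Commit, bool) {
	m := commitSubject.FindStringSubmatch(s)
	if m == nil {
		return Commit{}, false
	}
	return Commit{Package: m[1], Subpackage: m[2], Change: m[3], Issue: m[4]}, true
}

func main() {
	c, ok := parseSubject("plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623")
	fmt.Println(ok, c.Package, c.Subpackage, c.Issue)
}
```

A check like this could run in a commit hook or CI step; whether the project enforces the format automatically is not stated here.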
5 changes: 5 additions & 0 deletions Makefile
@@ -46,6 +46,11 @@ clean: clean-linguist clean-shared
code-generate: $(LINGUIST_PATH)
mkdir -p data && \
go run internal/code-generator/main.go
ENRY_TEST_REPO="$${PWD}/.linguist" go test -v \
-run Test_GeneratorTestSuite \
./internal/code-generator/generator \
-testify.m TestUpdateGeneratorTestSuiteGold \
-update_gold

benchmarks: $(LINGUIST_PATH)
go test -run=NONE -bench=. && \
19 changes: 8 additions & 11 deletions README.md
@@ -154,14 +154,15 @@ Generated Java bindings using a C-shared library and JNI are located under [`jav
Development
------------

*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate the necessary code you must run:
*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate all the necessary code you must run:

git clone https://github.com/github/linguist.git .linguist
go generate

We update enry when changes are done in linguist's master branch on the following files:

* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

@@ -183,17 +184,13 @@ Divergences from linguist
Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,
* [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, because the RE2 regexp engine does not support backreferences

`elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`
* Heuristics for ".ice" extension cannot be parsed due to a bug in the regexp syntax upstream [github/linguist#4376](https://github.com/github/linguist/pull/4376)

which we can't port.
* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. Tracked under https://github.com/src-d/enry/issues/193

* All files for the SQL language fall to the classifier because we don't parse
this [disambiguator
expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
for `*.sql` files right. This expression doesn't comply with the pattern for the
rest in [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).
* The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". Tracked under https://github.com/src-d/enry/issues/194
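
The parsing failures listed above come from regexp features that Go's `regexp` package (RE2 syntax) rejects, such as backreferences and lookbehind. The following is a minimal sketch of how such patterns can be detected at generation time. The patterns are illustrative only and are not copied from `heuristics.yml`, and `compiles` is a hypothetical helper, not part of enry.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package (RE2 syntax)
// accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Illustrative patterns, not taken from linguist.
	patterns := []string{
		`^\s*import\s`,         // plain pattern, valid in RE2
		`(['"])use strict\1`,   // backreference, not supported by RE2
		`(?<!\S)\.include\s`,   // negative lookbehind, not supported by RE2
	}
	for _, p := range patterns {
		fmt.Printf("%q supported=%v\n", p, compiles(p))
	}
}
```

A generator can use a check like this to skip and report a rule instead of emitting code that would panic in `regexp.MustCompile` at start-up.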

`enry` [CLI tool](#cli) does not require a full Git repository to be present in filesystem in order to report languages.

@@ -232,7 +229,7 @@ As benchmarks depend on Ruby and Github-Linguist gem make sure you have:
If you want to reproduce the same benchmarks as reported above:
- Make sure all [dependencies](#benchmark-dependencies) are installed
- Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- Run `ENRY_TEST_REPO=.linguist benchmarks/run.sh` (takes ~15h)
- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and linguist, parse the output, create csv files and plot the histogram. This takes some time.

5 changes: 1 addition & 4 deletions benchmark_test.go
@@ -28,9 +28,6 @@ var (
)

func TestMain(m *testing.M) {
var exitCode int
defer os.Exit(exitCode)
You may also use anonymous function to still have defer:

defer func(){
  os.Exit(exitCode)
}()

But in this case it's worth setting exitCode = 1 by default. Sounds more complex than current approach, but wanted to mention it just in case :)


flag.BoolVar(&slow, "slow", false, "run benchmarks per sample for strategies too")
flag.Parse()

@@ -47,7 +44,7 @@ func TestMain(m *testing.M) {
log.Fatal(err)
}

exitCode = m.Run()
os.Exit(m.Run())
}

func cloneLinguist(linguistURL string) error {
11 changes: 5 additions & 6 deletions common.go
@@ -16,7 +16,7 @@ const OtherLanguage = ""
// Strategy type fix the signature for the functions that can be used as a strategy.
type Strategy func(filename string, content []byte, candidates []string) (languages []string)

// DefaultStrategies is the strategies' sequence GetLanguage uses to detect languages.
// DefaultStrategies is a sequence of strategies used by GetLanguage to detect languages.
var DefaultStrategies = []Strategy{
GetLanguagesByModeline,
GetLanguagesByFilename,
@@ -396,12 +396,13 @@ func GetLanguagesByContent(filename string, content []byte, _ []string) []string
}

ext := strings.ToLower(filepath.Ext(filename))
fnMatcher, ok := data.ContentMatchers[ext]

heuristic, ok := data.ContentHeuristics[ext]
if !ok {
return nil
}

return fnMatcher(content)
return heuristic.Match(content)
}

// GetLanguagesByClassifier uses DefaultClassifier as a Classifier and returns a sorted slice of possible languages ordered by
@@ -454,9 +455,7 @@ func GetLanguageType(language string) (langType Type) {
// GetLanguageByAlias returns either the language related to the given alias and ok set to true
// or Otherlanguage and ok set to false if the alias is not recognized.
func GetLanguageByAlias(alias string) (lang string, ok bool) {
a := strings.Split(alias, `,`)[0]
a = strings.ToLower(a)
lang, ok = data.LanguagesByAlias[a]
lang, ok = data.LanguageByAlias(alias)
if !ok {
lang = OtherLanguage
}
75 changes: 51 additions & 24 deletions common_test.go
@@ -11,6 +11,7 @@ import (
"gopkg.in/src-d/enry.v1/data"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/stretchr/testify/suite"
)

@@ -19,9 +20,36 @@ const linguistClonedEnvVar = "ENRY_TEST_REPO"

type EnryTestSuite struct {
suite.Suite
repoLinguist string
samplesDir string
cloned bool
tmpLinguist string
needToClone bool
samplesDir string
}

func (s *EnryTestSuite) TestRegexpEdgeCases() {
var regexpEdgeCases = []struct {
lang string
filename string
}{
{lang: "ActionScript", filename: "FooBar.as"},
{lang: "Forth", filename: "asm.fr"},
{lang: "X PixMap", filename: "cc-public_domain_mark_white.pm"},
//{lang: "SQL", filename: "drop_stuff.sql"}, // https://github.com/src-d/enry/issues/194
{lang: "Fstar", filename: "Hacl.Spec.Bignum.Fmul.fst"},
{lang: "C++", filename: "Types.h"},
}

for _, r := range regexpEdgeCases {
filename := fmt.Sprintf("%s/samples/%s/%s", s.tmpLinguist, r.lang, r.filename)

content, err := ioutil.ReadFile(filename)
require.NoError(s.T(), err)

lang := GetLanguage(r.filename, content)
s.T().Logf("File:%s, lang:%s", filename, lang)

expLang, _ := data.LanguageByAlias(r.lang)
require.EqualValues(s.T(), expLang, lang)
}
}

func Test_EnryTestSuite(t *testing.T) {
@@ -30,25 +58,24 @@ func Test_EnryTestSuite(t *testing.T) {

func (s *EnryTestSuite) SetupSuite() {
var err error
s.repoLinguist = os.Getenv(linguistClonedEnvVar)
s.cloned = s.repoLinguist == ""
if s.cloned {
s.repoLinguist, err = ioutil.TempDir("", "linguist-")
assert.NoError(s.T(), err)
}

s.samplesDir = filepath.Join(s.repoLinguist, "samples")

if s.cloned {
cmd := exec.Command("git", "clone", linguistURL, s.repoLinguist)
s.tmpLinguist = os.Getenv(linguistClonedEnvVar)
s.needToClone = s.tmpLinguist == ""
if s.needToClone {
s.tmpLinguist, err = ioutil.TempDir("", "linguist-")
require.NoError(s.T(), err)
s.T().Logf("Cloning Linguist repo to '%s' as %s was not set\n",
s.tmpLinguist, linguistClonedEnvVar)
cmd := exec.Command("git", "clone", linguistURL, s.tmpLinguist)
err = cmd.Run()
assert.NoError(s.T(), err)
require.NoError(s.T(), err)
}
s.samplesDir = filepath.Join(s.tmpLinguist, "samples")
s.T().Logf("using samples from %s", s.samplesDir)

cwd, err := os.Getwd()
assert.NoError(s.T(), err)

err = os.Chdir(s.repoLinguist)
err = os.Chdir(s.tmpLinguist)
assert.NoError(s.T(), err)

cmd := exec.Command("git", "checkout", data.LinguistCommit)
@@ -60,8 +87,8 @@ func (s *EnryTestSuite) SetupSuite() {
}

func (s *EnryTestSuite) TearDownSuite() {
if s.cloned {
err := os.RemoveAll(s.repoLinguist)
if s.needToClone {
err := os.RemoveAll(s.tmpLinguist)
assert.NoError(s.T(), err)
}
}
@@ -88,7 +115,7 @@ func (s *EnryTestSuite) TestGetLanguage() {
}

func (s *EnryTestSuite) TestGetLanguagesByModelineLinguist() {
var modelinesDir = filepath.Join(s.repoLinguist, "test/fixtures/Data/Modelines")
var modelinesDir = filepath.Join(s.tmpLinguist, "test/fixtures/Data/Modelines")

tests := []struct {
name string
@@ -400,15 +427,16 @@ func (s *EnryTestSuite) TestGetLanguageByAlias() {
func (s *EnryTestSuite) TestLinguistCorpus() {
const filenamesDir = "filenames"
var cornerCases = map[string]bool{
"hello.ms": true,
"drop_stuff.sql": true, // https://github.com/src-d/enry/issues/194
// .es and .ice fail heuristics parsing, but do not fail any tests
}

var total, failed, ok, other int
var expected string
filepath.Walk(s.samplesDir, func(path string, f os.FileInfo, err error) error {
if f.IsDir() {
if f.Name() != filenamesDir {
expected = f.Name()
expected, _ = data.LanguageByAlias(f.Name())
}

return nil
@@ -431,17 +459,16 @@ func (s *EnryTestSuite) TestLinguistCorpus() {
} else {
status = "failed"
failed++

}

if _, ok := cornerCases[filename]; ok {
fmt.Printf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
s.T().Logf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
} else {
assert.Equal(s.T(), expected, obtained, fmt.Sprintf("%s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status))
}

return nil
})

fmt.Printf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
s.T().Logf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
}