src-d · bzz · Feb 14, 2019 · Dec 28, 2018 · Dec 28, 2018 · Jan 9, 2019
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,61 @@
+# source{d} Contributing Guidelines
+
+source{d} projects accept contributions via GitHub pull requests.
+This document outlines some of the
+conventions on development workflow, commit message formatting, contact points,
+and other resources to make it easier to get your contribution accepted.
+
+## Certificate of Origin
+
+By contributing to this project, you agree to the [Developer Certificate of
+Origin (DCO)](DCO). This document was created by the Linux Kernel community and is a
+simple statement that you, as a contributor, have the legal right to make the
+contribution.
+
+To show your agreement with the DCO, include the following line at the end of the commit message,
+using your real name: `Signed-off-by: John Doe <[email protected]>`.
+
+This can be done easily with the [`-s`](https://github.com/git/git/blob/b2c150d3aa82f6583b9aadfecc5f8fa1c74aca09/Documentation/git-commit.txt#L154-L161) flag of `git commit`.
+
+If you have already pushed a few commits without `Signed-off-by`, you can still add it afterwards. We wrote a guide that can help: [fix-DCO.md](https://github.com/src-d/guide/blob/master/developer-community/fix-DCO.md).
+
+## Support Channels
+
+The official support channels, for both users and contributors, are:
+
+- GitHub issues: each repository has its own list of issues.
+- Slack: join the [source{d} Slack](https://join.slack.com/t/sourced-community/shared_invite/enQtMjc4Njk5MzEyNzM2LTFjNzY4NjEwZGEwMzRiNTM4MzRlMzQ4MmIzZjkwZmZlM2NjODUxZmJjNDI1OTcxNDAyMmZlNmFjODZlNTg0YWM) community.
+
+*Before opening a new issue or submitting a new pull request, it's helpful to
+search the project: it's likely that another user has already reported the
+issue you're facing, or it's a known issue that we're already aware of.*
+
+
+## How to Contribute
+
+Pull Requests (PRs) are the only way to contribute code to source{d} projects.
+For a PR to be accepted, it must meet the following requirements:
+
+- The contribution must be clearly explained in natural language and provide a minimal working example that reproduces it.
+- All PRs must be written idiomatically:
+    - for Go: formatted according to [gofmt](https://golang.org/cmd/gofmt/), and without any warnings from [go lint](https://github.com/golang/lint) or [go vet](https://golang.org/cmd/vet/)
+    - for other languages, similar constraints apply.
+- PRs should generally include tests, and those tests must pass.
+    - If the PR is a bug fix, it has to include a new unit test that fails without the patch.
+    - If the PR is a new feature, it has to come with a suite of unit tests that cover the new functionality.
+    - In any case, all PRs have to pass the personal evaluation of at least one of the [maintainers](MAINTAINERS) of the project.
+
+
+### Format of the commit message
+
+Every commit message should describe what was changed, in which context and, if applicable, the GitHub issue it relates to:
+
+```
+plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623
+```
+
+The format can be described more formally as follows:
+
+```
+<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]
+```
diff --git a/Makefile b/Makefile
@@ -46,6 +46,11 @@ clean: clean-linguist clean-shared
 code-generate: $(LINGUIST_PATH)
 	mkdir -p data && \
 	go run internal/code-generator/main.go
+	ENRY_TEST_REPO="$${PWD}/.linguist" go test  -v \
+		-run Test_GeneratorTestSuite \
+		./internal/code-generator/generator \
+		-testify.m TestUpdateGeneratorTestSuiteGold \
+		-update_gold
 
 benchmarks: $(LINGUIST_PATH)
 	go test -run=NONE -bench=. && \

diff --git a/README.md b/README.md
@@ -154,14 +154,15 @@ Generated Java bindings using a C-shared library and JNI are located under [`jav
 Development
 ------------
 
-*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate the necessary code you must run:
+*enry* re-uses parts of the original [linguist](https://github.com/github/linguist) to generate internal data structures. To update to the latest upstream and generate all the necessary code, you must run:
 
+    git clone https://github.com/github/linguist.git .linguist
     go generate
 
 We update enry when changes are done in linguist's master branch on the following files:
 
 * [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
-* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
+* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
 * [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
 * [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)
 
@@ -183,17 +184,13 @@ Divergences from linguist
 Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
 as a set for the tests, the following issues were found:
 
-* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,
+* [Heuristics for the ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, because the RE2 regexp engine does not support backreferences.
 
-    `elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`
+* Heuristics for the ".ice" extension cannot be parsed due to a regexp syntax bug upstream: [github/linguist#4376](https://github.com/github/linguist/pull/4376).
 
-    which we can't port.
+* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the regex-based [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) algorithm. Tracked under https://github.com/src-d/enry/issues/193
 
-* All files for the SQL language fall to the classifier because we don't parse
-this [disambiguator
-expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
-for `*.sql` files right. This expression doesn't comply with the pattern for the
-rest in [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).
+* The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". Tracked under https://github.com/src-d/enry/issues/194
 
 `enry` [CLI tool](#cli) does not require a full Git repository to be present in filesystem in order to report languages.
 
@@ -232,7 +229,7 @@ As benchmarks depend on Ruby and Github-Linguist gem make sure you have:
 If you want to reproduce the same benchmarks as reported above:
  - Make sure all [dependencies](#benchmark-dependencies) are installed
  - Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- - Run `ENRY_TEST_REPO=.linguist benchmarks/run.sh` (takes ~15h)
+ - Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)
 
 It will run the benchmarks for enry and linguist, parse the output, create csv files and plot the histogram. This takes some time.
 

diff --git a/benchmark_test.go b/benchmark_test.go
@@ -28,9 +28,6 @@ var (
 )
 
 func TestMain(m *testing.M) {
-	var exitCode int
-	defer os.Exit(exitCode)
-
 	flag.BoolVar(&slow, "slow", false, "run benchmarks per sample for strategies too")
 	flag.Parse()
 
@@ -47,7 +44,7 @@ func TestMain(m *testing.M) {
 		log.Fatal(err)
 	}
 
-	exitCode = m.Run()
+	os.Exit(m.Run())
 }
 
 func cloneLinguist(linguistURL string) error {

diff --git a/common.go b/common.go
@@ -16,7 +16,7 @@ const OtherLanguage = ""
 // Strategy type fix the signature for the functions that can be used as a strategy.
 type Strategy func(filename string, content []byte, candidates []string) (languages []string)
 
-// DefaultStrategies is the strategies' sequence GetLanguage uses to detect languages.
+// DefaultStrategies is a sequence of strategies used by GetLanguage to detect languages.
 var DefaultStrategies = []Strategy{
 	GetLanguagesByModeline,
 	GetLanguagesByFilename,
@@ -396,12 +396,13 @@ func GetLanguagesByContent(filename string, content []byte, _ []string) []string
 	}
 
 	ext := strings.ToLower(filepath.Ext(filename))
-	fnMatcher, ok := data.ContentMatchers[ext]
+
+	heuristic, ok := data.ContentHeuristics[ext]
 	if !ok {
 		return nil
 	}
 
-	return fnMatcher(content)
+	return heuristic.Match(content)
 }
 
 // GetLanguagesByClassifier uses DefaultClassifier as a Classifier and returns a sorted slice of possible languages ordered by
@@ -454,9 +455,7 @@ func GetLanguageType(language string) (langType Type) {
 // GetLanguageByAlias returns either the language related to the given alias and ok set to true
 // or Otherlanguage and ok set to false if the alias is not recognized.
 func GetLanguageByAlias(alias string) (lang string, ok bool) {
-	a := strings.Split(alias, `,`)[0]
-	a = strings.ToLower(a)
-	lang, ok = data.LanguagesByAlias[a]
+	lang, ok = data.LanguageByAlias(alias)
 	if !ok {
 		lang = OtherLanguage
 	}

diff --git a/common_test.go b/common_test.go
@@ -11,6 +11,7 @@ import (
 	"gopkg.in/src-d/enry.v1/data"
 
 	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
 	"github.com/stretchr/testify/suite"
 )
 
@@ -19,9 +20,36 @@ const linguistClonedEnvVar = "ENRY_TEST_REPO"
 
 type EnryTestSuite struct {
 	suite.Suite
-	repoLinguist string
-	samplesDir   string
-	cloned       bool
+	tmpLinguist string
+	needToClone bool
+	samplesDir  string
+}
+
+func (s *EnryTestSuite) TestRegexpEdgeCases() {
+	var regexpEdgeCases = []struct {
+		lang     string
+		filename string
+	}{
+		{lang: "ActionScript", filename: "FooBar.as"},
+		{lang: "Forth", filename: "asm.fr"},
+		{lang: "X PixMap", filename: "cc-public_domain_mark_white.pm"},
+		//{lang: "SQL", filename: "drop_stuff.sql"}, // https://github.com/src-d/enry/issues/194
+		{lang: "Fstar", filename: "Hacl.Spec.Bignum.Fmul.fst"},
+		{lang: "C++", filename: "Types.h"},
+	}
+
+	for _, r := range regexpEdgeCases {
+		filename := fmt.Sprintf("%s/samples/%s/%s", s.tmpLinguist, r.lang, r.filename)
+
+		content, err := ioutil.ReadFile(filename)
+		require.NoError(s.T(), err)
+
+		lang := GetLanguage(r.filename, content)
+		s.T().Logf("File:%s, lang:%s", filename, lang)
+
+		expLang, _ := data.LanguageByAlias(r.lang)
+		require.EqualValues(s.T(), expLang, lang)
+	}
 }
 
 func Test_EnryTestSuite(t *testing.T) {
@@ -30,25 +58,24 @@ func Test_EnryTestSuite(t *testing.T) {
 
 func (s *EnryTestSuite) SetupSuite() {
 	var err error
-	s.repoLinguist = os.Getenv(linguistClonedEnvVar)
-	s.cloned = s.repoLinguist == ""
-	if s.cloned {
-		s.repoLinguist, err = ioutil.TempDir("", "linguist-")
-		assert.NoError(s.T(), err)
-	}
-
-	s.samplesDir = filepath.Join(s.repoLinguist, "samples")
-
-	if s.cloned {
-		cmd := exec.Command("git", "clone", linguistURL, s.repoLinguist)
+	s.tmpLinguist = os.Getenv(linguistClonedEnvVar)
+	s.needToClone = s.tmpLinguist == ""
+	if s.needToClone {
+		s.tmpLinguist, err = ioutil.TempDir("", "linguist-")
+		require.NoError(s.T(), err)
+		s.T().Logf("Cloning Linguist repo to '%s' as %s was not set\n",
+			s.tmpLinguist, linguistClonedEnvVar)
+		cmd := exec.Command("git", "clone", linguistURL, s.tmpLinguist)
 		err = cmd.Run()
-		assert.NoError(s.T(), err)
+		require.NoError(s.T(), err)
 	}
+	s.samplesDir = filepath.Join(s.tmpLinguist, "samples")
+	s.T().Logf("using samples from %s", s.samplesDir)
 
 	cwd, err := os.Getwd()
 	assert.NoError(s.T(), err)
 
-	err = os.Chdir(s.repoLinguist)
+	err = os.Chdir(s.tmpLinguist)
 	assert.NoError(s.T(), err)
 
 	cmd := exec.Command("git", "checkout", data.LinguistCommit)
@@ -60,8 +87,8 @@ func (s *EnryTestSuite) SetupSuite() {
 }
 
 func (s *EnryTestSuite) TearDownSuite() {
-	if s.cloned {
-		err := os.RemoveAll(s.repoLinguist)
+	if s.needToClone {
+		err := os.RemoveAll(s.tmpLinguist)
 		assert.NoError(s.T(), err)
 	}
 }
@@ -88,7 +115,7 @@ func (s *EnryTestSuite) TestGetLanguage() {
 }
 
 func (s *EnryTestSuite) TestGetLanguagesByModelineLinguist() {
-	var modelinesDir = filepath.Join(s.repoLinguist, "test/fixtures/Data/Modelines")
+	var modelinesDir = filepath.Join(s.tmpLinguist, "test/fixtures/Data/Modelines")
 
 	tests := []struct {
 		name       string
@@ -400,15 +427,16 @@ func (s *EnryTestSuite) TestGetLanguageByAlias() {
 func (s *EnryTestSuite) TestLinguistCorpus() {
 	const filenamesDir = "filenames"
 	var cornerCases = map[string]bool{
-		"hello.ms": true,
+		"drop_stuff.sql": true, // https://github.com/src-d/enry/issues/194
+		// .es and .ice fail heuristics parsing, but do not fail any tests
 	}
 
 	var total, failed, ok, other int
 	var expected string
 	filepath.Walk(s.samplesDir, func(path string, f os.FileInfo, err error) error {
 		if f.IsDir() {
 			if f.Name() != filenamesDir {
-				expected = f.Name()
+				expected, _ = data.LanguageByAlias(f.Name())
 			}
 
 			return nil
@@ -431,17 +459,16 @@ func (s *EnryTestSuite) TestLinguistCorpus() {
 		} else {
 			status = "failed"
 			failed++
-
 		}
 
 		if _, ok := cornerCases[filename]; ok {
-			fmt.Printf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
+			s.T().Logf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
 		} else {
 			assert.Equal(s.T(), expected, obtained, fmt.Sprintf("%s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status))
 		}
 
 		return nil
 	})
 
-	fmt.Printf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
+	s.T().Logf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
 }