Do not return empty lang. #205

kuba-- · 2019-03-14T12:29:17Z

So far for .h files the GetLanguages function return an empty slice (even if internally list of candidates was C, C++, Objective-C.
In my opinion it's better to return any potential candidate than nothing.

vmarkovtsev

I think that the ideal API for language detection would be to always return a list of detected languages sorted by confidence, together with those confidences. The same way go-license-detector works. Then we wouldn't have such problems at all.

kuba-- · 2019-03-14T12:49:40Z

If we always return all candidates it will have an impact on performance. So, I assume we start with the most-likely strategies and at the end we check extensions, aliases, etc.
Moreover, if you have a bunch of candidates with the same probability which one would you pick?
To make the results consistent and testable it will require a stable sort what also will have a bad impact on performance.

vmarkovtsev · 2019-03-14T12:57:13Z

If there are several candidates with the same confidence, I will never trust the library to make any decisions for myself 😄

Anyway, I am not proposing to run all the strategies just for the sake of returning a comprehensive list. I am proposing to return all the available information every time, nothing more. That is, if some strategy worked, it had priority over the rest and it output a single item - we return just that item with confidence 1.00.

In other words, it would be better to make the user run firstLanguage at their own logic and risk.

It's better to return any potential candidate than nothing. Signed-off-by: kuba-- <[email protected]>

kuba-- · 2019-03-14T13:11:42Z

So, this is basically what my PR does. I modified Classifier, so if you already have some candidates and probability assigned to them, then Classify returns a sorted slice.
Right now, Classify expects a content, so even if you have a list of candidates, but no content then you'll get an empty slice.

bzz

Thank you for a patch! From the first glance, that looks good.

As we do run a number of classifiers that narrow down the options and some(most) of them do not have a notion of confidence- we cann't always do what I agree would be an ideal and what @vmarkovtsev suggests.

Though, as this changes the behavior of the classifier and it's expected output, it will take a bit of time for me verify how this affects clients and accuracy.

If that is not very urgent - I will get back to this asap, right after nailing down a bit more pressing issue with accuracy from #204

kuba-- · 2019-03-15T13:05:14Z

@bzz - in my opinion it will only change the behaviour for cases where classifier returned an empty slice. It's not a urgent case, but the gitbase issue src-d/gitbase#731
depends on it.

bzz

@kuba-- Classify, as a strategy, is a part of a low-level API in enry, designed to mimic linguist and that is used by a high-level API in enry.Get*. I must confess, our test coverage is not ideal and does not allow to access the proposed fix automatically (although it should).

As far as I can tell, if there is a bug in GetLanguages (that's what I got from src-d/gitbase#731 (comment)) and I would advice to fix it in GetLanguages by returning a list of candidates instead of empty slice of languages.

The reason is that such fix, without touching a low-level part, would be much easier to asses and verify correctness by inspection.

kuba-- · 2019-03-15T16:02:48Z

@bzz - this is how I started, but many tests failed. Moreover returning candidates if languages are empty is unpredictable (every time you'll get the slice in different order), because internally you use a map as a counter and rewrite candidates from a map (again) to the slice, so ordering is not guaranteed.
If you want to make it consistent you will have to:

dedup candidates and keep frequency counters somewhere
sort by counters.

But this is exactly what classify does (besides it also reads content and add probability based on read tokens). So, it will duplicate classify code in 80% (impact on performance) to return the same result.
Last but not least - most of tools (and for sure tests) rely on GetLanguages so if there is any risk that classify will break compatibility, then it's the same risk if we return something unexpected from GetLanguages

bzz · 2019-03-15T17:47:56Z

common_test.go

@@ -339,7 +341,7 @@ func (s *EnryTestSuite) TestGetLanguagesBySpecificClassifier() {
 		{name: "TestGetLanguagesByClassifier_4", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"python", "ruby", "c++"}, classifier: DefaultClassifier, expected: "C++"},
 		{name: "TestGetLanguagesByClassifier_5", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"ruby"}, classifier: DefaultClassifier, expected: "Ruby"},
 		{name: "TestGetLanguagesByClassifier_6", filename: filepath.Join(s.samplesDir, "Python/django-models-base.py"), candidates: []string{"python", "ruby", "c", "c++"}, classifier: DefaultClassifier, expected: "Python"},
-		{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: OtherLanguage},
+		{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: "XML"},


That does not make a lot sense, doesn't it? :)

yep, it doesn't. So the question is why classifier for empty list of candidates and empty filename returned a non empty slice.

If you take a look on classify code the only difference between what was and what is is the case where the blob is nil/empty.
Before we returned an empty slice (even if we had candidates), right now we just don't count extra probability based on tokens. That's it. So the issue with empty candidates and empty file is an another thing.
We can think about setting some probability threshold (and everything below will get ""), because I believe in this case XML had very low probability.

bzz

@kuba-- thank you for explanation, both points make sense and my be very well worth mentioning in initial PR description ;)

I guess what I was trying to say that instead of keep testing for a known bug with unknown impact on accuracy (outside of the small sample) I would prefer a proper fix of what very much looks like a bug in GetLanguages :) but that of course could also be done outside of this PR

#207 was created to track that.

Thanks again for the patch!

kuba-- mentioned this pull request Mar 14, 2019

Update enry dependency to 1.7.2 src-d/gitbase#731

Merged

vmarkovtsev approved these changes Mar 14, 2019

View reviewed changes

Do not return empty lang.

5adfee5

It's better to return any potential candidate than nothing. Signed-off-by: kuba-- <[email protected]>

bzz self-requested a review March 15, 2019 08:48

bzz reviewed Mar 15, 2019

View reviewed changes

bzz assigned kuba-- Mar 15, 2019

bzz added the bug label Mar 15, 2019

bzz suggested changes Mar 15, 2019

View reviewed changes

bzz reviewed Mar 15, 2019

View reviewed changes

bzz mentioned this pull request Mar 15, 2019

GetLanguages never return candidates if single language was not found #207

Closed

bzz approved these changes Mar 15, 2019

View reviewed changes

bzz merged commit 6526da7 into src-d:master Mar 15, 2019

kuba-- deleted the fix-langs branch March 15, 2019 17:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not return empty lang. #205

Do not return empty lang. #205

kuba-- commented Mar 14, 2019

vmarkovtsev left a comment

kuba-- commented Mar 14, 2019

vmarkovtsev commented Mar 14, 2019

kuba-- commented Mar 14, 2019

bzz left a comment

kuba-- commented Mar 15, 2019

bzz left a comment

kuba-- commented Mar 15, 2019 •

edited

Loading

bzz Mar 15, 2019

kuba-- Mar 15, 2019

bzz left a comment

Do not return empty lang. #205

Do not return empty lang. #205

Conversation

kuba-- commented Mar 14, 2019

vmarkovtsev left a comment

Choose a reason for hiding this comment

kuba-- commented Mar 14, 2019

vmarkovtsev commented Mar 14, 2019

kuba-- commented Mar 14, 2019

bzz left a comment

Choose a reason for hiding this comment

kuba-- commented Mar 15, 2019

bzz left a comment

Choose a reason for hiding this comment

kuba-- commented Mar 15, 2019 • edited Loading

bzz Mar 15, 2019

Choose a reason for hiding this comment

kuba-- Mar 15, 2019

Choose a reason for hiding this comment

bzz left a comment

Choose a reason for hiding this comment

kuba-- commented Mar 15, 2019 •

edited

Loading