Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not return empty lang. #205

Merged
merged 1 commit into from
Mar 15, 2019
Merged

Do not return empty lang. #205

merged 1 commit into from
Mar 15, 2019

Conversation

kuba--
Copy link

@kuba-- kuba-- commented Mar 14, 2019

Signed-off-by: kuba-- [email protected]

So far for .h files the GetLanguages function return an empty slice (even if internally list of candidates was C, C++, Objective-C.
In my opinion it's better to return any potential candidate than nothing.

Copy link
Collaborator

@vmarkovtsev vmarkovtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the ideal API for language detection would be to always return a list of detected languages sorted by confidence, together with those confidences. The same way go-license-detector works. Then we wouldn't have such problems at all.

@kuba--
Copy link
Author

kuba-- commented Mar 14, 2019

If we always return all candidates it will have an impact on performance. So, I assume we start with the most-likely strategies and at the end we check extensions, aliases, etc.
Moreover, if you have a bunch of candidates with the same probability which one would you pick?
To make the results consistent and testable it will require a stable sort what also will have a bad impact on performance.

@vmarkovtsev
Copy link
Collaborator

If there are several candidates with the same confidence, I will never trust the library to make any decisions for myself 😄

Anyway, I am not proposing to run all the strategies just for the sake of returning a comprehensive list. I am proposing to return all the available information every time, nothing more. That is, if some strategy worked, it had priority over the rest and it output a single item - we return just that item with confidence 1.00.

In other words, it would be better to make the user run firstLanguage at their own logic and risk.

It's better to return any potential candidate than nothing.

Signed-off-by: kuba-- <[email protected]>
@kuba--
Copy link
Author

kuba-- commented Mar 14, 2019

So, this is basically what my PR does. I modified Classifier, so if you already have some candidates and probability assigned to them, then Classify returns a sorted slice.
Right now, Classify expects a content, so even if you have a list of candidates, but no content then you'll get an empty slice.

@bzz bzz self-requested a review March 15, 2019 08:48
Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for a patch! From the first glance, that looks good.

As we do run a number of classifiers that narrow down the options and some(most) of them do not have a notion of confidence- we cann't always do what I agree would be an ideal and what @vmarkovtsev suggests.

Though, as this changes the behavior of the classifier and it's expected output, it will take a bit of time for me verify how this affects clients and accuracy.

If that is not very urgent - I will get back to this asap, right after nailing down a bit more pressing issue with accuracy from #204

@kuba--
Copy link
Author

kuba-- commented Mar 15, 2019

@bzz - in my opinion it will only change the behaviour for cases where classifier returned an empty slice. It's not a urgent case, but the gitbase issue src-d/gitbase#731
depends on it.

@bzz bzz added the bug label Mar 15, 2019
Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuba-- Classify, as a strategy, is a part of a low-level API in enry, designed to mimic linguist and that is used by a high-level API in enry.Get*. I must confess, our test coverage is not ideal and does not allow to access the proposed fix automatically (although it should).

As far as I can tell, if there is a bug in GetLanguages (that's what I got from src-d/gitbase#731 (comment)) and I would advice to fix it in GetLanguages by returning a list of candidates instead of empty slice of languages.

The reason is that such fix, without touching a low-level part, would be much easier to asses and verify correctness by inspection.

@kuba--
Copy link
Author

kuba-- commented Mar 15, 2019

@bzz - this is how I started, but many tests failed. Moreover returning candidates if languages are empty is unpredictable (every time you'll get the slice in different order), because internally you use a map as a counter and rewrite candidates from a map (again) to the slice, so ordering is not guaranteed.
If you want to make it consistent you will have to:

  • dedup candidates and keep frequency counters somewhere
  • sort by counters.

But this is exactly what classify does (besides it also reads content and add probability based on read tokens). So, it will duplicate classify code in 80% (impact on performance) to return the same result.
Last but not least - most of tools (and for sure tests) rely on GetLanguages so if there is any risk that classify will break compatibility, then it's the same risk if we return something unexpected from GetLanguages

@@ -339,7 +341,7 @@ func (s *EnryTestSuite) TestGetLanguagesBySpecificClassifier() {
{name: "TestGetLanguagesByClassifier_4", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"python", "ruby", "c++"}, classifier: DefaultClassifier, expected: "C++"},
{name: "TestGetLanguagesByClassifier_5", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"ruby"}, classifier: DefaultClassifier, expected: "Ruby"},
{name: "TestGetLanguagesByClassifier_6", filename: filepath.Join(s.samplesDir, "Python/django-models-base.py"), candidates: []string{"python", "ruby", "c", "c++"}, classifier: DefaultClassifier, expected: "Python"},
{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: OtherLanguage},
{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: "XML"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not make a lot sense, doesn't it? :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, it doesn't. So the question is why classifier for empty list of candidates and empty filename returned a non empty slice.

If you take a look on classify code the only difference between what was and what is is the case where the blob is nil/empty.
Before we returned an empty slice (even if we had candidates), right now we just don't count extra probability based on tokens. That's it. So the issue with empty candidates and empty file is an another thing.
We can think about setting some probability threshold (and everything below will get ""), because I believe in this case XML had very low probability.

Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuba-- thank you for explanation, both points make sense and my be very well worth mentioning in initial PR description ;)

I guess what I was trying to say that instead of keep testing for a known bug with unknown impact on accuracy (outside of the small sample) I would prefer a proper fix of what very much looks like a bug in GetLanguages :) but that of course could also be done outside of this PR

#207 was created to track that.

Thanks again for the patch!

@bzz bzz merged commit 6526da7 into src-d:master Mar 15, 2019
@kuba-- kuba-- deleted the fix-langs branch March 15, 2019 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants