-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not return empty lang. #205
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the ideal API for language detection would be to always return a list of detected languages sorted by confidence, together with those confidences. The same way go-license-detector works. Then we wouldn't have such problems at all.
If we always return all candidates it will have an impact on performance. So, I assume we start with the most-likely strategies and at the end we check extensions, aliases, etc. |
If there are several candidates with the same confidence, I will never trust the library to make any decisions for myself 😄 Anyway, I am not proposing to run all the strategies just for the sake of returning a comprehensive list. I am proposing to return all the available information every time, nothing more. That is, if some strategy worked, it had priority over the rest and it output a single item - we return just that item with confidence 1.00. In other words, it would be better to make the user run |
It's better to return any potential candidate than nothing. Signed-off-by: kuba-- <[email protected]>
So, this is basically what my PR does. I modified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for a patch! From the first glance, that looks good.
As we do run a number of classifiers that narrow down the options and some(most) of them do not have a notion of confidence- we cann't always do what I agree would be an ideal and what @vmarkovtsev suggests.
Though, as this changes the behavior of the classifier and it's expected output, it will take a bit of time for me verify how this affects clients and accuracy.
If that is not very urgent - I will get back to this asap, right after nailing down a bit more pressing issue with accuracy from #204
@bzz - in my opinion it will only change the behaviour for cases where classifier returned an empty slice. It's not a urgent case, but the gitbase issue src-d/gitbase#731 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuba-- Classify, as a strategy, is a part of a low-level API in enry, designed to mimic linguist and that is used by a high-level API in enry.Get*
. I must confess, our test coverage is not ideal and does not allow to access the proposed fix automatically (although it should).
As far as I can tell, if there is a bug in GetLanguages
(that's what I got from src-d/gitbase#731 (comment)) and I would advice to fix it in GetLanguages
by returning a list of candidates instead of empty slice of languages.
The reason is that such fix, without touching a low-level part, would be much easier to asses and verify correctness by inspection.
@bzz - this is how I started, but many tests failed. Moreover returning
But this is exactly what |
@@ -339,7 +341,7 @@ func (s *EnryTestSuite) TestGetLanguagesBySpecificClassifier() { | |||
{name: "TestGetLanguagesByClassifier_4", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"python", "ruby", "c++"}, classifier: DefaultClassifier, expected: "C++"}, | |||
{name: "TestGetLanguagesByClassifier_5", filename: filepath.Join(s.samplesDir, "C/blob.c"), candidates: []string{"ruby"}, classifier: DefaultClassifier, expected: "Ruby"}, | |||
{name: "TestGetLanguagesByClassifier_6", filename: filepath.Join(s.samplesDir, "Python/django-models-base.py"), candidates: []string{"python", "ruby", "c", "c++"}, classifier: DefaultClassifier, expected: "Python"}, | |||
{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: OtherLanguage}, | |||
{name: "TestGetLanguagesByClassifier_7", filename: os.DevNull, candidates: nil, classifier: DefaultClassifier, expected: "XML"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does not make a lot sense, doesn't it? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, it doesn't. So the question is why classifier for empty list of candidates and empty filename returned a non empty slice.
If you take a look on classify
code the only difference between what was and what is is the case where the blob
is nil/empty.
Before we returned an empty slice (even if we had candidates), right now we just don't count extra probability based on tokens. That's it. So the issue with empty candidates and empty file is an another thing.
We can think about setting some probability threshold (and everything below will get ""
), because I believe in this case XML had very low probability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuba-- thank you for explanation, both points make sense and my be very well worth mentioning in initial PR description ;)
I guess what I was trying to say that instead of keep testing for a known bug with unknown impact on accuracy (outside of the small sample) I would prefer a proper fix of what very much looks like a bug in GetLanguages
:) but that of course could also be done outside of this PR
#207 was created to track that.
Thanks again for the patch!
Signed-off-by: kuba-- [email protected]
So far for
.h
files theGetLanguages
function return an empty slice (even if internally list of candidates wasC, C++, Objective-C
.In my opinion it's better to return any potential candidate than nothing.