Breakdown of django/django is different from Linguist #204
Comments
Thank you for the report; the numbers do indeed look a bit off! It is documented in https://github.com/src-d/enry#divergences-from-linguist that some of the strategies do not produce the same results as Linguist. So far I'm not aware of any attempts to verify the accuracy of enry that go beyond the Here are the results of reproducing this comparison locally with
Good start on this issue would consist of:
I also checked v1.6.8 and this does not seem to be a regression, so it most probably has been like that for a while. Hope this helps, and I will be looking into this more next week. |
Django uses So, at first glance, it seems that in Linguist the content classifier was able to differentiate those, but in enry this may be the second case of misclassification listed in https://github.com/src-d/enry#divergences-from-linguist, most probably due to a cause similar to #194. To verify that 100%, one would need to run My wild guess is that fixing #193 would get rid of all such cases, so I'm going to allocate some time next week to fix it. |
OK, the above explanation, although very probable, is not 100% certain: it may also be another bug in On a dir with 2 files
One is reported as
But if the whole dir is analyzed, it becomes
Trying to detect language for |
So, changing our CLI to:
should give the same CLI results as Linguist. |
@bzz But how does it work right now if it doesn't call |
As one specific application of the enry library, the enry CLI, instead of relying on the high-level API of Some API calls do not need the content of the file, e.g. I would say that the CLI should mimic the GitHub Linguist CLI output/logic, at least in the default configuration, and right now it does not. I'm going to submit a change that does this soon. |
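To illustrate why a caller that uses only filename-based checks can disagree with one that also looks at file content, here is a minimal, self-contained sketch. It is not enry's implementation: the `strategy` type, `byExtension`, `byContent`, and `detect` are hypothetical names, and the `.h` ambiguity is just a convenient toy case.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// strategy narrows a list of candidate languages for one file.
// Hypothetical type for illustration only.
type strategy func(filename string, content []byte, candidates []string) []string

// byExtension is a toy filename-only strategy: it never reads the content.
func byExtension(filename string, _ []byte, _ []string) []string {
	switch filepath.Ext(filename) {
	case ".h":
		return []string{"C", "C++", "Objective-C"}
	case ".py":
		return []string{"Python"}
	}
	return nil
}

// byContent is a toy content strategy: it picks among the existing candidates.
func byContent(_ string, content []byte, candidates []string) []string {
	for _, c := range candidates {
		if c == "Objective-C" && bytes.Contains(content, []byte("@interface")) {
			return []string{c}
		}
		if c == "C++" && bytes.Contains(content, []byte("class ")) {
			return []string{c}
		}
	}
	return candidates
}

// detect runs the strategies in order and stops once one candidate is left.
func detect(filename string, content []byte, chain ...strategy) string {
	var candidates []string
	for _, s := range chain {
		if next := s(filename, content, candidates); len(next) > 0 {
			candidates = next
		}
		if len(candidates) == 1 {
			return candidates[0]
		}
	}
	if len(candidates) > 0 {
		return candidates[0] // still ambiguous: fall back to the first candidate
	}
	return ""
}

func main() {
	src := []byte("@interface Foo : NSObject\n@end\n")
	fmt.Println(detect("foo.h", src, byExtension))            // filename only
	fmt.Println(detect("foo.h", src, byExtension, byContent)) // filename, then content
}
```

In this toy setup the filename-only call settles on the first ambiguous candidate, while the full chain uses the content to disambiguate, which is the kind of gap a CLI can show when it skips the content-aware path.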
This includes the following main changes:
- default: print only Programming and Markup types, as Linguist does
- `-prog` option replaced with `-all`, to allow for the previous behavior
- always use GetLanguage as the main source of truth

That fixes src-d#204, and perf will be restored under src-d#212.

Signed-off-by: Alexander Bezzubov <[email protected]>
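The effect of the default type filter on a reported breakdown can be sketched as follows. This is only an illustration under stated assumptions, not the CLI's code: `langType`, `typeOf`, and `breakdown` are hypothetical, the language table is hand-written, and shares are computed by file count for simplicity.

```go
package main

import (
	"fmt"
	"sort"
)

// langType is a hypothetical stand-in for a language category.
type langType int

const (
	data langType = iota
	programming
	markup
	prose
)

// typeOf is a tiny hand-written table, not real detection data.
// Languages missing from it default to the zero value (data).
var typeOf = map[string]langType{
	"Python":   programming,
	"HTML":     markup,
	"JSON":     data,
	"Markdown": prose,
}

// breakdown returns the percentage share of each language by file count.
// With all=false, only programming and markup languages are counted.
func breakdown(files map[string]string, all bool) map[string]float64 {
	counts := map[string]int{}
	total := 0
	for _, lang := range files {
		t := typeOf[lang]
		if !all && t != programming && t != markup {
			continue
		}
		counts[lang]++
		total++
	}
	out := make(map[string]float64, len(counts))
	for lang, n := range counts {
		out[lang] = 100 * float64(n) / float64(total)
	}
	return out
}

func main() {
	// Map of file name to an already detected language (made-up input).
	files := map[string]string{
		"a.py": "Python", "b.py": "Python", "c.py": "Python",
		"index.html": "HTML", "pkg.json": "JSON", "README.md": "Markdown",
	}
	for _, all := range []bool{false, true} {
		res := breakdown(files, all)
		langs := make([]string, 0, len(res))
		for l := range res {
			langs = append(langs, l)
		}
		sort.Strings(langs) // deterministic output order
		fmt.Printf("all=%v\n", all)
		for _, l := range langs {
			fmt.Printf("  %.2f%%\t%s\n", res[l], l)
		}
	}
}
```

Because the filter changes the denominator, the same repository yields different percentages with and without it, so two tools only produce comparable numbers when they apply the same filter.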
As soon as #214 is merged, I'm going to make a new release in which the CLI outputs will be the same, which should help avoid any such misunderstanding in the future: #214 (comment) (no language detection logic changes). |
While the release is still blocked by CI, the latest master results are
|
As @Guillemdb found out while doing the ML challenge for source{d}, enry reports wrong results for django/django.
Ground truth:
enry: