Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breakdown of django/django is different from Linguist #204

Closed
vmarkovtsev opened this issue Mar 6, 2019 · 8 comments · Fixed by #214
Closed

Breakdown of django/django is different from Linguist #204

vmarkovtsev opened this issue Mar 6, 2019 · 8 comments · Fixed by #214
Labels
Milestone

Comments

@vmarkovtsev
Copy link
Collaborator

As @Guillemdb found out while doing the ML challenge for source{d}, enry reports wrong results for django/django. Ground truth:

image

enry:

image

@bzz
Copy link
Contributor

bzz commented Mar 6, 2019

Thank you for the report, indeed numbers look a bit off! It's documented that some of the strategies do not produce the same results in https://github.com/src-d/enry#divergences-from-linguist

So far I'm not aware of the attempts to verify the accuracy of enry that goes beyond the samples dir of Linguist (as part of the tests, we make sure each strategy agrees with Linguist on every file there) and am looking forward to measuring and improving it under this issue!

Here are the results of reproducing this comparison locally with enry v1.7.7 and linguist 7.2.0 CLI

mkdir /tmp/linguist-django; cd /tmp/linguist-django
git clone --dept 1 [email protected]:django/django.git
cd -

enry -prog /tmp/linguist-django/django
67.85%	Python
31.04%	Modelica
1.02%	JavaScript
0.03%	Makefile
0.03%	Shell
0.03%	Smarty

./bin/github-linguist /tmp/linguist-django/django
95.80%  Python
1.90%   JavaScript
1.66%   HTML
0.63%   CSS
0.01%   Shell
0.00%   Smarty
0.00%   Makefile

Good start on this issue would consist of:

  • shell/shell/go script for CI to compare order of magnitude of the results of this breakdown (high-level)
  • dig deeper in CLI code to make sure "apples" are compared to "apples"
  • check if Modelica get's miss-classified somehow
  • integration test that over the same big codebase file by file and compare results with linguist "gold" results (low-level)

I also checked v1.6.8 and this does not seem to be a regression, so it most probably has been like that for a while.

Hope this helps, and I will be looking into this more next week.

@bzz bzz changed the title Wrong results on django/django Breakdown of django/django is different from Linguist Mar 6, 2019
@bzz
Copy link
Contributor

bzz commented Mar 15, 2019

Django uses .mo files for localization translation that are known as Modelica to linguist (even if they are not).

So, from a first glance - it seems like in linguist a content classifier was able to differentiate those but on enry it may be the second case of miss-classification from https://github.com/src-d/enry#divergences-from-linguist and most probably due to similar cause as #194

To 100% verify that one would need to run django.mo though linguist and enry with instrumentation from #188 enabled and create an issue Bayesian classifier cann't distinguish "Modelica" vs "XXX".

Wild guess is that fixing #193 would allow to get rid of all such cases, so I'm going to allocate some time next week and fix.

@bzz bzz added this to the v1.8.0 milestone Mar 15, 2019
@bzz
Copy link
Contributor

bzz commented Mar 25, 2019

Ok, the above explanation is although very probable, is not 100% plausible - it may also be another bug in enry cli

On a dir with 2 files

find /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES
 /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.mo
 /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.po

One is reported as binary with 30 sloc (wc -l is sure it's 29) and another as Gettext Catalog.

enry /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.mo
django.mo: 30 lines (30 sloc)
  type:      Binary
  mime_type: text/plain
  language:

enry -prog /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.po
django.po: 1301 lines (975 sloc)
  type:      Text
  mime_type: text/plain
  language:  Gettext Catalog

But if the whole dir is analyzed, it becomes

enry -prog /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES
50.00%	Gettext Catalog
50.00%	Modelica

Trying to detect language for django.mo with linguist returns nil what is consistent with individual file-level predictions.

@bzz
Copy link
Contributor

bzz commented Mar 25, 2019

So, by changing our CLI to:

  • call GetLanguage on every file, instead of what we do now - would remove wired Modelica
  • apply result filtering, equivalent to linguist
  • and not using -prog, so HTML and CSS is not excluded

should allow to get same CLI results as in Linguist.
Will submit a change a bit later.

@vmarkovtsev
Copy link
Collaborator Author

@bzz But how does it work right now if it doesn't call GetLanguage on every file?!

@bzz
Copy link
Contributor

bzz commented Mar 29, 2019

As one specific application using enry library, enry CLI instead of relying on high-level API of GetLanguages tries to avoid reading every file and use more low-level API in a smart way.

Some API calls do not need to have a content of the file, e.g GetLanguageByExtension and others always require it e.g GetLanguage thus if a first one already returns a plausible candidate - the file does not need to be read.

I would say that CLI should mimic github linguist CLI output/logic, at least in default configuration, and now it does not. Going to submit a change that does this soon.

bzz added a commit to bzz/enry that referenced this issue Apr 3, 2019
This includes next main changes:

 - default: print only Programming and Markup types
   as Linguist does
 - `-prog` option replaced with `-all`, to allow for
   previous behavior
 - always use GetLanguage as main source of truth
   that fixes src-d#204 and perf will be restored under src-d#212

Signed-off-by: Alexander Bezzubov <[email protected]>
@bzz
Copy link
Contributor

bzz commented Apr 4, 2019

As soon as #214 merged I'm going to make a new release where the CLI outputs will be the same, so it's going to help avoiding any such miss-understanding in the future #214 (comment) (no language detection logic changes)

@bzz bzz closed this as completed in #214 Apr 9, 2019
@bzz
Copy link
Contributor

bzz commented Apr 9, 2019

While a release is still blocked by CI, latest master results are

./enry /tmp/linguist-django/django
95.87%	Python
1.85%	JavaScript
1.65%	HTML
0.63%	CSS
0.01%	Shell
0.00%	Smarty
0.00%	Makefile

./bin/github-linguist /tmp/linguist-django/django
95.86%  Python
1.85%   JavaScript
1.65%   HTML
0.63%   CSS
0.01%   Shell
0.00%   Smarty
0.00%   Makefile

@bzz bzz modified the milestones: v2.1.x, v1.7.3 Apr 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants