use language threshold to compute zim language metadata #230
Codecov Report
Coverage diff, main vs. #230:
             main     #230     +/-
Coverage     5.00%    4.96%    -0.04%
Files        8        8
Lines        1100     1108     +8
Branches     239      242      +3
Hits         55       55
Misses       1044     1052     +8
Partials     1        1
Thanks a lot! Did you test this change on a real case?
The wildlife topic might be interesting: it currently reports too many languages (according to the new definition in the issue): https://library.kiwix.org/raw/ted_mul_wildlife_2024-10/meta/Language
Corresponding Zimfarm recipe is at https://farm.openzim.org/recipes/ted_topic_wildlife (for inspiration around settings).
Yeah, I tested it on the example. However, that contained a lot of entries with one language. Will run it against the wildlife topic and let you know.
@benoit74 , I think the way the
The one where everything is 1 is the existing implementation. It does not sort the elements before passing to What do you think?
Um, is there a way to inspect the metadata of a ZIM from the CLI? I tried the
Since you probably have python and the python-libzim somewhere locally, you can open the ZIM with python-libzim and print metadata:
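A minimal sketch of that approach, assuming python-libzim 3.x, where `libzim.reader.Archive` exposes `metadata_keys` and `get_metadata()` (which returns bytes). The helper names `read_text_metadata` and `print_metadata` are illustrative only and are not part of the library:

```python
import pathlib


def read_text_metadata(archive):
    """Return {key: text} for each metadata entry of an opened archive.

    `archive` is expected to behave like libzim.reader.Archive: it has a
    `metadata_keys` list and a `get_metadata(key)` method returning bytes.
    """
    result = {}
    for key in archive.metadata_keys:
        raw = archive.get_metadata(key)
        try:
            result[key] = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Binary entries (e.g. illustrations) are summarised, not printed.
            result[key] = f"<{len(raw)} bytes of binary data>"
    return result


def print_metadata(zim_path):
    """Open a ZIM file with python-libzim and print its metadata."""
    # Imported lazily so the helper above stays usable without libzim.
    from libzim.reader import Archive

    archive = Archive(pathlib.Path(zim_path))
    for key, value in read_text_metadata(archive).items():
        print(f"{key}: {value}")
```

Calling `print_metadata("path/to/file.zim")` (placeholder path) should then list entries such as `Language` alongside the other metadata keys.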
Thanks a lot, LGTM!
Rationale
Consider how frequently each language appears across videos when computing the ZIM Language metadata. This resolves #212.
Changes
--language-threshold (float between 0 and 1) for setting the minimum threshold.
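
To illustrate the idea only (this is not the scraper's actual code), a threshold of this kind could be applied roughly as follows. The function name `languages_above_threshold`, the input shape (one collection of language codes per video), and the choice to measure frequency as a share of videos are all assumptions made for the sketch:

```python
from collections import Counter


def languages_above_threshold(video_languages, threshold):
    """Return language codes present in at least `threshold` of the videos.

    video_languages: one iterable of language codes per video (assumed shape).
    threshold: float between 0 and 1.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    total = len(video_languages)
    if total == 0:
        return []
    counts = Counter()
    for languages in video_languages:
        # Count each language at most once per video.
        counts.update(set(languages))
    kept = [lang for lang, n in counts.items() if n / total >= threshold]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda lang: (-counts[lang], lang))
```

With a threshold of 0 every language seen in any video is kept, while higher values drop languages that only appear in a small share of videos.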