use language threshold to compute zim language metadata #230

elfkuzco · 2024-10-24T13:41:22Z

Rationale

Take into account how frequently a language appears across videos when computing the ZIM Language metadata. This resolves #212.

Changes

Add a new CLI argument --language-threshold (float between 0 and 1) that sets the minimum appearance ratio.
When building the Language metadata, skip languages whose appearance ratio across all videos is not greater than the specified threshold.
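
The two changes above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `threshold_type` and `filter_languages`, the sample `videos` list, and the choice to count each language at most once per video are all assumptions made for the example. The default of 0.5 matches the default mentioned later in this thread.

```python
import argparse
from collections import Counter


def threshold_type(value: str) -> float:
    """Parse a CLI value and reject anything outside [0, 1] (hypothetical helper)."""
    number = float(value)
    if not 0 <= number <= 1:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 1")
    return number


def filter_languages(video_languages, threshold):
    """Keep languages whose appearance ratio across videos is greater than threshold.

    video_languages is a list with one list of language codes per video
    (hypothetical input shape, chosen for this sketch).
    """
    total = len(video_languages)
    if total == 0:
        return []
    # Count each language once per video, even if a video lists it twice.
    counts = Counter(lang for langs in video_languages for lang in set(langs))
    return sorted(lang for lang, count in counts.items() if count / total > threshold)


parser = argparse.ArgumentParser()
parser.add_argument("--language-threshold", type=threshold_type, default=0.5)
# An explicit argument list keeps the sketch runnable outside a real command line.
args = parser.parse_args(["--language-threshold", "0.5"])

# Made-up sample data: four videos and the languages each one appears in.
videos = [["en", "fr"], ["en", "es"], ["en", "fr"], ["en"]]
kept = filter_languages(videos, args.language_threshold)
print(kept)
```

With this sample data, "fr" appears in exactly half of the videos, so it is skipped at a threshold of 0.5 because the comparison is strict ("not greater than" means skipped).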

codecov · 2024-10-24T13:42:25Z

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 4.96%. Comparing base (4acb963) to head (66c6fe4).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/ted2zim/scraper.py	0.00%	5 Missing ⚠️
src/ted2zim/entrypoint.py	0.00%	3 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##            main    #230      +/-   ##
========================================
- Coverage   5.00%   4.96%   -0.04%     
========================================
  Files          8       8              
  Lines       1100    1108       +8     
  Branches     239     242       +3     
========================================
  Hits          55      55              
- Misses      1044    1052       +8     
  Partials       1       1

benoit74

Thanks a lot! Did you test this change on a real case?

The wildlife topic might be interesting: it currently reports too many languages (according to the new definition in the issue): https://library.kiwix.org/raw/ted_mul_wildlife_2024-10/meta/Language

Corresponding Zimfarm recipe is at https://farm.openzim.org/recipes/ted_topic_wildlife (for inspiration around settings).

elfkuzco · 2024-10-24T15:50:22Z

Yeah, I tested it on the example. However, that contained a lot of entries with one language. Will run it against the wildlife topic and let you know.

elfkuzco · 2024-10-24T16:44:59Z

@benoit74, I think the way the _lang_counts are computed might not be right, because it uses groupby on unsorted data. I ran the script with a couple of added print statements, and here is what I observed:

subtitle_lang_counts sorted before group:  {'ar': 4, 'de': 1, 'en': 4, 'es': 4, 'fa': 1, 'fr': 2, 'he': 2, 'hu': 1, 'id': 1, 'it': 1, 'ko': 2, 'pt': 2, 'pt-br': 1, 'ro': 1, 'sv': 1, 'th': 1, 'tr': 2, 'vi': 3, 'zh-cn': 2, 'zh-tw': 3}
subtitle_lang_counts unsorted before group:  {'en': 1, 'es': 1, 'ar': 1, 'fr': 1, 'de': 1, 'th': 1, 'vi': 1, 'ko': 1, 'zh-tw': 1, 'ro': 1, 'sv': 1, 'zh-cn': 1, 'it': 1, 'pt-br': 1, 'he': 1, 'fa': 1, 'hu': 1, 'id': 1, 'tr': 1, 'pt': 1}

The output where every count is 1 comes from the existing implementation, which does not sort the elements before passing them to groupby. The output with counts greater than 1 comes from a change I applied just now, which sorts them first. I looked up groupby and it is not equivalent to SQL's GROUP BY statement. Rather, it behaves like the Linux uniq command: it starts a new group every time the value of the subtitle language changes. That explains why every subtitle language ends up with a count of 1.
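
A minimal sketch of that behaviour, assuming the counts are built with `itertools.groupby` and collected into a dict. The language list and the `count_runs` helper are made up for illustration and are not the scraper's actual code.

```python
from itertools import groupby

# Hypothetical subtitle languages gathered across several videos, in arrival order.
langs = ["en", "es", "ar", "en", "es", "en"]


def count_runs(values):
    """Count the length of each consecutive run, as groupby sees it.

    When a key shows up in several separate runs, the dict keeps only
    the length of the last run for that key.
    """
    return {key: len(list(group)) for key, group in groupby(values)}


unsorted_counts = count_runs(langs)        # every run has length 1
sorted_counts = count_runs(sorted(langs))  # equal values are now adjacent

print(unsorted_counts)  # {'en': 1, 'es': 1, 'ar': 1}
print(sorted_counts)    # {'ar': 1, 'en': 3, 'es': 2}
```

Sorting first makes equal values adjacent, so each language forms a single run. `collections.Counter(langs)` would give the same totals without needing a sort, if changing the approach is an option.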

What do you think?

elfkuzco · 2024-10-24T19:01:52Z

Um, is there a way to inspect the metadata of a ZIM from the CLI? I tried the zimdump command but it doesn't show anything. So far, I have only added a debug statement that prints the self.zim_languages attribute, and that is how I can infer that the filter works.

benoit74 · 2024-10-24T20:17:52Z

Since you probably have Python and python-libzim available locally, you can open the ZIM with python-libzim and print the metadata:

from libzim import Archive
zim = Archive("test.zim")
item = zim.get_metadata_item("Language")
print(bytes(item.content).decode())

elfkuzco · 2024-10-25T09:37:23Z

I ran it now and it works as expected. Here's the output with the default threshold of 0.5

Here's the output of the appearance fractions

benoit74

Thanks a lot, LGTM!

use language threshold to compute zim language metadata

1cfc4c1

elfkuzco marked this pull request as draft October 24, 2024 14:09

elfkuzco marked this pull request as ready for review October 24, 2024 14:10

benoit74 self-requested a review October 24, 2024 14:33

benoit74 reviewed Oct 24, 2024

sort before grouping in language counts

66c6fe4

elfkuzco requested a review from benoit74 October 25, 2024 09:38

benoit74 approved these changes Oct 25, 2024

benoit74 merged commit e4d9b0e into openzim:main Oct 25, 2024
8 checks passed
