use language threshold to compute zim language metadata #230

Merged on Oct 25, 2024 (2 commits)

elfkuzco
Contributor

Rationale

Take into account how frequently each language appears across videos when computing the ZIM Language metadata. This resolves #212.

Changes

  • Add a new CLI argument --language-threshold (a float between 0 and 1) that sets the minimum appearance ratio a language must exceed.
  • When building the Language metadata, skip languages whose appearance ratio across all videos is not greater than this threshold.
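A minimal sketch of the two changes listed above, not the code merged in this PR. The names unit_interval and languages_above_threshold are hypothetical, and the input is assumed to be one collection of language codes per video. The default of 0.5 matches the default threshold mentioned later in this conversation.

```python
import argparse
from collections import Counter
from typing import Iterable


def unit_interval(value: str) -> float:
    """argparse type check: accept only floats between 0 and 1 (hypothetical helper)."""
    number = float(value)
    if not 0 <= number <= 1:
        raise argparse.ArgumentTypeError(f"{value} is not between 0 and 1")
    return number


def languages_above_threshold(
    video_languages: Iterable[Iterable[str]], threshold: float
) -> list[str]:
    """Return languages whose appearance ratio across videos is strictly greater than threshold."""
    # Deduplicate per video so a language counts at most once per video.
    videos = [set(langs) for langs in video_languages]
    if not videos:
        return []
    counts = Counter(lang for langs in videos for lang in langs)
    total = len(videos)
    return sorted(lang for lang, seen in counts.items() if seen / total > threshold)


parser = argparse.ArgumentParser()
parser.add_argument("--language-threshold", type=unit_interval, default=0.5)
args = parser.parse_args(["--language-threshold", "0.5"])

# Four videos: "en" appears in all of them, "fr" in half, "de" in a quarter.
videos = [["en", "fr"], ["en"], ["en", "fr"], ["en", "de"]]
kept = languages_above_threshold(videos, args.language_threshold)
```

Because the comparison is strict, a language that appears in exactly half of the videos is dropped at a threshold of 0.5.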

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 4.96%. Comparing base (4acb963) to head (66c6fe4).
Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
src/ted2zim/scraper.py | 0.00% | 5 Missing ⚠️
src/ted2zim/entrypoint.py | 0.00% | 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #230      +/-   ##
========================================
- Coverage   5.00%   4.96%   -0.04%     
========================================
  Files          8       8              
  Lines       1100    1108       +8     
  Branches     239     242       +3     
========================================
  Hits          55      55              
- Misses      1044    1052       +8     
  Partials       1       1              


@elfkuzco elfkuzco marked this pull request as draft October 24, 2024 14:09
@elfkuzco elfkuzco marked this pull request as ready for review October 24, 2024 14:10
@benoit74 benoit74 self-requested a review October 24, 2024 14:33
Collaborator

@benoit74 benoit74 left a comment

Thanks a lot! Did you test this change on a real case?

The wildlife topic might be interesting: according to the new definition in the issue, it currently reports too many languages: https://library.kiwix.org/raw/ted_mul_wildlife_2024-10/meta/Language

Corresponding Zimfarm recipe is at https://farm.openzim.org/recipes/ted_topic_wildlife (for inspiration around settings).

@elfkuzco
Contributor Author

Yeah, I tested it on the example. However, that contained a lot of entries with only one language. I will run it against the wildlife topic and let you know.

@elfkuzco
Contributor Author

@benoit74, I think the way the _lang_counts are computed might not be right, because it uses groupby on unsorted data. I ran the script with a couple of added print statements, and here is what I observed:

subtitle_lang_counts sorted before group:  {'ar': 4, 'de': 1, 'en': 4, 'es': 4, 'fa': 1, 'fr': 2, 'he': 2, 'hu': 1, 'id': 1, 'it': 1, 'ko': 2, 'pt': 2, 'pt-br': 1, 'ro': 1, 'sv': 1, 'th': 1, 'tr': 2, 'vi': 3, 'zh-cn': 2, 'zh-tw': 3}
subtitle_lang_counts unsorted before group:  {'en': 1, 'es': 1, 'ar': 1, 'fr': 1, 'de': 1, 'th': 1, 'vi': 1, 'ko': 1, 'zh-tw': 1, 'ro': 1, 'sv': 1, 'zh-cn': 1, 'it': 1, 'pt-br': 1, 'he': 1, 'fa': 1, 'hu': 1, 'id': 1, 'tr': 1, 'pt': 1}

The output where every count is 1 comes from the existing implementation, which does not sort the elements before passing them to groupby. The output with counts greater than 1 is what I get after sorting the elements first. I looked up groupby, and it does not behave like SQL's GROUP BY statement. Rather, it works like the Linux uniq command: it starts a new group every time the value of the subtitle language changes. That explains why every subtitle language ends up with a count of 1.
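The behavior described above can be reproduced in a few lines, assuming the groupby in question is Python's itertools.groupby. The sample list and the helper count_with_groupby are made up for illustration and are not taken from the scraper. collections.Counter is shown as an alternative that does not depend on input order.

```python
from collections import Counter
from itertools import groupby

# Hypothetical subtitle languages in the order videos were visited.
langs = ["en", "es", "en", "ar", "es", "en"]


def count_with_groupby(values):
    """Count items per key using groupby; only correct when values are sorted."""
    # groupby starts a new group whenever the key changes, so on unsorted
    # input a later group for the same key overwrites the earlier count.
    return {key: len(list(group)) for key, group in groupby(values)}


unsorted_counts = count_with_groupby(langs)        # every language ends up with 1
sorted_counts = count_with_groupby(sorted(langs))  # true counts per language
counter_counts = dict(Counter(langs))              # order-independent alternative

print(unsorted_counts)
print(sorted_counts)
print(counter_counts)
```

Sorting before groupby and using Counter give the same counts; Counter avoids the sort and cannot be affected by input order.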

What do you think?

@elfkuzco
Contributor Author

Um, is there a way to inspect the metadata of a ZIM from the CLI? I tried the zimdump command but it doesn't show anything. So far, I have only added a debug statement to print the self.zim_languages attribute, and that's how I infer that the filter works.

@benoit74
Collaborator

benoit74 commented Oct 24, 2024

Since you probably have Python and python-libzim somewhere locally, you can open the ZIM with python-libzim and print the metadata:

from libzim import Archive
zim = Archive("test.zim")
item = zim.get_metadata_item("Language")
print(bytes(item.content).decode())
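To compare the result with the expected language list, the raw value can be split into individual codes. The sketch below assumes the Language metadata is a comma-separated list of language codes, as the openZIM metadata convention describes, and that python-libzim is installed. The helper names parse_language_metadata and read_zim_languages are hypothetical.

```python
def parse_language_metadata(raw: bytes) -> list[str]:
    """Split a raw Language metadata value into a list of language codes."""
    # Assumption: codes are comma-separated; empty fragments are ignored.
    return [code.strip() for code in raw.decode("utf-8").split(",") if code.strip()]


def read_zim_languages(path: str) -> list[str]:
    """Open a ZIM file and return its Language metadata as a list of codes."""
    # Imported lazily so the parser above can be used without python-libzim.
    from libzim.reader import Archive

    zim = Archive(path)
    return parse_language_metadata(zim.get_metadata("Language"))


# Example with a made-up metadata value (not taken from a real ZIM):
codes = parse_language_metadata(b"eng,fra, spa")
```

Calling read_zim_languages("test.zim") would then return the list of codes for the file used in the snippet above.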

@elfkuzco
Contributor Author

I ran it just now and it works as expected. Here's the output with the default threshold of 0.5:
[Screenshot: Screenshot_20241025_102056]

Here's the output of the appearance fractions:
[Screenshot: Screenshot_20241025_103626]

Collaborator

@benoit74 benoit74 left a comment

Thanks a lot, LGTM!

@benoit74 benoit74 merged commit e4d9b0e into openzim:main Oct 25, 2024
8 checks passed
Development

Successfully merging this pull request may close these issues.

Revisit rules to add a lang to ZIM Language metadata
2 participants