Zimit2: do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside #185

Closed
benoit74 opened this issue Feb 14, 2024 · 3 comments

@benoit74
Collaborator

benoit74 commented Feb 14, 2024

This issue covers the "second part" of #177 where it has already been discussed.

The issue concerns resources which contain multiple encodings (this shouldn't exist ... but we are quite sure it does, even if the Zimit2 test websites never had this issue).

A "handcrafted" sample is a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where:

  • the declared encoding is wrong (the HTML meta tag declares UTF-8, but the file is actually saved as Windows-1252)
  • most of the file is in Windows-1252
  • two characters (four bytes) are not in Windows-1252 but in Chinese GB2312
  • browsers display most of the page content correctly
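
To make the situation concrete, here is a minimal sketch that builds a byte string with the same shape as the sample page. The text used is a hypothetical stand-in, not the actual content of bad-encoding.html, and `try_decode` is an illustrative helper, not part of any library.

```python
from typing import Optional

# Hypothetical sample text; the real test page content differs.
latin_part = "Café déjà vu: ".encode("cp1252")  # Windows-1252 bytes
chinese_part = "你好".encode("gb2312")           # two characters, four bytes
page_bytes = latin_part + chinese_part


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The declared encoding (UTF-8) cannot decode the file strictly.
as_utf8 = try_decode(page_bytes, "utf-8")      # None

# Windows-1252 accepts every byte here, but the GB2312 bytes
# come out as mojibake instead of Chinese characters.
as_cp1252 = try_decode(page_bytes, "cp1252")   # "Café déjà vu: ÄãºÃ"
```

Note that in this sketch no single encoding decodes the whole file correctly: strict UTF-8 fails outright, while Windows-1252 silently yields wrong characters for the GB2312 portion.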

We could decide to:

  1. stop/crash the scraper (this is what will happen after "Correctly detect encoding and decode bytes" #183 is merged)
  2. transfer the raw content as-is to the ZIM (without any rewriting)
  3. do our best to decode / rewrite as much as possible
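
The three options can be sketched as a single decision point. This is only an illustration under stated assumptions: `Policy`, `process_resource` and the `rewrite` callback are hypothetical names, not the warc2zim API, and re-encoding the output as UTF-8 is a simplification chosen for the example.

```python
from enum import Enum
from typing import Callable


class Policy(Enum):
    FAIL = 1          # option 1: stop the scraper
    PASSTHROUGH = 2   # option 2: store raw bytes, no rewriting
    BEST_EFFORT = 3   # option 3: decode what we can, then rewrite


def process_resource(
    content: bytes,
    declared_encoding: str,
    policy: Policy,
    rewrite: Callable[[str], str],
) -> bytes:
    """Decode, rewrite and re-encode a resource (hypothetical helper)."""
    try:
        text = content.decode(declared_encoding)
    except UnicodeDecodeError:
        if policy is Policy.FAIL:
            raise
        if policy is Policy.PASSTHROUGH:
            return content
        # Undecodable bytes become U+FFFD, the rest is kept.
        text = content.decode(declared_encoding, errors="replace")
    # Assumption for this sketch: output is always written as UTF-8.
    return rewrite(text).encode("utf-8")
```

With option 2 the resource survives byte-for-byte but any URLs inside it are left unrewritten, which is the trade-off against option 3.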

If I'm not mistaken, @kelson42 has clearly indicated that only option 3 is acceptable from his PoV while @mgautierfr is more in favor of option 1.

I tend to prefer option 3, but I consider that this is not the highest-priority issue we have on Zimit2, especially since we have not encountered the problem in test recipes.

@benoit74 benoit74 changed the title from "Do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside" to "Zimit2: do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside" Feb 14, 2024
@mgautierfr
Contributor

To be exact, I am in favor of option 3, but only if it produces something usable[*]. If that cannot be done, then option 1.

[*] Defining "usable" is not an easy task:

  • A page with Cyrillic content that is wrongly decoded and written back as a bunch of "garbage" Chinese characters is not usable.
  • That same page inside a ZIM whose other pages are correctly encoded/decoded leaves the ZIM archive itself usable.
  • If all pages in the ZIM file are wrongly decoded, the ZIM archive is not usable.
  • A wrongly encoded/decoded JS script that sends stats to the server: we don't care.
  • A wrongly encoded/decoded JS script that fetches content and sets up the HTML: we care.

@benoit74
Collaborator Author

Given the insight from #221, I seriously doubt that anything more is possible.

@benoit74 benoit74 added this to the later milestone May 21, 2024
@kelson42 kelson42 added the question Further information is requested label Jul 26, 2024
@benoit74
Collaborator Author

So far, the scraper no longer crashes when there are multiple encodings in a single file, especially since #314.

We are already close to option 3: only the bad characters (those in a different encoding from the rest of the document) are "replaced" by "something".
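
As an illustration of this kind of replacement (not a description of warc2zim internals, which this thread does not detail), Python's built-in `errors="replace"` decoding behaves in a similar way: text in the main encoding is preserved, and stray bytes become U+FFFD. The sample strings below are hypothetical.

```python
# Mostly valid UTF-8, with four stray GB2312 bytes in the middle.
utf8_part = "Hello, café! ".encode("utf-8")
stray_part = "你好".encode("gb2312")
tail_part = " bye".encode("utf-8")
mixed = utf8_part + stray_part + tail_part

# Lenient decoding: invalid byte sequences are replaced by U+FFFD,
# everything that is valid UTF-8 is decoded normally.
decoded = mixed.decode("utf-8", errors="replace")

# The surrounding text is intact; the Chinese characters are lost.
print(decoded.startswith("Hello, café! "))  # True
print(decoded.endswith(" bye"))             # True
print("\ufffd" in decoded)                  # True
```

The original characters cannot be recovered from the decoded text afterwards, so this approach trades fidelity of the few bad characters for a readable, rewritable document.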

I will hence close the issue: we have no lead on how to handle this situation better than we do today, and nothing about the current behavior is really annoying. The current experience with warc2zim on https://tmp.kiwix.org/ci/test-website/bad-encoding.html is identical to that of most browsers.

@benoit74 benoit74 closed this as not planned Jul 26, 2024
@benoit74 benoit74 modified the milestones: later, 2.1.0 Aug 5, 2024