Zimit2: do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside #185

benoit74 · 2024-02-14T12:53:33Z

This issue covers the "second part" of #177 where it has already been discussed.

The issue concerns resources which contain multiple encodings (this shouldn't exist ... but we are quite sure it does, even if the Zimit2 test websites never had this issue).

A "handcrafted" sample is a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where:

- the declared encoding is wrong (the file is not saved in UTF-8 as declared in the HTML meta tag, but in Windows 1252)
- most of the file is in Windows 1252
- two characters (four bytes) are not in Windows 1252 but in Chinese GB2312
- browsers display most of the page content well
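To make the description above concrete, here is a minimal Python sketch that builds a comparable byte sequence and tries to decode it. The sample content, the variable names and the `try_decode` helper are illustrative assumptions; this is not the actual content of bad-encoding.html.

```python
# Hypothetical sample mimicking the situation described above.
head = b'<meta charset="utf-8"><p>'               # declared encoding: UTF-8
latin_part = "Café déjà vu".encode("cp1252")      # real encoding: Windows 1252
chinese_part = "中文".encode("gb2312")             # two characters, four bytes
sample = head + latin_part + b" " + chinese_part + b"</p>"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Strict decoding with the declared encoding fails on the Windows 1252 bytes.
declared = try_decode(sample, "utf-8")

# Decoding everything as Windows 1252 succeeds, but the four GB2312 bytes
# come out as unrelated Latin characters instead of the Chinese text.
actual = try_decode(sample, "cp1252")
```

In this sketch, no single encoding decodes the whole file correctly, which is the core of the problem discussed in this issue.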

We could decide to:

1. stop/crash the scraper (this is what will happen after "Correctly detect encoding and decode bytes" #183 is merged)
2. transfer the raw content as-is to the ZIM (without any rewriting)
3. do our best to decode / rewrite as much as possible
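The three options can be sketched as a small policy switch. This is only an illustration under stated assumptions: `Policy`, `process` and the trivial `rewrite` placeholder are hypothetical names and do not reflect the actual warc2zim code.

```python
from enum import Enum


class Policy(Enum):
    FAIL = 1         # option 1: stop/crash the scraper
    PASSTHROUGH = 2  # option 2: store the raw bytes without rewriting
    BEST_EFFORT = 3  # option 3: decode and rewrite as much as possible


def rewrite(text: str) -> str:
    # Placeholder for the real HTML/CSS/JS rewriting (assumption).
    return text.replace('href="/', 'href="./')


def process(data: bytes, encoding: str, policy: Policy) -> bytes:
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        if policy is Policy.FAIL:
            raise
        if policy is Policy.PASSTHROUGH:
            return data
        # Best effort: undecodable bytes become U+FFFD.
        text = data.decode(encoding, errors="replace")
    # Always emit UTF-8 so that U+FFFD can be encoded.
    return rewrite(text).encode("utf-8")
```

With option 2 the content survives untouched but its links are not rewritten, while option 3 keeps the rewriting at the cost of a few replaced characters.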

If I'm not mistaken, @kelson42 has clearly indicated that only option 3 is acceptable from his PoV while @mgautierfr is more in favor of option 1.

I tend to prefer option 3, but I consider that this is not the highest-priority issue we have on Zimit2, especially since we have not encountered the problem in test recipes.

mgautierfr · 2024-02-14T14:36:29Z

To be exact, I am in favor of option 3, but only if it produces something usable[*]. If that cannot be done, then option 1.

[*] Defining "usable" is not an easy task:

- A page with Cyrillic content wrongly decoded and written back with a bunch of "garbage" Chinese characters is not usable.
- The same page inside a whole ZIM that is otherwise correctly encoded/decoded leaves the ZIM archive itself usable.
- If all pages in the ZIM file are wrongly decoded, the ZIM archive is not usable.
- A JS script wrongly encoded/decoded that only sends stats to the server: we don't care.
- A JS script wrongly encoded/decoded that fetches content and sets up the HTML: we care.

benoit74 · 2024-05-17T15:58:38Z

Given the insight from #221, I seriously doubt that anything more is possible.

benoit74 · 2024-07-26T13:17:21Z

The scraper no longer crashes when there are multiple encodings in a single file, especially since #314.

We are already close to option 3: only the bad characters (those in another encoding than the rest of the document) are "replaced" by "something".
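As an illustration of what such a replacement can look like, Python's built-in `errors="replace"` handler substitutes U+FFFD for undecodable bytes and keeps the rest of the text intact. Whether warc2zim does exactly this is an assumption, and the sample bytes below are made up for the example.

```python
# Hypothetical sample: UTF-8 text with four stray GB2312 bytes in the middle.
good_before = "Résumé: ".encode("utf-8")
stray = "中文".encode("gb2312")
good_after = " (fin)".encode("utf-8")
data = good_before + stray + good_after

# Undecodable bytes are replaced by U+FFFD; valid text is left untouched.
text = data.decode("utf-8", errors="replace")
replaced = text.count("\ufffd")
```

The surrounding text stays readable and only the stray bytes are lost, which is similar to what a browser shows for such a page.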

I will hence close the issue: we have no lead on how to handle this situation better than today, and nothing is really annoying in the current behavior. The current experience with warc2zim on https://tmp.kiwix.org/ci/test-website/bad-encoding.html is identical to the one in most browsers.

benoit74 changed the title ~~Do not crash when a rewriten resource (HTML/CSS/JS) has multiple encoding inside~~ Zimit2: do not crash when a rewriten resource (HTML/CSS/JS) has multiple encoding inside Feb 14, 2024

benoit74 mentioned this issue Feb 14, 2024

Failure of thalesdoc: invalid UTF-8 start byte in an HTML file #177

Closed

benoit74 mentioned this issue May 14, 2024

Zimit2: another encoding problem with solidarité-numérique #221

Closed

benoit74 added this to the later milestone May 21, 2024

kelson42 assigned benoit74 Jul 26, 2024

kelson42 added the question label Jul 26, 2024

benoit74 closed this as not planned Jul 26, 2024

benoit74 modified the milestones: later, 2.1.0 Aug 5, 2024
