
Failure of thalesdoc: invalid UTF-8 start byte in an HTML file #177

Closed
benoit74 opened this issue Feb 9, 2024 · 12 comments · Fixed by #183

Comments

@benoit74
Collaborator

benoit74 commented Feb 9, 2024

With the latest zimit2 image (including warc2zim 2.0.0-dev3), we face a new issue around invalid UTF-8 in an HTML file (I processed locally the full crawl I recently uploaded at https://tmp.kiwix.org/ci/test-warc/thalesdoc_en_all_full_2024-02-04/).

Last log lines:

[warc2zim::2024-02-09 10:44:49,443] WARNING:Css transformation fails. Fallback to regex rewriter.
Content is `border-left-style: solid;border-left-width: 1px;border-left-color: #c0c0c0;border-right-style: solid;border-right-width: 1px;border-right-color: #c0c0c0;border-top-style: solid;border-top-width: 1px;border-top-color: #c0c0c0;border-bottom-style: solid;border-bottom-width: 1px;border-bottom-color: #c0c0c0;w`
[warc2zim::2024-02-09 10:45:10,064] WARNING:Css transformation fails. Fallback to regex rewriter.
Content is `width: 100%; cellspacing=`
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/main.py", line 90, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 264, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 493, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/items.py", line 38, in __init__
    (self.title, self.content) = Rewriter(path, record, known_urls).rewrite(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 36, in rewrite
    return self.rewrite_html(head_template, css_insert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 72, in rewrite_html
    return HtmlRewriter(self.url_rewriter, head_insert, css_insert).rewrite(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/html.py", line 79, in rewrite
    content = to_string(content)
              ^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/utils.py", line 47, in to_string
    input_ = input_.decode(  # pyright: ignore[reportGeneralTypeIssues, reportAttributeAccessIssue]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/encodings/utf_8_sig.py", line 23, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 6224: invalid start byte
@mgautierfr
Contributor

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 6224: invalid start byte

It seems utf-8 is not the encoding of the content. We should not assume the content is UTF-8; we must detect how to decode it before rewriting.
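
For illustration, a minimal sketch of what such detection could look like (a hypothetical helper, not the actual warc2zim code): honor a charset declared in an HTML meta tag, then fall back to statistical detection (e.g. the chardet package), then to UTF-8.

```python
# Hypothetical sketch only, not the warc2zim implementation.
import re

import chardet  # third-party statistical charset detector


def guess_encoding(content: bytes) -> str:
    # 1. Trust a charset declared in an HTML meta tag, if any.
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', content[:1024])
    if match:
        return match.group(1).decode("ascii")
    # 2. Fall back to statistical detection over the raw bytes.
    detected = chardet.detect(content)
    if detected["encoding"]:
        return detected["encoding"]
    # 3. Last resort: assume UTF-8.
    return "utf-8"


def to_text(content: bytes) -> str:
    return content.decode(guess_encoding(content))
```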

@kelson42
Contributor

If we crash each time we have an encoding problem (assuming we know what the proper encoding is), then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario.

@benoit74
Collaborator Author

@kelson42 I'm not sure I get what you mean by "assuming we know what the proper encoding is" and "be more flexible".

PR #183 does our best to guess the encoding based on different strategies. If this fails, what do you expect the scraper's behavior to be?

@kelson42
Contributor

kelson42 commented Feb 14, 2024

PR #183 does our best to guess the encoding based on different strategies.

This is about detecting the charset used; that is what I mean by "assuming we know what the proper encoding is".

But the rest of my comment applies: if we crash each time we have an encoding problem (...) then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario.

Put another way: we must be tolerant of text encoding errors.

@benoit74
Collaborator Author

Thank you, that is much clearer.

How do you imagine we should proceed when decoding fails due to text encoding errors?

Is simply not rewriting the HTML/CSS/JS and logging a clear warning OK?

An alternative could be to split the content into chunks until we have isolated all the faulty bytes, replace them with something "identifiable", rewrite, and finally put the faulty parts back in place. While this would provide value in some cases (it works well if the faulty part is some text inside an HTML tag, for instance), it looks pretty complex and won't work if the faulty part is some important code (HTML attribute names, JS code, ...).
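
For the record, Python's codecs module can express the "identifiable replacement" part of that idea with a custom error handler; a rough sketch (hypothetical, not a proposal for the actual implementation):

```python
import codecs


def mark_bad_bytes(error: UnicodeDecodeError) -> tuple[str, int]:
    # Replace each undecodable byte with a private-use marker plus the
    # byte value in hex, so faulty bytes stay identifiable after rewriting.
    bad = error.object[error.start:error.end]
    return "".join(f"\ue000{byte:02x}" for byte in bad), error.end


codecs.register_error("markbadbytes", mark_bad_bytes)

html = b"<p>8\x96bit</p>"  # 0x96 is the byte from the traceback above
text = html.decode("utf-8", errors="markbadbytes")
# ... rewrite `text` as usual, then re-encode and splice the original
# bytes back in place of the markers (the complex part mentioned above)
```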

@kelson42
Contributor

kelson42 commented Feb 14, 2024

HTML parsing is not like XML parsing! Crashing and not rewriting is not an option. If you have an HTML parser which is not able to get over a few wrongly encoded characters, then you will have to change it, I guess.

Check the story of XHTML ;)

@mgautierfr
Contributor

The problem is not even about not crashing when we cannot detect an encoding; I haven't found such content. We could "easily" loop over all known encodings until decoding succeeds, or decode as UTF-8 with a replacement character for undecodable bytes.
And since stopping the process is better than creating an unusable ZIM, I tend to say we should stop (crashing is one way to stop) if we cannot process some content.
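
Something like this, say (the encoding list is arbitrary here, just to illustrate):

```python
def decode_anyhow(content: bytes) -> str:
    # Try a few known encodings until one decodes without error.
    # (Note: a latin-1 entry would accept any byte sequence, so it
    # would make the fallback below unreachable.)
    for encoding in ("utf-8", "windows-1252", "gb2312"):
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Fall back to UTF-8 with U+FFFD replacement characters.
    return content.decode("utf-8", errors="replace")
```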

The real problem is succeeding in decoding content while actually using the wrong encoding.
This is a well-known problem, as old as text encodings themselves (https://duckduckgo.com/?q=why+I+have+chinese+character+on+my+webpage&t=ffab&ia=web).
In this case, we have no way to detect that the encoding is wrong, and we have decoded to garbage.

And as we store this (Unicode) garbage as UTF-8 in the ZIM file, users will have no way to recover the original content.

@kelson42
Contributor

And since stopping the process is better than creating an unusable ZIM, I tend to say we should stop (crashing is one way to stop) if we cannot process some content.

Absolutely, but if this happens too often for "stupid" reasons, then this is also a problem. If it displays properly in browsers, I see no reason why our HTML parser should fail.

@mgautierfr
Contributor

If it displays properly in browsers, I see no reason why our HTML parser should fail.

Browsers do not display it properly: they generate "replacement characters". The user can then ask to change the encoding (which is possible because the browser has access to the encoded bytes).

Once we have decoded the content (wrongly or not) and re-encoded it as UTF-8, you cannot change the encoding when you read it.

@mgautierfr
Contributor

See this content: https://community.mozilla.org/wp-content/plugins/events-manager/includes/js/events-manager.min.js?ver=6.4.1

Firefox detects it as windows-1252, which is wrong. In the middle of the content, you get "garbage":

=typeof $&&$.isArray||function(e){return"[object Array]"===Object.prototype.toString.call(e)},s={a:"[aḀá¸�ĂăÂâÇ�ǎȺⱥȦȧẠạÄäÀà Ã�áĀÄ�ÃãÅåąĄÃąĄ]",b:"[bâ�¢Î²Î’B฿ð�Œ�á›’]",c:"[cĆćĈĉČÄ�ÄŠÄ‹CÌ„c̄ÇçḈḉȻȼƇƈɕᴄCc]",d:"[dÄŽÄ�Ḋḋá¸�ḑḌá¸�ḒḓḎá¸�Ä�Ä‘D̦d̦ƉɖƊɗƋƌᵭá¶�ᶑȡᴅDdð]",e:"[eÉéÈèÊêḘḙĚěĔĕẼẽḚḛẺẻĖėËëĒēȨȩĘęᶒɆɇȄȅẾếỀá»�ỄễỂểḜá¸�ḖḗḔḕȆȇẸẹỆệⱸᴇEeɘÇ�Æ�Æ�ε]",f:"[fƑƒḞḟ]",g:"[gɢ₲ǤǥĜÄ�ÄžÄŸÄ¢Ä£Æ“É Ä Ä¡]",h:"[hĤĥĦħḨḩẖẖḤḥḢḣɦʰǶƕ]",i:"[iÃ�íÌìĬĭÎîÇ�Ç�Ã�ïḮḯĨĩĮįĪīỈỉȈȉȊȋỊịḬḭƗɨɨ̆ᵻᶖİiIıɪIi]",j:"[jȷĴĵɈɉÊ�ɟʲ]",k:"[kƘƙê�€ê��ḰḱǨǩḲḳḴḵκϰ₭]",l:"[lÅ�łĽľĻļĹĺḶḷḸḹḼḽḺḻĿŀȽƚⱠⱡⱢɫɬᶅɭȴʟLl]",n:"[nŃńǸǹŇňÑñṄṅŅņṆṇṊṋṈṉN̈n̈Æ�É²È Æžáµ°á¶‡É³ÈµÉ´ï¼®ï½ŽÅŠÅ‹]",o:"[oØøÖöÓóÒòÔôǑǒÅ�Å‘ÅŽÅ�ȮȯỌá»�ÆŸÉµÆ Æ¡á»Žá»�ÅŒÅ�ÕõǪǫȌÈ�Õ•Ö…]",p:"[pṔṕṖṗⱣᵽƤƥᵱ]",q:"[qê�–ê�—Ê ÉŠÉ‹ê�˜ê�™q̃]",r:"[rŔŕɌÉ�ŘřŖŗṘṙÈ�ȑȒȓṚṛⱤɽ]",s:"[sŚśṠṡṢṣꞨꞩŜÅ�ŠšŞşȘșS̈s̈]",t:"[tŤťṪṫŢţṬṭƮʈȚțṰṱṮṯƬƭ]",u:"[uŬŭɄʉỤụÜüÚúÙùÛûǓǔŰűŬŭƯưỦủŪūŨũŲųȔȕ∪]",v:"[vṼṽṾṿƲʋê�žê�Ÿâ±±Ê‹]",w:"[wẂẃẀáº�ŴŵẄẅẆẇẈẉ]",x:"[xẌáº�Ẋẋχ]",y:"[yÃ�ýỲỳŶŷŸÿỸỹẎáº�ỴỵɎÉ�Ƴƴ]",z:"[zŹźáº�ẑŽžŻżẒẓẔẕƵƶ]"},l=function(){var e,t,n,i,o="",a={};for(n in s)if(s.hasOwnProperty(n))for(o+=i=s[n].substring(2,s[n].length-1),e=0,t=i.length;e<t;e++)a[i.charAt(e)]=n;var r=new RegExp("["+o+"]","g");return function(e){return e.replace(r,(function(e){return a[e]})).toLowerCase()}}();return e})),f

You can ask Firefox to repair the encoding, and it will find UTF-8 (which is correct).
Why doesn't it find UTF-8 at first? I don't know; maybe it only tries the beginning of the content at first and checks the whole content when it repairs. But anyway, if we do the same thing (decode as windows-1252 and then store the decoded content as UTF-8), we will have broken JS and no one will be able to recover from that.
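
The garbage above is easy to reproduce: decoding UTF-8 bytes as windows-1252 yields exactly this pattern, and once re-encoded as UTF-8 the original bytes are gone. A short illustration:

```python
snippet = "aḀ"  # start of the character class in the JS above
mojibake = snippet.encode("utf-8").decode("windows-1252")
print(mojibake)  # 'aá¸€' — the kind of garbage shown above

stored = mojibake.encode("utf-8")  # what would end up in the ZIM
assert stored != snippet.encode("utf-8")  # original bytes are lost
```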

@benoit74
Collaborator Author

Correct me if I'm wrong; I will try to summarize the situation. I think we should split the problem into two parts.

The first part is correctly processing non-UTF-8 encodings like the ones we have on Thales Docs or on https://tmp.kiwix.org/ci/test-website/chinese-encoding.html. Currently this makes the scraper fail; PR #183 fixes it. I think it is already an important step forward and we could merge it once ready, i.e. once it properly covers the situations where the file contains only valid characters, just in a non-UTF-8 encoding.

The second part, and the remaining question, is what to do for a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where (see the snippet below for an illustration):

  • the declared encoding is wrong (the file is not saved in UTF-8 as declared in the HTML meta tag, but in Windows-1252)
  • most of the file is in Windows-1252
  • two characters (four bytes) are not Windows-1252 but Chinese GB2312
  • browsers display most of the page content well
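
A small illustration of why such a file defeats strict decoding (the byte values here are hypothetical, just mimicking the page's structure):

```python
# Mostly windows-1252 bytes, with one GB2312-encoded character spliced in.
body = "déjà vu ".encode("windows-1252") + "中".encode("gb2312")

try:
    body.decode("utf-8")  # fails: 0xe9 ('é' in windows-1252) is invalid here
except UnicodeDecodeError as exc:
    print(exc)

print(body.decode("windows-1252"))  # 'déjà vu ÖÐ' — '中' becomes garbage
```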

@benoit74
Collaborator Author

I transferred the second part to #185
