
Failure of thalesdoc: invalid UTF-8 start byte in an HTML file #177

Closed
benoit74 opened this issue Feb 9, 2024 · 12 comments · Fixed by #183

Comments

@benoit74
Collaborator

benoit74 commented Feb 9, 2024

With the latest zimit2 image (including warc2zim 2.0.0-dev3), we face a new issue around invalid UTF-8 in an HTML file (I processed locally the full crawl I recently uploaded at https://tmp.kiwix.org/ci/test-warc/thalesdoc_en_all_full_2024-02-04/).

Last log lines:

[warc2zim::2024-02-09 10:44:49,443] WARNING:Css transformation fails. Fallback to regex rewriter.
Content is `border-left-style: solid;border-left-width: 1px;border-left-color: #c0c0c0;border-right-style: solid;border-right-width: 1px;border-right-color: #c0c0c0;border-top-style: solid;border-top-width: 1px;border-top-color: #c0c0c0;border-bottom-style: solid;border-bottom-width: 1px;border-bottom-color: #c0c0c0;w`
[warc2zim::2024-02-09 10:45:10,064] WARNING:Css transformation fails. Fallback to regex rewriter.
Content is `width: 100%; cellspacing=`
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/main.py", line 90, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 264, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 493, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/items.py", line 38, in __init__
    (self.title, self.content) = Rewriter(path, record, known_urls).rewrite(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 36, in rewrite
    return self.rewrite_html(head_template, css_insert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 72, in rewrite_html
    return HtmlRewriter(self.url_rewriter, head_insert, css_insert).rewrite(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/html.py", line 79, in rewrite
    content = to_string(content)
              ^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/utils.py", line 47, in to_string
    input_ = input_.decode(  # pyright: ignore[reportGeneralTypeIssues, reportAttributeAccessIssue]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/encodings/utf_8_sig.py", line 23, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 6224: invalid start byte
@mgautierfr
Contributor

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 6224: invalid start byte

It seems utf-8 is not the encoding of the content. We should not assume the content is UTF-8; we must detect how to decode it before rewriting.
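
For illustration, a minimal sketch of what such detection could look like (a hypothetical helper, not the actual warc2zim code): honor a charset declared in an HTML meta tag, then fall back to statistical detection (e.g. the chardet package), then to UTF-8.

```python
# Hypothetical sketch only, not the warc2zim implementation.
import re

import chardet  # third-party statistical charset detector


def guess_encoding(content: bytes) -> str:
    # 1. Trust a charset declared in an HTML meta tag, if any.
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', content[:1024])
    if match:
        return match.group(1).decode("ascii")
    # 2. Fall back to statistical detection over the raw bytes.
    detected = chardet.detect(content)
    if detected["encoding"]:
        return detected["encoding"]
    # 3. Last resort: assume UTF-8.
    return "utf-8"


def to_text(content: bytes) -> str:
    return content.decode(guess_encoding(content))
```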

@kelson42
Contributor

If we crash each time we have an encoding problem (assuming we know what the proper encoding is), then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario.

@benoit74
Collaborator Author

@kelson42 I'm not sure I get what you mean by "assuming we know what the proper encoding is" and "be more flexible".

PR #183 does our best to guess the encoding based on different strategies. If this fails, what do you expect the scraper's behavior to be?

@kelson42
Contributor

kelson42 commented Feb 14, 2024

PR #183 does our best to guess the encoding based on different strategies.

This is about detecting the charset used; that is what I mean by "assuming we know what the proper encoding is".

But the rest of my comment applies: if we crash each time we have an encoding problem (...) then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario.

Put another way: we must be tolerant of text encoding errors.

@benoit74
Collaborator Author

Thank you, that is much clearer.

How do you imagine we should proceed when decoding fails due to text encoding errors?

Is simply not rewriting the HTML/CSS/JS and logging a clear warning OK?

An alternative could be to split the content into chunks until we have isolated all the faulty bytes, replace them with something "identifiable", rewrite, and finally put the faulty parts back in place. While this would provide value in some cases (it works well if the faulty part is some text inside an HTML tag, for instance), it looks pretty complex and won't work if the faulty part is some important code (HTML attribute names, JS code, ...).
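
For the record, Python's codecs module can express the "identifiable replacement" part of that idea with a custom error handler; a rough sketch (hypothetical, not a proposal for the actual implementation):

```python
import codecs


def mark_bad_bytes(error: UnicodeDecodeError) -> tuple[str, int]:
    # Replace each undecodable byte with a private-use marker plus the
    # byte value in hex, so faulty bytes stay identifiable after rewriting.
    bad = error.object[error.start:error.end]
    return "".join(f"\ue000{byte:02x}" for byte in bad), error.end


codecs.register_error("markbadbytes", mark_bad_bytes)

html = b"<p>8\x96bit</p>"  # 0x96 is the byte from the traceback above
text = html.decode("utf-8", errors="markbadbytes")
# ... rewrite `text` as usual, then re-encode and splice the original
# bytes back in place of the markers (the complex part mentioned above)
```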

@kelson42
Contributor

kelson42 commented Feb 14, 2024

HTML parsing is not like XML parsing! Crashing and not rewriting is not an option. If you have an HTML parser which is not able to get over a few wrongly encoded characters, then you will have to change it, I guess.

Check the story of XHTML ;)

@mgautierfr
Contributor

The problem is not even about not crashing when we cannot detect an encoding; I haven't found such content. We could "easily" loop over all known encodings until decoding succeeds, or decode as UTF-8 with a replacement character for undecodable bytes.
And since stopping the process is better than creating an unusable ZIM, I tend to say we should stop (crashing is one way to stop) if we cannot process some content.
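
Something like this, say (the encoding list is arbitrary here, just to illustrate):

```python
def decode_anyhow(content: bytes) -> str:
    # Try a few known encodings until one decodes without error.
    # (Note: a latin-1 entry would accept any byte sequence, so it
    # would make the fallback below unreachable.)
    for encoding in ("utf-8", "windows-1252", "gb2312"):
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Fall back to UTF-8 with U+FFFD replacement characters.
    return content.decode("utf-8", errors="replace")
```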

The real problem is succeeding in decoding content while actually using the wrong encoding.
This is a well-known problem, as old as text encodings themselves (https://duckduckgo.com/?q=why+I+have+chinese+character+on+my+webpage&t=ffab&ia=web).
In this case, we have no way to detect that the encoding is wrong, and we have decoded to garbage.

And as we store this (Unicode) garbage as UTF-8 in the ZIM file, users will have no way to recover the original content.

@kelson42
Contributor

And since stopping the process is better than creating an unusable ZIM, I tend to say we should stop (crashing is one way to stop) if we cannot process some content.

Absolutely, but if this happens too often for "stupid" reasons, then this is also a problem. If it displays properly in browsers, I see no reason why our HTML parser should fail.

@mgautierfr
Contributor

If it displays properly in browsers, I see no reason why our HTML parser should fail.

Browsers do not display it properly: they generate "replacement characters". The user can then ask to change the encoding (which is possible because the browser has access to the encoded bytes).

Once we have decoded the content (wrongly or not) and re-encoded it as UTF-8, you cannot change the encoding when you read it.

@mgautierfr
Contributor

See this content: https://community.mozilla.org/wp-content/plugins/events-manager/includes/js/events-manager.min.js?ver=6.4.1

Firefox detects it as windows-1252, which is wrong. In the middle of the content, you get "garbage":

=typeof $&&$.isArray||function(e){return"[object Array]"===Object.prototype.toString.call(e)},s={a:"[aḀá¸�ĂăÂâÇ�ǎȺⱥȦȧẠạÄäÀà Ã�áĀÄ�ÃãÅåąĄÃąĄ]",b:"[bâ�¢Î²Î’B฿ð�Œ�á›’]",c:"[cĆćĈĉČÄ�ÄŠÄ‹CÌ„c̄ÇçḈḉȻȼƇƈɕᴄCc]",d:"[dÄŽÄ�Ḋḋá¸�ḑḌá¸�ḒḓḎá¸�Ä�Ä‘D̦d̦ƉɖƊɗƋƌᵭá¶�ᶑȡᴅDdð]",e:"[eÉéÈèÊêḘḙĚěĔĕẼẽḚḛẺẻĖėËëĒēȨȩĘęᶒɆɇȄȅẾếỀá»�ỄễỂểḜá¸�ḖḗḔḕȆȇẸẹỆệⱸᴇEeɘÇ�Æ�Æ�ε]",f:"[fƑƒḞḟ]",g:"[gɢ₲ǤǥĜÄ�ÄžÄŸÄ¢Ä£Æ“É Ä Ä¡]",h:"[hĤĥĦħḨḩẖẖḤḥḢḣɦʰǶƕ]",i:"[iÃ�íÌìĬĭÎîÇ�Ç�Ã�ïḮḯĨĩĮįĪīỈỉȈȉȊȋỊịḬḭƗɨɨ̆ᵻᶖİiIıɪIi]",j:"[jȷĴĵɈɉÊ�ɟʲ]",k:"[kƘƙê�€ê��ḰḱǨǩḲḳḴḵκϰ₭]",l:"[lÅ�łĽľĻļĹĺḶḷḸḹḼḽḺḻĿŀȽƚⱠⱡⱢɫɬᶅɭȴʟLl]",n:"[nŃńǸǹŇňÑñṄṅŅņṆṇṊṋṈṉN̈n̈Æ�É²È Æžáµ°á¶‡É³ÈµÉ´ï¼®ï½ŽÅŠÅ‹]",o:"[oØøÖöÓóÒòÔôǑǒÅ�Å‘ÅŽÅ�ȮȯỌá»�ÆŸÉµÆ Æ¡á»Žá»�ÅŒÅ�ÕõǪǫȌÈ�Õ•Ö…]",p:"[pṔṕṖṗⱣᵽƤƥᵱ]",q:"[qê�–ê�—Ê ÉŠÉ‹ê�˜ê�™q̃]",r:"[rŔŕɌÉ�ŘřŖŗṘṙÈ�ȑȒȓṚṛⱤɽ]",s:"[sŚśṠṡṢṣꞨꞩŜÅ�ŠšŞşȘșS̈s̈]",t:"[tŤťṪṫŢţṬṭƮʈȚțṰṱṮṯƬƭ]",u:"[uŬŭɄʉỤụÜüÚúÙùÛûǓǔŰűŬŭƯưỦủŪūŨũŲųȔȕ∪]",v:"[vṼṽṾṿƲʋê�žê�Ÿâ±±Ê‹]",w:"[wẂẃẀáº�ŴŵẄẅẆẇẈẉ]",x:"[xẌáº�Ẋẋχ]",y:"[yÃ�ýỲỳŶŷŸÿỸỹẎáº�ỴỵɎÉ�Ƴƴ]",z:"[zŹźáº�ẑŽžŻżẒẓẔẕƵƶ]"},l=function(){var e,t,n,i,o="",a={};for(n in s)if(s.hasOwnProperty(n))for(o+=i=s[n].substring(2,s[n].length-1),e=0,t=i.length;e<t;e++)a[i.charAt(e)]=n;var r=new RegExp("["+o+"]","g");return function(e){return e.replace(r,(function(e){return a[e]})).toLowerCase()}}();return e})),f

You can ask Firefox to repair the encoding, and it will find UTF-8 (which is correct).
Why doesn't it find UTF-8 at first? I don't know; maybe it only tries the beginning of the content at first and checks the whole content when it repairs. But anyway, if we do the same thing (decode as windows-1252 and then store the decoded content as UTF-8), we will have broken JS and no one will be able to recover from that.
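
The garbage above is easy to reproduce: decoding UTF-8 bytes as windows-1252 yields exactly this pattern, and once re-encoded as UTF-8 the original bytes are gone. A short illustration:

```python
snippet = "aḀ"  # start of the character class in the JS above
mojibake = snippet.encode("utf-8").decode("windows-1252")
print(mojibake)  # 'aá¸€' — the kind of garbage shown above

stored = mojibake.encode("utf-8")  # what would end up in the ZIM
assert stored != snippet.encode("utf-8")  # original bytes are lost
```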

@benoit74
Collaborator Author

Correct me if I'm wrong; I will try to summarize the situation. I think we should split the problem into two parts.

The first part is correctly processing non-UTF-8 encodings like the ones we have on Thales Docs or on https://tmp.kiwix.org/ci/test-website/chinese-encoding.html. Currently this makes the scraper fail; PR #183 fixes it. I think it is already an important step forward and we could merge it once ready, i.e. once it properly covers the situations where the file contains only valid characters, just in a non-UTF-8 encoding.

The second part, and the remaining question, is what to do for a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where (see the snippet below for an illustration):

  • the declared encoding is wrong (the file is not saved in UTF-8 as declared in the HTML meta tag, but in Windows-1252)
  • most of the file is in Windows-1252
  • two characters (four bytes) are not Windows-1252 but Chinese GB2312
  • browsers display most of the page content well
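
A small illustration of why such a file defeats strict decoding (the byte values here are hypothetical, just mimicking the page's structure):

```python
# Mostly windows-1252 bytes, with one GB2312-encoded character spliced in.
body = "déjà vu ".encode("windows-1252") + "中".encode("gb2312")

try:
    body.decode("utf-8")  # fails: 0xe9 ('é' in windows-1252) is invalid here
except UnicodeDecodeError as exc:
    print(exc)

print(body.decode("windows-1252"))  # 'déjà vu ÖÐ' — '中' becomes garbage
```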

@benoit74
Collaborator Author

I transferred the second part to #185
