Failure of thalesdoc: invalid UTF-8 start byte in an HTML file #177
Comments
Seems like
If we crash each time we have an encoding problem (assuming we know what is the proper encoding), then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario.
This is to detect the charset used; this is what I meant by "assuming we know what is the proper encoding". But the rest of my comment applies: if we crash each time we have an encoding problem (...) then we have a problem, because this is a very common scenario. We have to be more flexible in this scenario. Using another adjective: we must be tolerant of text encoding errors.
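For context, the detection step discussed here could look like the following minimal sketch. It uses the chardet library purely for illustration; which detection library the scraper actually relies on is an assumption here.

```python
import chardet

def sniff_encoding(raw: bytes) -> str | None:
    """Guess the charset of raw bytes, e.g. 'utf-8'; None if unreliable."""
    result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    # Detection is statistical: low confidence means the guess is unreliable,
    # and even a high-confidence guess can be wrong.
    if result["encoding"] and result["confidence"] > 0.5:
        return result["encoding"]
    return None
```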
Thank you, it is way clearer. How do you imagine we should proceed when decoding fails due to text encoding errors? Is simply not rewriting the HTML/CSS/JS and logging a clear warning OK? An alternative could be to split the content into chunks until we isolate all faulty bytes, replace them with something 'identifiable', rewrite, and finally put the faulty parts back in place. While this would provide value in some cases (it works well if the faulty part is some text in an HTML tag, for instance), it looks pretty complex and won't work if the faulty part is some important code (HTML tag attribute names, JS code, ...).
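Isolating the faulty bytes may actually be simpler than chunk-splitting: in Python, UnicodeDecodeError carries the offsets of the offending bytes, so they can be located in a single pass. A minimal sketch of just that isolation step (the function name is invented here, and the replace-rewrite-restore steps are left out):

```python
def find_bad_ranges(data: bytes, encoding: str = "utf-8") -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of every span that fails to decode."""
    ranges = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are offsets within data[pos:]
            ranges.append((pos + exc.start, pos + exc.end))
            pos += exc.end
    return ranges
```

The hard part pointed out above remains, though: if such a range falls inside an attribute name or a JS token, swapping a placeholder in and out around the rewriter can still break the document.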
HTML parsing is not like XML parsing! Crashing and not rewriting is not an option. If you have an HTML parser which is not able to get past a few wrongly encoded characters, then you will have to change it, I guess. Check the story of XHTML ;)
The problem is not even crashing when we cannot detect an encoding: I haven't found such content, and we could "easily" loop over all known encodings and try to decode until we succeed, or decode as UTF-8 with a replacement char for unknown bytes. The real problem is succeeding to decode content while using the wrong encoding. As we store this (Unicode) garbage in UTF-8 in the ZIM file, the user will have no way to recover the original content.
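A hedged sketch of the fallback loop described above; the candidate list is illustrative, not the scraper's actual one:

```python
def lenient_decode(data: bytes) -> str:
    # Candidate encodings, tried in order (illustrative list).
    for encoding in ("utf-8", "gb18030", "windows-1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: map undecodable bytes to U+FFFD and keep going.
    return data.decode("utf-8", errors="replace")
```

Note that single-byte encodings such as windows-1252 accept almost any byte sequence, so a "successful" decode here may already be mojibake, which is exactly the failure mode described above.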
Absolutely, but if this happens too often for "stupid" reasons, then this is also a problem. If it displays properly in browsers, I see no reason why our HTML parser should fail.
Browsers do not display it properly: they generate "replacement chars". The user can then ask to change the encoding (which is possible, as the browser has access to the encoded bytes). Once we have decoded the content (wrongly or not) and re-encoded it in UTF-8, you cannot change the encoding when you read it.
See this content: https://community.mozilla.org/wp-content/plugins/events-manager/includes/js/events-manager.min.js?ver=6.4.1
Firefox detects it as
You can ask Firefox to repair the encoding, and it will find UTF-8 (which is correct).
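To make the asymmetry concrete: the browser can "repair" because it still holds the raw bytes, whereas a decode/re-encode cycle can destroy them. A minimal illustration (the mis-detected encoding is invented for the example):

```python
original = "café".encode("utf-8")                     # b'caf\xc3\xa9'
# Suppose detection wrongly picked ASCII and we decoded leniently:
garbled = original.decode("ascii", errors="replace")  # 'caf' + two U+FFFD
stored = garbled.encode("utf-8")                      # replacement markers; original bytes gone
# No re-interpretation of `stored` can restore b'\xc3\xa9': the byte values
# were discarded at decode time. A browser keeps the raw response around
# and can simply re-decode it on demand; a ZIM reader cannot.
assert stored != original
```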
Correct me if I'm wrong, I will try to summarize the situation. I consider we should split the problem in two parts. The first part is to correctly process non-UTF-8 encodings like we have on Thales Docs or on https://tmp.kiwix.org/ci/test-website/chinese-encoding.html. Currently this makes the scraper fail; PR #183 fixes this. I think it is already an important step forward and we could merge it once it is ready, i.e. once it properly covers the situations where the file contains only valid characters but in a non-UTF-8 encoding. The second part and remaining question is what to do for a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where:
I transferred the second part to #185.
With the latest zimit2 image (including warc2zim 2.0.0-dev3), we face a new issue around invalid UTF-8 in an HTML file (I processed locally the full crawl I recently uploaded at https://tmp.kiwix.org/ci/test-warc/thalesdoc_en_all_full_2024-02-04/).
Last log lines: