Correctly detect encoding and decode bytes #183
Conversation
Force-pushed from 71f6aa2 to 2d0f57a
Codecov Report
Attention:
Additional details and impacted files

@@             Coverage Diff             @@
##           warc2zim2     #183      +/-   ##
=============================================
+ Coverage      87.08%   87.37%   +0.29%
=============================================
  Files             13       13
  Lines            867      895      +28
  Branches         149      157       +8
=============================================
+ Hits             755      782      +27
  Misses            96       96
- Partials          16       17       +1

☔ View full report in Codecov by Sentry.
Thank you.
Could you confirm you tested this change successfully on the full Thales WARCs? (You do not mention it explicitly in the PR.)
Please add sufficient testing and fix CI (it did not run at first because you changed the base branch too quickly after creating the PR; I have already hit this bug, so I just amended your last commit and force-pushed to trigger the workflow).
I find the current behavior a bit misleading, especially compared to the PR description and the inline comments: if the WARC record's content encoding is wrong and an encoding is declared in the file, we will only try that declared encoding; we won't try the one detected by chardet.
I think it would make more sense to:
- first try the encoding of the Warc Record
- if this fails or there is none, then try all the encodings found in the file
- if this fails or there is none, then try all the encodings detected by chardet
The code should also be adapted so it does not "test" the same encoding twice for a given content: if the WARC record and the file declare the same encoding, we currently try the same failing operation twice. The same problem occurs if a file declares the same encoding twice (this shouldn't happen, but avoiding the duplicate attempt is a low price to pay).
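Something like this rough sketch (the function name and signature are hypothetical, not warc2zim's actual API, and chardet.detect_all assumes chardet ≥ 4.0):

```python
import chardet

def decode_content(
    content: bytes,
    record_encoding: str | None,
    file_encodings: list[str],
) -> str:
    # Candidates in priority order: the WARC record header first, then any
    # encodings declared inside the file, then everything chardet suggests.
    candidates = [record_encoding] if record_encoding else []
    candidates += file_encodings
    candidates += [
        result["encoding"]
        for result in chardet.detect_all(content)
        if result["encoding"]
    ]
    # dict.fromkeys deduplicates while preserving order, so an encoding
    # reported by several sources is only tried once.
    for encoding in dict.fromkeys(candidates):
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: fall back to the next one
    raise ValueError("no candidate encoding could decode the content")
```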
It clearly lacks tests. We know this is a fragile area. Maybe we should add a test WARC to ensure we're testing the overall process.
Ah, and I forgot the reason I looked at this 😅 The PR does not explain what we're trying to fix here. The ticket is just a report of warc2zim crashing, with no answer. Why is it failing? I guess we need to decode to rewrite it. Are we storing the result in the original encoding, or in UTF-8 as the ZIM spec requests?
It is, yes.
The specification says otherwise: we should first use the encoding declared in the header, then the one in the meta tag, then heuristics. But we may indeed try heuristics even if a (wrong) encoding is declared in the meta tag. I will update the code.
I have put a comment in the issue. The problem is that the input content may not be encoded in UTF-8. We have to be "smart" to select the right encoding. We still use UTF-8 encoding when we save the content in the ZIM file. This PR is just about decoding the content in the WARC.
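For illustration, collecting the encodings declared at the beginning of the content could look like this (the regex and helper name are hypothetical, not necessarily what the scraper uses):

```python
import re

# Encodings declared in the content itself: an HTML <meta charset=...>
# (including the http-equiv Content-Type form) and an XML declaration.
# Only the beginning of the payload is scanned, since declarations must
# appear early in the document.
DECLARED_ENCODING_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([a-zA-Z0-9_\-]+)"""
    rb"""|<\?xml[^>]+encoding\s*=\s*["']([a-zA-Z0-9_\-]+)""",
    re.IGNORECASE,
)

def encodings_declared_in(content: bytes) -> list[str]:
    found: list[str] = []
    for match in DECLARED_ENCODING_RE.finditer(content[:4096]):
        charset = (match.group(1) or match.group(2)).decode("ascii")
        if charset not in found:  # keep each declared encoding only once
            found.append(charset)
    return found
```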
Force-pushed from 2d0f57a to ecac13a
Thank you for these changes, we are moving in the right direction.
We still need a few more tests (especially one which makes to_string and the scraper crash due to "mixed encoding" in the file; you might reuse the content I built for the test website).
LGTM, @rgaudin could you make a second pass please?
LGTM; a simple format suggestion.
Force-pushed from 028a1d9 to 14d9cf9
This tries to get the encoding from the record headers only.
- First try to use the encoding declared in the headers (if available).
- Then search for an encoding declared at the beginning of the content.
- Finally use chardet to detect the content encoding.
Even chardet may return an invalid encoding. By looping over all potential encodings detected by chardet, we are more tolerant.
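For reference, chardet.detect returns only the single best guess, while chardet.detect_all (available since chardet 4.0) returns every plausible candidate; looping over the latter is what makes this step tolerant. A minimal sketch, with a hypothetical helper name:

```python
import chardet

def detect_and_decode(raw: bytes) -> str | None:
    # chardet.detect(raw) would give only the single best guess, which can
    # itself be an encoding the payload does not actually decode with.
    for result in chardet.detect_all(raw):
        encoding = result["encoding"]
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # try the next candidate instead of giving up
    return None
```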
…set. We cannot fully trust the content here.
Doing this, I have added a test (and a fix) for a declared encoding that does not exist.
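A hedged sketch of what such a test can look like, assuming pytest as the runner and reusing the hypothetical decode_content helper sketched earlier in this thread (the real test and function names may differ):

```python
def test_declared_encoding_does_not_exist():
    # "not-a-real-charset" is unknown to Python's codecs, so decoding with
    # it raises LookupError; the helper must swallow that and fall back to
    # the encodings found in the file, then to chardet's candidates.
    # Plain ASCII content keeps the detection step deterministic.
    assert decode_content(b"hello", "not-a-real-charset", []) == "hello"
```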
Force-pushed from 14d9cf9 to 2a9c54b
Last suggestion of @rgaudin applied. Rebased on last
This PR tries to detect the encoding using 3 methods:
- use the encoding declared in the headers (if available);
- otherwise, search for an encoding declared at the beginning of the content;
- otherwise, use chardet to detect the content encoding.
Fix #177