Correctly detect encoding and decode bytes #183

mgautierfr · 2024-02-13T10:32:01Z

This PR tries to detect the encoding using three methods:

From the HTTP headers in the WARC record
From the encoding declaration inside the content
From statistical analysis of the content (done by chardet)
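
As an illustration of the first two methods, here is a minimal sketch using only the standard library. The helper names `charset_from_content_type` and `charset_from_content`, the regular expression, and the 1024-byte window are assumptions made for this example; they are not taken from the warc2zim code. The statistical step is left out because it depends on chardet.

```python
import re
from email.message import Message
from typing import Optional

# Hypothetical pattern: looks for charset=... inside a <meta> tag.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*charset=["']?\s*([a-zA-Z0-9_\-]+)""", re.IGNORECASE
)


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset()


def charset_from_content(content: bytes, head_size: int = 1024) -> Optional[str]:
    """Return the charset declared near the beginning of the content, if any."""
    match = META_CHARSET_RE.search(content[:head_size])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Both helpers only report what is declared; neither checks that the declared encoding exists or that the content can actually be decoded with it.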

codecov · 2024-02-13T13:10:24Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (39efc42) 87.08% compared to head (2a9c54b) 87.37%.
Report is 1 commits behind head on warc2zim2.

Files	Patch %	Lines
src/warc2zim/utils.py	97.50%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##           warc2zim2     #183      +/-   ##
=============================================
+ Coverage      87.08%   87.37%   +0.29%     
=============================================
  Files             13       13              
  Lines            867      895      +28     
  Branches         149      157       +8     
=============================================
+ Hits             755      782      +27     
  Misses            96       96              
- Partials          16       17       +1

benoit74

Thank you.

Could you confirm you tested this change successfully on Thales full WARCs? (you do not mention it explicitly in the PR).

Please add sufficient testing and fix CI (it did not run at first because you changed the base branch too quickly after creating the PR; I have hit this bug before, so I just amended your last commit and force-pushed to trigger the workflow).

I find the current behavior a bit misleading, especially compared to the PR description and the inline comments: if the Warc Record content encoding is wrong and there is a content encoding written in the file, we will only try this encoding. We won't try the one detected by chardet.

I think it would make more sense to:

first try the encoding of the Warc Record
if this fails or there is none, then try all the encodings found in the file
if this fails or there is none, then try all the encodings detected by chardet

The code should also be adapted to not "test" the same encoding twice for a given content. Here, if the WARC record and the file give the same encoding, we will try the same failing operation twice. The same problem occurs if a file declares the same encoding twice (it shouldn't happen, but avoiding the repeated operation costs very little).
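
A minimal sketch of trying candidate encodings in order while never testing the same one twice. The function name `decode_with_candidates` is hypothetical and this is not the PR's implementation; it normalizes names with `codecs.lookup` so that aliases such as `utf8` and `UTF-8` count as the same encoding, and it skips declared encodings that do not exist.

```python
import codecs
from typing import Iterable, Optional


def decode_with_candidates(content: bytes, candidates: Iterable[Optional[str]]) -> str:
    """Decode content with the first candidate encoding that works."""
    tried = set()
    for encoding in candidates:
        if not encoding:
            continue  # source gave no encoding (e.g. no charset in headers)
        try:
            # Canonical codec name, so aliases are deduplicated.
            name = codecs.lookup(encoding).name
        except LookupError:
            continue  # declared encoding does not exist
        if name in tried:
            continue  # already failed with this encoding
        tried.add(name)
        try:
            return content.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("content could not be decoded with any candidate encoding")
```

A caller would pass the candidates in priority order, for example the header encoding first, then the encodings found in the content, then the detected ones.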

rgaudin

It clearly lacks tests. We know this is a fragile area. Maybe we should add a test warc to ensure we're testing the overall process.

src/warc2zim/utils.py

rgaudin · 2024-02-13T14:50:27Z

Ah, and I forgot the reason I looked at this 😅 The PR does not explain what we're trying to fix here. The ticket is just a report of warc2zim crashing, with no answer.

Why is it failing? I guess we need to decode to rewrite it. Are we storing the result in original encoding or UTF-8 as ZIM spec requests?

mgautierfr · 2024-02-13T15:52:03Z

Could you confirm you tested this change successfully on Thales full WARCs? (you do not mention it explicitly in the PR).

Yes, I did.

I think it would make more sense to:

first try the encoding of the Warc Record

if this fails or there is none, then try all the encodings found in the file

if this fails or there is none, then try all the encodings detected by chardet

Specification says otherwise. We should first use the encoding declared in the header, then the one in the meta tag then heuristic.

But we may indeed try heuristics even if a (wrong) encoding is declared in the meta tag. I will update code.
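
A minimal sketch of collecting heuristic guesses so they can be appended after the declared encodings. The function name `heuristic_encodings` is hypothetical, and the detector is passed in as a parameter; the assumption is that it returns a list of dicts with `encoding` and `confidence` keys, which is the shape `chardet.detect_all` is expected to produce.

```python
from typing import Callable, Dict, List

Guess = Dict[str, object]


def heuristic_encodings(
    content: bytes, detect_all: Callable[[bytes], List[Guess]]
) -> List[str]:
    """Return detected encodings, most confident first."""
    if not content:
        return []  # nothing to analyse in empty content
    guesses = sorted(
        detect_all(content),
        key=lambda guess: guess.get("confidence") or 0.0,
        reverse=True,
    )
    # A detector may report a guess with no encoding; drop those.
    return [str(guess["encoding"]) for guess in guesses if guess.get("encoding")]
```

Because every guess is kept rather than only the best one, a later decoding step can fall back to the next guess when the first one turns out to be wrong.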

mgautierfr · 2024-02-13T15:57:23Z

Ah, and I forgot the reason I looked at this 😅 The PR does not explain what we're trying to fix here. The ticket is just a report of warc2zim crashing, with no answer.

Why is it failing? I guess we need to decode to rewrite it. Are we storing the result in original encoding or UTF-8 as ZIM spec requests?

I have put a comment in the issue. The problem is that input content may not be encoded in utf8. We have to be "smart" to select the right encoding.

We still use utf8 encoding when we save the content in the zim file. This PR is just about decoding content in the warc.
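
To make the distinction concrete, a small sketch of the two steps: decode with the source encoding, then encode as UTF-8 for storage. The function name `transcode_to_utf8` is an assumption for this example only.

```python
def transcode_to_utf8(content: bytes, source_encoding: str) -> bytes:
    """Decode bytes from their source encoding and re-encode them as UTF-8."""
    text = content.decode(source_encoding)  # the step this PR is about
    return text.encode("utf-8")  # what is stored, regardless of the source
```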

benoit74

Thank you for these changes, we are moving in the right direction.

We still need a few more tests (especially the one that makes to_string and the scraper crash due to "mixed encoding" in the file; you might reuse the content I built for the test website).

src/warc2zim/utils.py

benoit74

LGTM, @rgaudin could you make a second pass please?

rgaudin

LGTM; a simple format suggestion

src/warc2zim/utils.py

mgautierfr · 2024-02-15T13:50:44Z

Last suggestion of @rgaudin applied. Rebased on last warc2zim2 branch.
Ready to be merged.

mgautierfr requested a review from benoit74 February 13, 2024 10:32

mgautierfr changed the base branch from main to warc2zim2 February 13, 2024 10:32

benoit74 force-pushed the decode_to_string branch from 71f6aa2 to 2d0f57a Compare February 13, 2024 13:07

benoit74 requested changes Feb 13, 2024

rgaudin requested changes Feb 13, 2024

benoit74 mentioned this pull request Feb 13, 2024

Failure of thalesdoc: invalid UTF-8 start byte in an HTML file #177

Closed

mgautierfr force-pushed the decode_to_string branch from 2d0f57a to ecac13a Compare February 14, 2024 09:34

mgautierfr requested review from rgaudin and benoit74 February 14, 2024 09:51

benoit74 requested changes Feb 14, 2024

benoit74 mentioned this pull request Feb 14, 2024

Zimit2: do not crash when a rewriten resource (HTML/CSS/JS) has multiple encoding inside #185

Closed

mgautierfr requested a review from benoit74 February 14, 2024 14:27

benoit74 reviewed Feb 14, 2024

benoit74 approved these changes Feb 14, 2024

rgaudin approved these changes Feb 14, 2024

mgautierfr force-pushed the decode_to_string branch from 028a1d9 to 14d9cf9 Compare February 15, 2024 13:47

mgautierfr added 10 commits February 15, 2024 14:49

Url normalize to not take a bytes as input.

7dcb552

HtmlRewriter::rewrite do not take a bytes as input.

ba5a397

Introduce get_record_content_type as a helper function.

ffc27a1

Introduce get_record_encoding.

81a493d

This try to get the encoding from the record headers only.

Properly decode content.

5d8782d

- First try to use the declared encoding in headers (if available).
- Then search for encoding declared at the beginning of content.
- Finally use chardet to detect the content encoding.

Do not try to detect encoding of empty content.

0eb6eae

Try several encoding.

6aef282

Even chardet may return a invalid encoding. By looping on all potential encodings detected by chardet, we are more tolerant.

Use heuristics to detect the encoding even if we have a declared char…

0dca2eb

…set. We cannot fully trust the content here.

Do not try the same encoding twice.

99e9e2b

Add tests on encoding detection (to_string).

2a9c54b

Doing this, I have added test (and fix) on declared encoding not existing.

mgautierfr force-pushed the decode_to_string branch from 14d9cf9 to 2a9c54b Compare February 15, 2024 13:49

benoit74 merged commit 6419d00 into warc2zim2 Feb 16, 2024
6 checks passed

benoit74 deleted the decode_to_string branch February 16, 2024 09:01
