Correctly detect encoding and decode bytes #183

Merged
merged 10 commits into from
Feb 16, 2024


mgautierfr
Contributor

This PR tries to detect the encoding using three methods:

  • From the HTTP headers in the WARC record
  • From the encoding declaration inside the content
  • From statistical analysis of the content (done by chardet)
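
A minimal sketch of the first two sources, for illustration only. The helper names and regular expressions are assumptions and are not taken from `src/warc2zim/utils.py`; the third source (statistical analysis) is left out here because it relies on the third-party chardet package.

```python
import re
from typing import Optional

# Hypothetical patterns, written for this sketch only.
CHARSET_IN_HEADER = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
ENCODING_IN_CONTENT = re.compile(
    rb"(?:charset|encoding)\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    if not content_type:
        return None
    match = CHARSET_IN_HEADER.search(content_type)
    return match.group(1) if match else None


def charset_from_content(content: bytes, head_size: int = 1024) -> Optional[str]:
    """Return the encoding declared near the start of the content, if any.

    Only the first `head_size` bytes are scanned, which covers both an
    HTML <meta> declaration and an XML prolog in typical documents.
    """
    match = ENCODING_IN_CONTENT.search(content[:head_size])
    # The bytes pattern only matches ASCII characters, so this decode is safe.
    return match.group(1).decode("ascii") if match else None
```

Both helpers return `None` when nothing is declared, so the caller can move on to the next source.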

Fix #177

@mgautierfr mgautierfr changed the base branch from main to warc2zim2 February 13, 2024 10:32

codecov bot commented Feb 13, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (39efc42) 87.08% compared to head (2a9c54b) 87.37%.
Report is 1 commits behind head on warc2zim2.

Files | Patch % | Lines
src/warc2zim/utils.py | 97.50% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #183      +/-   ##
=============================================
+ Coverage      87.08%   87.37%   +0.29%     
=============================================
  Files             13       13              
  Lines            867      895      +28     
  Branches         149      157       +8     
=============================================
+ Hits             755      782      +27     
  Misses            96       96              
- Partials          16       17       +1     

Collaborator

@benoit74 benoit74 left a comment

Thank you.

Could you confirm you tested this change successfully on Thales full WARCs? (you do not mention it explicitly in the PR).

Please add sufficient testing and fix CI (it did not run at first because you changed the base branch too quickly after creating the PR; I have already hit this bug, so I just amended your last commit and force-pushed to trigger the workflow).

I find the current behavior a bit misleading, especially compared to the PR description and the inline comments: if the Warc Record content encoding is wrong and there is a content encoding written in the file, we will only try this encoding. We won't try the one detected by chardet.

I think it would make more sense to:

  • first try the encoding of the Warc Record
  • if this fails or there is none, then try all the encodings found in the file
  • if this fails or there is none, then try all the encodings detected by chardet

Code should also be adapted to not "test" the same encoding twice (for a given content): here, if the Warc record and the file give the same encoding, we will try the same failing operation twice. Same problem if a file declares the same encoding twice (this shouldn't happen, but avoiding the repeated operation costs very little).
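
The ordering and de-duplication described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function names are hypothetical, and the candidates are expected to arrive already ordered (record header first, then declarations found in the file, then chardet guesses).

```python
import codecs
from typing import Iterable, Iterator, Optional


def unique_encodings(candidates: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield each usable encoding once, keeping the original order.

    Aliases such as "utf8" and "UTF-8" resolve to the same codec, so they
    are treated as one candidate. Missing or unknown names are skipped.
    """
    seen = set()
    for name in candidates:
        if not name:
            continue
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            # The name does not match any codec known to Python.
            continue
        if canonical not in seen:
            seen.add(canonical)
            yield name


def decode_in_order(content: bytes, candidates: Iterable[Optional[str]]) -> str:
    """Decode with the first candidate that works, never retrying a codec."""
    for encoding in unique_encodings(candidates):
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the content")
```

Comparing canonical codec names rather than raw strings means a header saying `UTF-8` and a meta tag saying `utf8` lead to a single decode attempt.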

Member

@rgaudin rgaudin left a comment

It clearly lacks tests. We know this is a fragile area. Maybe we should add a test warc to ensure we're testing the overall process.

@rgaudin
Member

rgaudin commented Feb 13, 2024

Ah, and I forgot the reason I looked at this 😅 The PR does not explain what we're trying to fix here. The ticket is just a report of warc2zim crashing, with no answer.

Why is it failing? I guess we need to decode to rewrite it. Are we storing the result in original encoding or UTF-8 as ZIM spec requests?

@mgautierfr
Contributor Author

mgautierfr commented Feb 13, 2024

Could you confirm you tested this change successfully on Thales full WARCs? (you do not mention it explicitly in the PR).

Yes, I did.

I think it would make more sense to:

  • first try the encoding of the Warc Record
  • if this fails or there is none, then try all the encodings found in the file
  • if this fails or there is none, then try all the encodings detected by chardet

The specification says otherwise: we should first use the encoding declared in the header, then the one in the meta tag, then heuristics.

But we may indeed try heuristics even if a (wrong) encoding is declared in the meta tag. I will update code.

@mgautierfr
Contributor Author

Ah, and I forgot the reason I looked at this 😅 The PR does not explain what we're trying to fix here. The ticket is just a report of warc2zim crashing, with no answer.

Why is it failing? I guess we need to decode to rewrite it. Are we storing the result in original encoding or UTF-8 as ZIM spec requests?

I have put a comment in the issue. The problem is that input content may not be encoded in utf8. We have to be "smart" to select the right encoding.

We still use utf8 encoding when we save the content in the zim file. This PR is just about decoding content in the warc.
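
As a small illustration of that split between decoding and saving (a sketch only; `to_utf8` is a hypothetical name, not a function from warc2zim), the detected encoding is needed only to get from bytes to text, and the output is always UTF-8:

```python
def to_utf8(content: bytes, source_encoding: str) -> bytes:
    """Decode bytes using their original encoding, then re-encode as UTF-8."""
    text = content.decode(source_encoding)
    return text.encode("utf-8")
```

For example, `to_utf8("café".encode("iso-8859-1"), "iso-8859-1")` gives the UTF-8 bytes of the same text, whatever encoding the WARC record used.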

Collaborator

@benoit74 benoit74 left a comment

Thank you for these changes, we are moving in the right direction.

We still need a few more tests (especially the one which makes to_string and the scraper crash due to "mixed-encoding" in the file; you might reuse the content I built for the test website).

Collaborator

@benoit74 benoit74 left a comment

LGTM, @rgaudin could you make a second pass please?

Member

@rgaudin rgaudin left a comment

LGTM; a simple format suggestion

This tries to get the encoding from the record headers only.
- First try to use the encoding declared in the headers (if available).
- Then search for an encoding declared at the beginning of the content.
- Finally use chardet to detect the content encoding.
Even chardet may return an invalid encoding.

By looping on all potential encodings detected by chardet, we are more
tolerant.
…set.

We cannot fully trust the content here.
Doing this, I have added a test (and a fix) for a declared encoding that does not exist.
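
One possible reading of the commit notes above about looping over chardet's guesses, as a sketch rather than the merged code. It assumes the guesses arrive as a list of dicts with `"encoding"` and `"confidence"` keys, which is the shape `chardet.detect_all` is understood to return; the function name is hypothetical, and chardet itself is not imported so the example stays self-contained.

```python
from typing import Dict, List, Optional


def decode_with_detections(content: bytes, detections: List[Dict]) -> Optional[str]:
    """Try each detected encoding, most confident first.

    Returns the decoded text, or None when no guess works. A guess is
    skipped if it has no encoding, names a codec Python does not know
    (LookupError), or fails to decode the content (UnicodeDecodeError).
    """
    ranked = sorted(
        detections, key=lambda guess: guess.get("confidence") or 0.0, reverse=True
    )
    for guess in ranked:
        encoding = guess.get("encoding")
        if not encoding:
            continue
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return None
```

Catching `LookupError` next to `UnicodeDecodeError` covers the case where a guessed or declared encoding name does not correspond to any codec.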
@mgautierfr
Contributor Author

Last suggestion from @rgaudin applied. Rebased on the latest warc2zim2 branch.
Ready to be merged.

@benoit74 benoit74 merged commit 6419d00 into warc2zim2 Feb 16, 2024
6 checks passed
@benoit74 benoit74 deleted the decode_to_string branch February 16, 2024 09:01
Development

Successfully merging this pull request may close these issues.

Failure of thalesdoc: invalid UTF-8 start byte in an HTML file
3 participants