Fix detection of encoding again #314

benoit74 · 2024-06-14T07:48:34Z

Changes

- First, try to find a charset declaration in the first 1024 bytes of the document, decoding them as ASCII, UTF-16 or UTF-32 and searching with a regex; if found, decode the document with this encoding in 'replace' mode (this always succeeds; if the declared encoding is wrong, that is not our fault).
  - NOTE: also trying UTF-16 and UTF-32 is a slight change compared to the issue proposal, because the first bytes of a UTF-16 or UTF-32 document cannot be decoded properly as ASCII (these encodings use at least 16 bits or 32 bits per character, even for ASCII characters).
- If no charset is found in the first 1024 bytes of the document, search for a charset in the HTTP Content-Type header; if found, decode with this encoding in 'replace' mode (same reasoning as above).
- If no charset is found in the document or the HTTP header, we have to guess a little (this situation is quite common for JS and CSS): try UTF-8 in 'strict' mode; if it fails, try ISO-8859-1 in 'strict' mode; if it fails, stop the scraper (we cannot guess the encoding when it is specified neither in the document nor in the HTTP header). This list of charsets to try should probably even be exposed as a scraper parameter --guessing-charsets, so that it can be tweaked when needed.
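
The three steps above can be sketched roughly as follows. This is only an illustration of the described order of operations, not the code merged in this PR: the names `find_charset_in_head`, `decode`, `CHARSET_RE` and `HEAD_ENCODINGS` are hypothetical, and the regex is a simplified assumption of what a charset declaration looks like.

```python
import re
from typing import Optional, Sequence

# Hypothetical, simplified pattern for a charset declaration
# (e.g. <meta charset="utf-8"> or "text/html; charset=utf-8").
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)

# Encodings used to read the head of the document before searching it.
HEAD_ENCODINGS = ("ascii", "utf-16", "utf-32")


def find_charset_in_head(content: bytes) -> Optional[str]:
    """Search the first 1024 bytes for a charset declaration."""
    head = content[:1024]
    for encoding in HEAD_ENCODINGS:
        # 'replace' never raises on undecodable bytes, so every encoding can be tried
        text = head.decode(encoding, errors="replace")
        match = CHARSET_RE.search(text)
        if match:
            return match.group(1)
    return None


def decode(
    content: bytes,
    http_content_type: Optional[str] = None,
    charsets_to_try: Sequence[str] = ("UTF-8", "ISO-8859-1"),
) -> str:
    # Step 1: charset declared in the document itself
    charset = find_charset_in_head(content)
    # Step 2: charset declared in the HTTP Content-Type header
    if charset is None and http_content_type:
        match = CHARSET_RE.search(http_content_type)
        charset = match.group(1) if match else None
    if charset:
        # Trust the declaration; undecodable bytes become U+FFFD.
        # (An unknown codec name would raise LookupError; not handled in this sketch.)
        return content.decode(charset, errors="replace")
    # Step 3: nothing declared, try each candidate in 'strict' mode
    for candidate in charsets_to_try:
        try:
            return content.decode(candidate, errors="strict")
        except UnicodeDecodeError:
            continue
    raise ValueError("No suitable charset found to decode content")
```

For example, `decode(b"caf\xe9", "text/css; charset=iso-8859-1")` takes the header path, while `decode(b"caf\xe9")` falls through to the guessing list. Note that in Python the ISO-8859-1 codec maps every byte value, so with the default list shown here the final error is only reachable when a different list is supplied.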

codecov · 2024-06-14T07:50:42Z

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 84.49%. Comparing base (4c12681) to head (9933304).

❗ Current head 9933304 differs from pull request most recent head b1c8a35

Please upload reports for the commit b1c8a35 to get more accurate results.

Files	Patch %	Lines
src/warc2zim/utils.py	91.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
+ Coverage   84.06%   84.49%   +0.42%     
==========================================
  Files          14       14              
  Lines        1268     1238      -30     
  Branches      249      245       -4     
==========================================
- Hits         1066     1046      -20     
+ Misses        155      149       -6     
+ Partials       47       43       -4


benoit74 · 2024-06-14T15:26:32Z

CodeFactor is complaining about files which should be ignored since they are outside our scope, these are sample files from online websites.

rgaudin · 2024-06-14T15:31:47Z

Can you add exceptions then? Either inline, via a config file, or via the CodeFactor UI.

rgaudin

Can we rename guessed_charsets to charsets_to_try?

Good luck with Codefactor 😀

benoit74 · 2024-06-14T15:48:15Z

> Can we rename guessed_charsets to charsets_to_try?

Sure

> Good luck with Codefactor 😀

I will kill it ^^ It worked once; I assumed my last modification was OK, so I cleaned up everything I had left over, and it stopped working. I rolled back to what was working, and it is still not working. But I will nail it ^^

benoit74 · 2024-06-14T16:10:48Z

Codefactor reads its configuration from the main branch, so you have to merge to main first before you can review the PR ...

rgaudin · 2024-06-14T16:19:05Z

Go ahead

kelson42 · 2024-06-14T17:47:07Z

@benoit74 Bravo... and bon courage!

benoit74 · 2024-06-14T18:09:15Z

> Go ahead

I already pushed just the configuration to the main branch. My comment was more for posterity. I still have to rewrite this branch to change the arg name and simplify the commits.

> Bravo... and bon courage!

We will end up nailing this down ^^

benoit74 self-assigned this Jun 14, 2024

benoit74 force-pushed the characters_encoding branch from 09c6b58 to 96a31ae Compare June 14, 2024 07:54

benoit74 mentioned this pull request Jun 14, 2024

Automated encoding detection is still not working properly #312

Closed

benoit74 force-pushed the characters_encoding branch from 96a31ae to 22c3113 Compare June 14, 2024 08:57

benoit74 marked this pull request as ready for review June 14, 2024 15:26

benoit74 requested a review from rgaudin June 14, 2024 15:26

benoit74 force-pushed the characters_encoding branch 5 times, most recently from e91bb68 to 15b5cf3 Compare June 14, 2024 15:44

rgaudin approved these changes Jun 14, 2024

benoit74 force-pushed the characters_encoding branch from 15b5cf3 to 9ba14e8 Compare June 14, 2024 15:45

benoit74 force-pushed the characters_encoding branch 13 times, most recently from 9ed16d9 to a1f6fdd Compare June 14, 2024 16:06

benoit74 force-pushed the main branch from c5c0b17 to 4c12681 Compare June 14, 2024 16:08

benoit74 force-pushed the characters_encoding branch 3 times, most recently from 9e7dcd6 to 9933304 Compare June 14, 2024 16:10

Decode content bytes only with supplied charset or static list of charsets to try

b1c8a35

benoit74 force-pushed the characters_encoding branch from 9933304 to b1c8a35 Compare June 17, 2024 07:25

benoit74 merged commit 4abf952 into main Jun 17, 2024
5 checks passed

benoit74 deleted the characters_encoding branch June 17, 2024 07:27

benoit74 mentioned this pull request Jul 26, 2024

Zimit2: do not crash when a rewriten resource (HTML/CSS/JS) has multiple encoding inside #185

Closed
