Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix detection of encoding again #314

Merged
merged 1 commit into from
Jun 17, 2024
Merged

Fix detection of encoding again #314

merged 1 commit into from
Jun 17, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Jun 14, 2024

Fix #312

Changes

  • first try to find charset in document first 1024 bytes with the ascii or UTF-16 or UTF-32 encoding and regex ; if found, decode with this encoding in 'replace' mode (always going to work, do not care if encoding specified was wrong, not our fault)
    • NOTA: or UTF-16 or UTF-32 is a slight change compared to issue proposal because with UTF-16 and UTF-32 first bytes cannot be decoded to ascii properly (they always use at least 16 bytes or 32 bytes even for ascii characters)
  • if no charset in found in document first 1024 bytes, search for charset in HTTP Content-Type header ; if found, decode with this encoding in 'replace' mode (always going to work, do not care if encoding specified was wrong, not our fault)
  • if no charset found in document or HTTP header, we will have to guess a little bit (situation is quite common for JS and CSS); try UTF-8 in 'strict' mode ; if it fails, try ISO-8859-1 in 'strict' mode ; if it fails, stop the scraper (we cannot guess encoding when not specified in document or HTTP header) ; this list of charset to try should even probably be exposed as a scraper parameter --guessing-charsets so that it could be possible when needed to tweak this list when needed

@benoit74 benoit74 self-assigned this Jun 14, 2024
Copy link

codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 84.49%. Comparing base (4c12681) to head (9933304).

Current head 9933304 differs from pull request most recent head b1c8a35

Please upload reports for the commit b1c8a35 to get more accurate results.

Files Patch % Lines
src/warc2zim/utils.py 91.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
+ Coverage   84.06%   84.49%   +0.42%     
==========================================
  Files          14       14              
  Lines        1268     1238      -30     
  Branches      249      245       -4     
==========================================
- Hits         1066     1046      -20     
+ Misses        155      149       -6     
+ Partials       47       43       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74
Copy link
Collaborator Author

CodeFactor is complaining about files which should be ignored since they are outside our scope, these are sample files from online websites.

@benoit74 benoit74 requested a review from rgaudin June 14, 2024 15:26
@rgaudin
Copy link
Member

rgaudin commented Jun 14, 2024

Can you add exceptions then? Either inline, via a config file or via Codefactor UI

@benoit74 benoit74 force-pushed the characters_encoding branch 5 times, most recently from e91bb68 to 15b5cf3 Compare June 14, 2024 15:44
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename guessed_charsets with charsets_to_try ?

Good luck with Codefactor 😀

@benoit74
Copy link
Collaborator Author

Can we rename guessed_charsets with charsets_to_try ?

Sure

Good luck with Codefactor 😀

I will kill it ^^ It worked once, I supposed my last modification was ok so I cleaned up everything I've left over, it's not working anymore, I rolled-back to what was working, it is not working anymore. But I will nail it ^^

@benoit74 benoit74 force-pushed the characters_encoding branch 13 times, most recently from 9ed16d9 to a1f6fdd Compare June 14, 2024 16:06
@benoit74
Copy link
Collaborator Author

Codefactor issues its configuration from main branch, so you have to merge first to main before you can review the PR ...

@rgaudin
Copy link
Member

rgaudin commented Jun 14, 2024

Go ahead

@kelson42
Copy link
Contributor

@benoit74 Bravo... and bon courage!

@benoit74
Copy link
Collaborator Author

Go ahead

I already pushed only the configuration to main branch. My comment was more for "the posterity". I still have to rewrite this branch to change arg name + simplify commits.

Bravo... and bon courage!

We will finish by nailing this down ^^

@benoit74 benoit74 merged commit 4abf952 into main Jun 17, 2024
5 checks passed
@benoit74 benoit74 deleted the characters_encoding branch June 17, 2024 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automated encoding detection is still not working properly
3 participants