Detect and preserve encoding when revising files #62

Open · wants to merge 6 commits into base: main

Conversation

@falquaddoomi (Collaborator) commented Oct 9, 2024

This PR addresses issue #61, in which a user reported errors when reading GBK-encoded files, e.g. files containing Chinese characters. To address the issue, I first run chardet.detect() on each input file, then use the resulting encoding when reading and writing the file. The PR includes a test verifying that a few GBK-encoded characters make it through the revision process.

This PR introduces a dependency on chardet in order to detect the encodings of input files.

EDIT: In the process of having the PR reviewed, we decided to switch to charset_normalizer, which offers similar functionality to chardet but with greater accuracy and speed. Thanks @d33bs for the suggestion!

Closes #61.

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
@falquaddoomi (Collaborator, Author) commented Oct 9, 2024:

FYI, I'm currently looking into how much of each file we need to read. Reading the entire thing is the safest choice, but chardet might be able to accurately detect the encoding with less data.

Alternatively, I may switch to using UniversalDetector (https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage); I'll have to experiment with the confidence level we should use before stopping. (FWIW, this would all be a lot easier to decide if we had the failing markdown files, so hopefully we'll get those soon.)
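For reference, a minimal sketch of the incremental UniversalDetector approach might look like the following (reusing the input_filepath / src_encoding names from the diff above; this is an illustration, not what the PR currently does):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(input_filepath, "rb") as raw:
    # feed the file in chunks; the detector sets .done once it is
    # confident enough, so the whole file often isn't needed
    for chunk in raw:
        detector.feed(chunk)
        if detector.done:
            break
detector.close()
src_encoding = detector.result["encoding"]  # .result also carries a confidence score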

Reply from the issue reporter:

OK, understood. I have revised line 283:

[attached screenshot of the revised code]

with open(input_filepath, "r", encoding='utf-8') as infile, open(output_filepath, "w", encoding='utf-8') as outfile:

@d33bs (Collaborator) left a comment:

Nice job! This looks like a solid change to address the issue that was brought up. I left a few comments about various considerations. Additionally, you might consider pulling in the most recent changes from main, which I believe would allow this PR to observe and document passing tests prior to a merge.

Review threads:
- setup.py (outdated, resolved)
- tests/test_editor.py (outdated, resolved)
- tests/test_editor.py (resolved)
- tests/test_editor.py (outdated, resolved)
- tests/test_editor.py (outdated, resolved)
- libs/manubot_ai_editor/editor.py (resolved)

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
@d33bs (Collaborator) commented:

I'm unsure how large the files passed in here might be. Would it make sense to consider reading only a portion of the file if it were very large to help conserve time?
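For illustration, a partial-read variant might look like this (the 64 KB cap is a hypothetical value, not something proposed in this PR):

import chardet

DETECTION_SAMPLE_BYTES = 64 * 1024  # hypothetical cap on how much we sample

with open(input_filepath, "rb") as raw:
    sample = raw.read(DETECTION_SAMPLE_BYTES)
# detect from the sample only; confidence may drop for short or mixed-encoding files
src_encoding = chardet.detect(sample)["encoding"]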

@falquaddoomi (Collaborator, Author) replied:

I thought about that (see #62 (comment)), but these files are rarely larger than a few kilobytes, since they're the text of sections of a paper, not binary files. I concluded that the (small, granted) extra engineering effort wasn't worth a difference of milliseconds, especially when it could decrease the accuracy of the encoding detection. Also, the revision process itself takes on the order of seconds to complete since it relies on an external API, so improvements here would IMHO go unnoticed.

I'm of course willing to revise my opinion if we really do have large files that need to be detected or if the difference would be significant. I think your suggestion of using charset_normalizer would also improve both speed and accuracy, so perhaps it'll be enough to switch to it.

@falquaddoomi (Collaborator, Author) commented:

FYI, I ended up switching to charset_normalizer in e31679e, which seems to be working well and is definitely faster.

This statement in their README gave me brief pause, though:

"I don't care about the originating charset encoding, because two different tables can produce two identical rendered string. What I want is to get readable text, the best I can."

My goal is to preserve the source encoding, not just get readable text, although perhaps those two goals aren't so different. I added a somewhat more comprehensive test of Chinese character encoding and didn't find any issues, so I'm going to assume things are ok for now.
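For context, a minimal sketch of how the detection step might look with charset_normalizer (again reusing input_filepath / src_encoding from the diff above; the utf-8 fallback is an assumption for this sketch, not necessarily what e31679e does):

from charset_normalizer import from_bytes

# detect the most likely source encoding from the raw bytes
best_match = from_bytes(input_filepath.read_bytes()).best()
# fall back to utf-8 if detection is inconclusive (assumption for this sketch)
src_encoding = best_match.encoding if best_match is not None else "utf-8"

# reuse the detected encoding when reading and writing the revised file
with open(input_filepath, "r", encoding=src_encoding) as infile, open(output_filepath, "w", encoding=src_encoding) as outfile:
    ...  # the revision step itself is unchanged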
