-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hyphenated words flag as misspelled even when in spelling_word_list_filename #119
Comments
I think this has to do with the way the tokenizer extracts words from the text. It splits on whitespace and punctuation, so the hyphenated word ends up as 2 tokens. If you find a way to change that behavior in PyExpect, or the underlying expect library, I would be happy to talk about configuration options to expose it in this layer, or even making it the default. |
I've hit this issue, too.
I'm not clear about the PyExpect layer. I couldn't find anything about "hyphen" in their limited docs, whereas there is mention of tokenization in PyEnchant. Is this a typo? |
This is related to the same issue as #126 There's an option for the 'en' tokenizer to pass extra characters that should be considered part of a word. It's meant for apostrophes, but could be used for hyphens. It would apply to all words, though, so that may not be ideal in all cases. We could add a configuration option to expose that, default it to whatever the default is for PyEnchant, and then you could add a hyphen to the setting in your project. It's not clear if that option is part of the formal API for all tokenizers, though, and there does not appear to be a way to pass the character list to get_tokenizer(), so I don't know if we could rely on it always being available. Maybe one of you can do the research to determine how practical it is to add the configuration option? |
A quick test reveals success! I edited these lines in my virtual environment, by hard-coding and extending the tuples with a hyphen: https://github.com/pyenchant/pyenchant/blob/main/enchant/tokenize/en.py#L88-L100
I then added the word Can you point me to how would I go about adding this as a configuration option? Side note: make sure to set |
Woohoo!
https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/__init__.py#L34 is an example of defining a string option. The builder The checker creates and initializes the tokenizer at https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/checker.py#L33 I'm not sure how to set up the tokenizer to accept the additional argument. As I said in an earlier comment, it doesn't appear to be part of the public API for accessing a tokenizer. I'm not sure if we're intended to configure the tokenizer after it is instantiated? Perhaps @dmerejkowsky can offer some advice about that.
We could handle that a couple of ways. Long term, I don't think we want to require users to turn off smart quotes, because that would affect how their content is published. I see 2 problems to solve: the characters to have the tokenizer remove and the characters to have it keep. We would want the tokenizer to remove smart double quotes. I saw some logic for unicode processing in PyEnchant, but I didn't study it closely. Does it ignore smart quotes? If not, can we update it? Or give it an option that would cause it to? We could use the Sphinx config option for smart quote handling to control how we configure PyEnchant. For the characters we want to keep, we could conceivably add a smart hyphen and smart single quote (apostrophe) to the allowed characters list, if the ASCII versions are present in that list. The resulting words still might not match the custom word list, though, because the smart characters wouldn't match the ASCII characters. So, perhaps a filter is a better approach? I'll have to give that more thought. Let's focus on getting the extra characters down to the tokenizer, first. Once we have that, we at least have a work-around of disabling smart quotes, and that will give us time to work out a more elegant solution. |
Picking this up again in draft PR #147. Feedback appreciated.
I don't have any idea either. @dmerejkowsky do you have guidance? |
For example the word
multi-regional
is flagged by the spell checker asmulti
being misspelled.I would like to be able to add the whole of
multi-regional
into my wordlist file as allowed, but this does not work.Is this a known issue, and if so are there any workarounds? I've tried escaping the
-
in my wordlist file but it is still ignored.Thanks.
The text was updated successfully, but these errors were encountered: