
Hyphenated words flag as misspelled even when in spelling_word_list_filename #119

Open · ssenior45 opened this issue Feb 9, 2021 · 6 comments


@ssenior45

For example, the word multi-regional is flagged by the spell checker, which reports multi as misspelled.

I would like to be able to add the whole word multi-regional to my wordlist file as allowed, but this does not work.

Is this a known issue, and if so, are there any workarounds? I've tried escaping the - in my wordlist file, but it is still ignored.
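
For reference, this is roughly my setup; spelling_word_list_filename is the extension's documented option, and the file name is just mine:

    # conf.py -- a minimal reproduction sketch
    extensions = ['sphinxcontrib.spelling']
    spelling_word_list_filename = 'spelling_wordlist.txt'
    # spelling_wordlist.txt contains the single line:
    #   multi-regional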

Thanks.

@dhellmann (Member)

I think this has to do with the way the tokenizer extracts words from the text. It splits on whitespace and punctuation, so the hyphenated word ends up as two tokens.
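
A quick way to see the split (a sketch using PyEnchant's get_tokenizer, which yields (word, position) tuples):

    # "multi-regional" comes back as two separate tokens
    from enchant.tokenize import get_tokenizer

    tknzr = get_tokenizer("en_US")
    print([word for word, pos in tknzr("a multi-regional deployment")])
    # ['a', 'multi', 'regional', 'deployment']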

If you find a way to change that behavior in PyExpect, or the underlying expect library, I would be happy to talk about configuration options to expose it in this layer, or even making it the default.

@stevepiercy

I've hit this issue, too.

> If you find a way to change that behavior in PyExpect, or the underlying expect library, I would be happy to talk about configuration options to expose it in this layer, or even making it the default.

I'm not clear about the PyExpect layer. I couldn't find anything about "hyphen" in their limited docs, whereas there is mention of tokenization in PyEnchant. Is this a typo?

@dhellmann (Member)

This is related to the same issue as #126.

There's an option for the 'en' tokenizer to pass extra characters that should be considered part of a word. It's meant for apostrophes, but could be used for hyphens. It would apply to all words, though, so that may not be ideal in all cases. We could add a configuration option to expose that, default it to whatever the default is for PyEnchant, and then you could add a hyphen to the setting in your project.

It's not clear if that option is part of the formal API for all tokenizers, though, and there does not appear to be a way to pass the character list to get_tokenizer(), so I don't know if we could rely on it always being available.
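
Something like this might work if the tokenizer class is instantiated directly, though that bypasses get_tokenizer() entirely (an untested sketch; valid_chars is the option mentioned above):

    # Sketch: pass extra word characters straight to the 'en' tokenizer class
    from enchant.tokenize import en

    tknzr = en.tokenize("a multi-regional deployment", valid_chars=("'", "-"))
    print([word for word, pos in tknzr])
    # ['a', 'multi-regional', 'deployment']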

Maybe one of you can do the research to determine how practical it is to add the configuration option?

@stevepiercy

A quick test reveals success!

I edited these lines in my virtual environment by hard-coding and extending the tuples with a hyphen:

https://github.com/pyenchant/pyenchant/blob/main/enchant/tokenize/en.py#L88-L100

    def _initialize_for_binary(self):
        self._consume_alpha = self._consume_alpha_b
        if self._valid_chars is None:
            self._valid_chars = ("'","-",)  # "-" added alongside the default "'"

    def _initialize_for_unicode(self):
        self._consume_alpha = self._consume_alpha_u
        if self._valid_chars is None:
            # XXX TODO: this doesn't seem to work correctly with the
            # MySpell provider, disabling for now.
            # Allow unicode typographic apostrophe
            # self._valid_chars = (u"'",u"\u2019")
            self._valid_chars = ("'","-",)  # "-" added alongside the default "'"
I then added the word built-in to my spelling_wordlist.txt, ran make spellcheck, and the word no longer shows up as a misspelling.

Can you point me to how I would go about adding this as a configuration option?


Side note: make sure to set smartquotes=False in conf.py (the default is True), or else single quotes and hyphens get converted to their "smart" typographical equivalents before the spellchecker runs, flagging words such as "you've" and "doesn't". I did not see this mentioned in the docs when I did a quick search. Would this be a good thing to include in the docs?
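
For reference, the one-liner (smartquotes is a standard Sphinx option):

    # conf.py -- disable smart typographical substitution (default is True)
    smartquotes = False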

@dhellmann (Member)

> A quick test reveals success!

Woohoo!

> Can you point me to how I would go about adding this as a configuration option?

https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/__init__.py#L34 is an example of defining a string option.
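
Roughly like this; the option name spelling_word_characters is only a placeholder here, not an existing setting:

    # Hypothetical sketch of registering a new option in setup();
    # app.add_config_value(name, default, rebuild) is the standard Sphinx API
    def setup(app):
        app.add_config_value('spelling_word_characters', "'", 'env')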

The builder's init() function is the layer that knows how to access the Sphinx configuration options and pass them to the checker (which wraps PyEnchant), so you'll want to add a new argument to the checker and pass the config option through. https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/builder.py#L39
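
The pass-through might look something like this; the valid_chars argument on the checker does not exist yet and is shown only to illustrate the plumbing, with the other arguments mirroring the existing config options:

    # Hypothetical change inside the builder's init()
    self.checker = SpellingChecker(
        lang=self.config.spelling_lang,
        suggest=self.config.spelling_show_suggestions,
        word_list_filename=self.config.spelling_word_list_filename,
        valid_chars=self.config.spelling_word_characters,  # new argument
    )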

The checker creates and initializes the tokenizer at https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/checker.py#L33

I'm not sure how to set up the tokenizer to accept the additional argument. As I said in an earlier comment, it doesn't appear to be part of the public API for accessing a tokenizer. Are we intended to configure the tokenizer after it is instantiated? Perhaps @dmerejkowsky can offer some advice about that.

> Side note: make sure to set smartquotes=False in conf.py (the default is True), or else single quotes and hyphens get converted to their "smart" typographical equivalents before the spellchecker runs, flagging words such as "you've" and "doesn't". I did not see this mentioned in the docs when I did a quick search. Would this be a good thing to include in the docs?

We could handle that a couple of ways. Long term, I don't think we want to require users to turn off smart quotes, because that would affect how their content is published. I see two problems to solve: the characters we want the tokenizer to remove, and the characters we want it to keep.

We would want the tokenizer to remove smart double quotes. I saw some logic for unicode processing in PyEnchant, but I didn't study it closely. Does it ignore smart quotes? If not, can we update it? Or give it an option that would cause it to? We could use the Sphinx config option for smart quote handling to control how we configure PyEnchant.

For the characters we want to keep, we could conceivably add a smart hyphen and smart single quote (apostrophe) to the allowed characters list, if the ASCII versions are present in that list. The resulting words still might not match the custom word list, though, because the smart characters wouldn't match the ASCII characters. So, perhaps a filter is a better approach? I'll have to give that more thought.
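
A rough sketch of the filter idea, assuming enchant.tokenize's Filter class and its _split() hook (the character mapping is illustrative). One caveat: filters only see tokens after the tokenizer has already split the text, so this may not help on its own:

    # Sketch: normalize typographic characters back to ASCII before lookup
    from enchant.tokenize import Filter, unit_tokenize

    class SmartCharFilter(Filter):
        # U+2019 (right single quote) -> "'", U+2010/U+2011 (hyphens) -> "-"
        _table = str.maketrans({"\u2019": "'", "\u2010": "-", "\u2011": "-"})

        def _split(self, word):
            return unit_tokenize(word.translate(self._table))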

Let's focus on getting the extra characters down to the tokenizer, first. Once we have that, we at least have a work-around of disabling smart quotes, and that will give us time to work out a more elegant solution.

@stevepiercy

Picking this up again in draft PR #147. Feedback appreciated.

> I'm not sure how to set up the tokenizer to accept the additional argument. As I said in an earlier comment, it doesn't appear to be part of the public API for accessing a tokenizer. Are we intended to configure the tokenizer after it is instantiated? Perhaps @dmerejkowsky can offer some advice about that.

I don't have any idea either. @dmerejkowsky do you have guidance?
