
Hyphenated words flag as misspelled even when in spelling_word_list_filename #119

Open · ssenior45 opened this issue Feb 9, 2021 · 6 comments


@ssenior45

For example, the word multi-regional is flagged by the spell checker, which reports multi as misspelled.

I would like to be able to add the whole word multi-regional to my wordlist file as allowed, but this does not work.

Is this a known issue, and if so, are there any workarounds? I've tried escaping the - in my wordlist file, but it is still ignored.
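
For reference, this is roughly my setup; spelling_word_list_filename is the extension's documented option, and the file name is just mine:

    # conf.py -- a minimal reproduction sketch
    extensions = ['sphinxcontrib.spelling']
    spelling_word_list_filename = 'spelling_wordlist.txt'
    # spelling_wordlist.txt contains the single line:
    #   multi-regional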

Thanks.

@dhellmann (Member)

I think this has to do with the way the tokenizer extracts words from the text. It splits on whitespace and punctuation, so the hyphenated word ends up as two tokens.
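
A quick way to see the split (a sketch using PyEnchant's get_tokenizer, which yields (word, position) tuples):

    # "multi-regional" comes back as two separate tokens
    from enchant.tokenize import get_tokenizer

    tknzr = get_tokenizer("en_US")
    print([word for word, pos in tknzr("a multi-regional deployment")])
    # ['a', 'multi', 'regional', 'deployment']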

If you find a way to change that behavior in PyExpect, or the underlying expect library, I would be happy to talk about configuration options to expose it in this layer, or even making it the default.

@stevepiercy

I've hit this issue, too.

> If you find a way to change that behavior in PyExpect, or the underlying expect library, I would be happy to talk about configuration options to expose it in this layer, or even making it the default.

I'm not clear about the PyExpect layer. I couldn't find anything about "hyphen" in their limited docs, whereas there is mention of tokenization in PyEnchant. Is this a typo?

@dhellmann (Member)

This is related to the same issue as #126.

There's an option for the 'en' tokenizer to pass extra characters that should be considered part of a word. It's meant for apostrophes, but could be used for hyphens. It would apply to all words, though, so that may not be ideal in all cases. We could add a configuration option to expose that, default it to whatever the default is for PyEnchant, and then you could add a hyphen to the setting in your project.

It's not clear if that option is part of the formal API for all tokenizers, though, and there does not appear to be a way to pass the character list to get_tokenizer(), so I don't know if we could rely on it always being available.
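
Something like this might work if the tokenizer class is instantiated directly, though that bypasses get_tokenizer() entirely (an untested sketch; valid_chars is the option mentioned above):

    # Sketch: pass extra word characters straight to the 'en' tokenizer class
    from enchant.tokenize import en

    tknzr = en.tokenize("a multi-regional deployment", valid_chars=("'", "-"))
    print([word for word, pos in tknzr])
    # ['a', 'multi-regional', 'deployment']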

Maybe one of you can do the research to determine how practical it is to add the configuration option?

@stevepiercy

A quick test reveals success!

I edited these lines in my virtual environment by hard-coding and extending the tuples with a hyphen:

https://github.com/pyenchant/pyenchant/blob/main/enchant/tokenize/en.py#L88-L100

    def _initialize_for_binary(self):
        self._consume_alpha = self._consume_alpha_b
        if self._valid_chars is None:
            self._valid_chars = ("'","-",)  # "-" added alongside the default "'"

    def _initialize_for_unicode(self):
        self._consume_alpha = self._consume_alpha_u
        if self._valid_chars is None:
            # XXX TODO: this doesn't seem to work correctly with the
            # MySpell provider, disabling for now.
            # Allow unicode typographic apostrophe
            # self._valid_chars = (u"'",u"\u2019")
            self._valid_chars = ("'","-",)  # "-" added alongside the default "'"
I then added the word built-in to my spelling_wordlist.txt, ran make spellcheck, and the word no longer shows up as a misspelling.

Can you point me to how I would go about adding this as a configuration option?


Side note: make sure to set smartquotes=False in conf.py (the default is True), or else single quotes and hyphens get converted to their "smart" typographical equivalents before the spellchecker runs, flagging words such as "you've" and "doesn't". I did not see this mentioned in the docs when I did a quick search. Would this be a good thing to include in the docs?
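
For reference, the one-liner (smartquotes is a standard Sphinx option):

    # conf.py -- disable smart typographical substitution (default is True)
    smartquotes = False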

@dhellmann (Member)

> A quick test reveals success!

Woohoo!

> Can you point me to how I would go about adding this as a configuration option?

https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/__init__.py#L34 is an example of defining a string option.
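
Roughly like this; the option name spelling_word_characters is only a placeholder here, not an existing setting:

    # Hypothetical sketch of registering a new option in setup();
    # app.add_config_value(name, default, rebuild) is the standard Sphinx API
    def setup(app):
        app.add_config_value('spelling_word_characters', "'", 'env')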

The builder's init() function is the layer that knows how to access the Sphinx configuration options and pass them to the checker (which wraps PyEnchant), so you'll want to add a new argument to the checker and pass the config option through. https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/builder.py#L39
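
The pass-through might look something like this; the valid_chars argument on the checker does not exist yet and is shown only to illustrate the plumbing, with the other arguments mirroring the existing config options:

    # Hypothetical change inside the builder's init()
    self.checker = SpellingChecker(
        lang=self.config.spelling_lang,
        suggest=self.config.spelling_show_suggestions,
        word_list_filename=self.config.spelling_word_list_filename,
        valid_chars=self.config.spelling_word_characters,  # new argument
    )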

The checker creates and initializes the tokenizer at https://github.com/sphinx-contrib/spelling/blob/master/sphinxcontrib/spelling/checker.py#L33

I'm not sure how to set up the tokenizer to accept the additional argument. As I said in an earlier comment, it doesn't appear to be part of the public API for accessing a tokenizer. Are we intended to configure the tokenizer after it is instantiated? Perhaps @dmerejkowsky can offer some advice about that.

> Side note: make sure to set smartquotes=False in conf.py (the default is True), or else single quotes and hyphens get converted to their "smart" typographical equivalents before the spellchecker runs, flagging words such as "you've" and "doesn't". I did not see this mentioned in the docs when I did a quick search. Would this be a good thing to include in the docs?

We could handle that a couple of ways. Long term, I don't think we want to require users to turn off smart quotes, because that would affect how their content is published. I see two problems to solve: the characters we want the tokenizer to remove, and the characters we want it to keep.

We would want the tokenizer to remove smart double quotes. I saw some logic for unicode processing in PyEnchant, but I didn't study it closely. Does it ignore smart quotes? If not, can we update it? Or give it an option that would cause it to? We could use the Sphinx config option for smart quote handling to control how we configure PyEnchant.

For the characters we want to keep, we could conceivably add a smart hyphen and smart single quote (apostrophe) to the allowed characters list, if the ASCII versions are present in that list. The resulting words still might not match the custom word list, though, because the smart characters wouldn't match the ASCII characters. So, perhaps a filter is a better approach? I'll have to give that more thought.
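
A rough sketch of the filter idea, assuming enchant.tokenize's Filter class and its _split() hook (the character mapping is illustrative). One caveat: filters only see tokens after the tokenizer has already split the text, so this may not help on its own:

    # Sketch: normalize typographic characters back to ASCII before lookup
    from enchant.tokenize import Filter, unit_tokenize

    class SmartCharFilter(Filter):
        # U+2019 (right single quote) -> "'", U+2010/U+2011 (hyphens) -> "-"
        _table = str.maketrans({"\u2019": "'", "\u2010": "-", "\u2011": "-"})

        def _split(self, word):
            return unit_tokenize(word.translate(self._table))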

Let's focus on getting the extra characters down to the tokenizer, first. Once we have that, we at least have a work-around of disabling smart quotes, and that will give us time to work out a more elegant solution.

@stevepiercy

Picking this up again in draft PR #147. Feedback appreciated.

> I'm not sure how to set up the tokenizer to accept the additional argument. As I said in an earlier comment, it doesn't appear to be part of the public API for accessing a tokenizer. Are we intended to configure the tokenizer after it is instantiated? Perhaps @dmerejkowsky can offer some advice about that.

I don't have any idea either. @dmerejkowsky do you have guidance?
