Feature request - a better way to add support for illegal word combinations #120

oharboe · 2021-03-08T07:30:42Z

In our technical documentation, we want to catch combinations like "dual port", which is ambigious and replace it with clearer terms.

I've added the below in conf.py to check for such illegal word combinations.

Is there a better way to do this?

with open(os.path.join(root_folder, 'nomenclature', 'nomenclature.yaml')) as f:
    forbidden_word_combinations = list(
        map(
            lambda word: word.lower().split(),
            filter(lambda word: len(word.split()) > 1,
                   yaml.safe_load(f)['forbidden'].keys())))

forbidden_word_combinations_max = max(map(len, forbidden_word_combinations))


class MonkeyPatchedSpellingChecker(
        sphinxcontrib.spelling.checker.SpellingChecker):
    def __init__(self,
                 lang,
                 suggest,
                 word_list_filename,
                 tokenizer_lang='en_US',
                 filters=None,
                 context_line=False):
        super().__init__(lang, suggest, word_list_filename, tokenizer_lang,
                         filters, context_line)

    def check(self, text):
        """Yields bad words and suggested alternate spellings.
        """
        window = []
        for word, pos in self.tokenizer(text):
            window.append(word.lower())
            if len(window) > forbidden_word_combinations_max:
                window = window[1:forbidden_word_combinations_max + 1]
            illegal_combo = any(
                map(lambda combo: combo == window[0:len(combo)],
                    forbidden_word_combinations))
            correct = self.dictionary.check(word) and not illegal_combo
            if correct:
                continue

            suggestions = self.dictionary.suggest(word) if self.suggest else []
            line = sphinxcontrib.spelling.checker.line_of_index(
                text, pos) if self.context_line else ""

            yield (' '.join(window)
                   if illegal_combo else word), suggestions, line
        return


sphinxcontrib.spelling.checker.SpellingChecker = MonkeyPatchedSpellingChecker

The text was updated successfully, but these errors were encountered:

dhellmann · 2021-03-08T17:02:19Z

I like the idea of a "jargon" detector. The jargon isn't really misspelling, so it may work better in its own extension instead of trying to add it to this one. Maybe sphinxcontrib.jargon_detector? :-)

You would need to create a builder, but you wouldn't need the directive if the jargon file is always going to be outside of the documentation. The jargon file could probably just be a plain text file with 1 phrase per line, like the spelling word list file is. That would make it simpler to read than the YAML file. Unless maybe the input file offers suggested alternatives? In that case you would want to something more structured.

To look for multi-word strings, it might be simpler to look for the entire phrase in the input text as a substring, instead of running the text through the tokenizer and building up sequences of words. A substring search would be more efficient, and would avoid problems with the tokenizer changing the input (for example, contractions come out as 2 tokens today).

oharboe · 2021-03-08T17:07:42Z

@dhellmann We're also investigating https://github.com/errata-ai/vale and wondering if that's really what we need to set up rules for consistent terminology.

dhellmann · 2021-03-08T18:34:40Z

Vale does look useful. Thanks for the link!

oharboe · 2021-03-09T07:54:45Z

Some more monkey patching. We haven't looked into vale yet. It may be a better way to go about this than to try to force sphinxcontrib spelling into this service.

I share the conf.py between all documents and I have a I have a .yaml file per doc where I can enable/disable forbidden words checking and exclude file names. Excluding a page from checking is useful for the .rst files that lists all the forbidden words :-)

From conf.py:

class MonkeyPatchedSpellingBuilder(sphinxcontrib.spelling.SpellingBuilder):
    def _find_misspellings(self, docname, doctree):
        excluded = Matcher(settings['forbidden-excluded-pattern'])
        self.checker.illegal_words = settings['illegal-words'] and not excluded(
            self.env.doc2path(docname, None))
        return super()._find_misspellings(docname, doctree)


sphinxcontrib.spelling.SpellingBuilder = MonkeyPatchedSpellingBuilder

IPsec should not be IPSec. Fix all occurrences in the docs, and in source files used to auto-generate documentation. This commit does not change the occurrences of "IPSec" in the rest of the code base. Ideally, we would like to prevent "IPSec" to come back to the documentation via spell checks. However, I could not find a good solution to do that. This is based on the following considerations: - We currently have "ipsec" and "IPsec" in the list of spelling exceptions (the second is probably useless as the first should be treated by the spell checker as case-insensitive). They correspond to the syntax we accept when "ipsec" occurs in some technical terms, and for the rightfully-spelt "IPsec", respectively. The list of exceptions does not contain "IPSec" (even though it should be allowed due to "ipsec" being case-insensitive). - We can remove both occurrences from the list of spelling exceptions, and cover them instead by a custom filter, the same way as we do for "wireguard" and "WireGuard". This solution, however, does not work, for two reasons. - One reason is that "ipsec" sometimes appear in the middle of a compound-term, as in "something-ipsec-something", and a custom filter modeled after the one for WireGuard wouldn't "skip" it. We would have to make it more robust to detect "ipsec" in the middle of such compounds. - The other reason is that even if we accept only "ipsec" and "IPsec" in the custom filter, ... it turns out that the spell checker still accepts "IPSec". After some investigation, it turns out that this is because the spell checker configuration option spelling_ignore_wiki_words defaults to True. This option determines "whether words that follow the CamelCase conventions used for page names in wikis should be treated as spelled properly." As it turns out, "IPsec" does not follow these conventions, at least in the eyes of the spell checker, but "IPSec" does, making it a valid word. - Can we easily update a custom filter to make the spell checker reject a specific word? The answer is no, we can only skip words (and consider them correct) or tokenize them more: sphinx-contrib/spelling#120. - As a consequence, the best way to reject "IPSec" would be to set spelling_ignore_wiki_words to False in the configuration file. So I tried that, and obtained: Please fix the following documentation warnings: WARNING: Found 1271 misspelled words It turns out we have a lot of words that implicitely follow camel case conventions and are skipped by the spell checker for that reason. It's not worth adding all of them to the list of exceptions. Let's just accept that "IPSec" may come back, and clean it up from time to time. Signed-off-by: Quentin Monnet <[email protected]>

dhellmann mentioned this issue May 25, 2022

[RFE] Add the ability to check for non-desirable words #148

Closed

dhellmann added the enhancement label May 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request - a better way to add support for illegal word combinations #120

Feature request - a better way to add support for illegal word combinations #120

oharboe commented Mar 8, 2021

dhellmann commented Mar 8, 2021

oharboe commented Mar 8, 2021

dhellmann commented Mar 8, 2021

oharboe commented Mar 9, 2021

Feature request - a better way to add support for illegal word combinations #120

Feature request - a better way to add support for illegal word combinations #120

Comments

oharboe commented Mar 8, 2021

dhellmann commented Mar 8, 2021

oharboe commented Mar 8, 2021

dhellmann commented Mar 8, 2021

oharboe commented Mar 9, 2021