Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request - a better way to add support for illegal word combinations #120

Open
oharboe opened this issue Mar 8, 2021 · 4 comments

Comments

@oharboe
Copy link

oharboe commented Mar 8, 2021

In our technical documentation, we want to catch combinations like "dual port", which is ambigious and replace it with clearer terms.

I've added the below in conf.py to check for such illegal word combinations.

Is there a better way to do this?

with open(os.path.join(root_folder, 'nomenclature', 'nomenclature.yaml')) as f:
    forbidden_word_combinations = list(
        map(
            lambda word: word.lower().split(),
            filter(lambda word: len(word.split()) > 1,
                   yaml.safe_load(f)['forbidden'].keys())))

forbidden_word_combinations_max = max(map(len, forbidden_word_combinations))


class MonkeyPatchedSpellingChecker(
        sphinxcontrib.spelling.checker.SpellingChecker):
    def __init__(self,
                 lang,
                 suggest,
                 word_list_filename,
                 tokenizer_lang='en_US',
                 filters=None,
                 context_line=False):
        super().__init__(lang, suggest, word_list_filename, tokenizer_lang,
                         filters, context_line)

    def check(self, text):
        """Yields bad words and suggested alternate spellings.
        """
        window = []
        for word, pos in self.tokenizer(text):
            window.append(word.lower())
            if len(window) > forbidden_word_combinations_max:
                window = window[1:forbidden_word_combinations_max + 1]
            illegal_combo = any(
                map(lambda combo: combo == window[0:len(combo)],
                    forbidden_word_combinations))
            correct = self.dictionary.check(word) and not illegal_combo
            if correct:
                continue

            suggestions = self.dictionary.suggest(word) if self.suggest else []
            line = sphinxcontrib.spelling.checker.line_of_index(
                text, pos) if self.context_line else ""

            yield (' '.join(window)
                   if illegal_combo else word), suggestions, line
        return


sphinxcontrib.spelling.checker.SpellingChecker = MonkeyPatchedSpellingChecker
@dhellmann
Copy link
Member

I like the idea of a "jargon" detector. The jargon isn't really misspelling, so it may work better in its own extension instead of trying to add it to this one. Maybe sphinxcontrib.jargon_detector? :-)

You would need to create a builder, but you wouldn't need the directive if the jargon file is always going to be outside of the documentation. The jargon file could probably just be a plain text file with 1 phrase per line, like the spelling word list file is. That would make it simpler to read than the YAML file. Unless maybe the input file offers suggested alternatives? In that case you would want to something more structured.

To look for multi-word strings, it might be simpler to look for the entire phrase in the input text as a substring, instead of running the text through the tokenizer and building up sequences of words. A substring search would be more efficient, and would avoid problems with the tokenizer changing the input (for example, contractions come out as 2 tokens today).

@oharboe
Copy link
Author

oharboe commented Mar 8, 2021

@dhellmann We're also investigating https://github.com/errata-ai/vale and wondering if that's really what we need to set up rules for consistent terminology.

@dhellmann
Copy link
Member

Vale does look useful. Thanks for the link!

@oharboe
Copy link
Author

oharboe commented Mar 9, 2021

Some more monkey patching. We haven't looked into vale yet. It may be a better way to go about this than to try to force sphinxcontrib spelling into this service.

I share the conf.py between all documents and I have a I have a .yaml file per doc where I can enable/disable forbidden words checking and exclude file names. Excluding a page from checking is useful for the .rst files that lists all the forbidden words :-)

From conf.py:

class MonkeyPatchedSpellingBuilder(sphinxcontrib.spelling.SpellingBuilder):
    def _find_misspellings(self, docname, doctree):
        excluded = Matcher(settings['forbidden-excluded-pattern'])
        self.checker.illegal_words = settings['illegal-words'] and not excluded(
            self.env.doc2path(docname, None))
        return super()._find_misspellings(docname, doctree)


sphinxcontrib.spelling.SpellingBuilder = MonkeyPatchedSpellingBuilder

qmonnet added a commit to qmonnet/cilium that referenced this issue Apr 16, 2024
IPsec should not be IPSec. Fix all occurrences in the docs, and in
source files used to auto-generate documentation.

This commit does not change the occurrences of "IPSec" in the rest of
the code base.

Ideally, we would like to prevent "IPSec" to come back to the
documentation via spell checks. However, I could not find a good
solution to do that. This is based on the following considerations:

- We currently have "ipsec" and "IPsec" in the list of spelling
  exceptions (the second is probably useless as the first should be
  treated by the spell checker as case-insensitive). They correspond to
  the syntax we accept when "ipsec" occurs in some technical terms, and
  for the rightfully-spelt "IPsec", respectively. The list of exceptions
  does not contain "IPSec" (even though it should be allowed due to
  "ipsec" being case-insensitive).

- We can remove both occurrences from the list of spelling exceptions,
  and cover them instead by a custom filter, the same way as we do for
  "wireguard" and "WireGuard". This solution, however, does not work,
  for two reasons.

- One reason is that "ipsec" sometimes appear in the middle of a
  compound-term, as in "something-ipsec-something", and a custom filter
  modeled after the one for WireGuard wouldn't "skip" it. We would have
  to make it more robust to detect "ipsec" in the middle of such
  compounds.

- The other reason is that even if we accept only "ipsec" and "IPsec" in
  the custom filter, ... it turns out that the spell checker still
  accepts "IPSec". After some investigation, it turns out that this is
  because the spell checker configuration option
  spelling_ignore_wiki_words defaults to True. This option determines
  "whether words that follow the CamelCase conventions used for page
  names in wikis should be treated as spelled properly." As it turns
  out, "IPsec" does not follow these conventions, at least in the eyes
  of the spell checker, but "IPSec" does, making it a valid word.

- Can we easily update a custom filter to make the spell checker reject
  a specific word? The answer is no, we can only skip words (and
  consider them correct) or tokenize them more:
  sphinx-contrib/spelling#120.

- As a consequence, the best way to reject "IPSec" would be to set
  spelling_ignore_wiki_words to False in the configuration file. So I
  tried that, and obtained:

    Please fix the following documentation warnings:
    WARNING: Found 1271 misspelled words

  It turns out we have a lot of words that implicitely follow camel case
  conventions and are skipped by the spell checker for that reason. It's
  not worth adding all of them to the list of exceptions.

Let's just accept that "IPSec" may come back, and clean it up from time
to time.

Signed-off-by: Quentin Monnet <[email protected]>
qmonnet added a commit to cilium/cilium that referenced this issue Apr 16, 2024
IPsec should not be IPSec. Fix all occurrences in the docs, and in
source files used to auto-generate documentation.

This commit does not change the occurrences of "IPSec" in the rest of
the code base.

Ideally, we would like to prevent "IPSec" to come back to the
documentation via spell checks. However, I could not find a good
solution to do that. This is based on the following considerations:

- We currently have "ipsec" and "IPsec" in the list of spelling
  exceptions (the second is probably useless as the first should be
  treated by the spell checker as case-insensitive). They correspond to
  the syntax we accept when "ipsec" occurs in some technical terms, and
  for the rightfully-spelt "IPsec", respectively. The list of exceptions
  does not contain "IPSec" (even though it should be allowed due to
  "ipsec" being case-insensitive).

- We can remove both occurrences from the list of spelling exceptions,
  and cover them instead by a custom filter, the same way as we do for
  "wireguard" and "WireGuard". This solution, however, does not work,
  for two reasons.

- One reason is that "ipsec" sometimes appear in the middle of a
  compound-term, as in "something-ipsec-something", and a custom filter
  modeled after the one for WireGuard wouldn't "skip" it. We would have
  to make it more robust to detect "ipsec" in the middle of such
  compounds.

- The other reason is that even if we accept only "ipsec" and "IPsec" in
  the custom filter, ... it turns out that the spell checker still
  accepts "IPSec". After some investigation, it turns out that this is
  because the spell checker configuration option
  spelling_ignore_wiki_words defaults to True. This option determines
  "whether words that follow the CamelCase conventions used for page
  names in wikis should be treated as spelled properly." As it turns
  out, "IPsec" does not follow these conventions, at least in the eyes
  of the spell checker, but "IPSec" does, making it a valid word.

- Can we easily update a custom filter to make the spell checker reject
  a specific word? The answer is no, we can only skip words (and
  consider them correct) or tokenize them more:
  sphinx-contrib/spelling#120.

- As a consequence, the best way to reject "IPSec" would be to set
  spelling_ignore_wiki_words to False in the configuration file. So I
  tried that, and obtained:

    Please fix the following documentation warnings:
    WARNING: Found 1271 misspelled words

  It turns out we have a lot of words that implicitely follow camel case
  conventions and are skipped by the spell checker for that reason. It's
  not worth adding all of them to the list of exceptions.

Let's just accept that "IPSec" may come back, and clean it up from time
to time.

Signed-off-by: Quentin Monnet <[email protected]>
qmonnet added a commit to qmonnet/cilium that referenced this issue Apr 23, 2024
IPsec should not be IPSec. Fix all occurrences in the docs, and in
source files used to auto-generate documentation.

This commit does not change the occurrences of "IPSec" in the rest of
the code base.

Ideally, we would like to prevent "IPSec" to come back to the
documentation via spell checks. However, I could not find a good
solution to do that. This is based on the following considerations:

- We currently have "ipsec" and "IPsec" in the list of spelling
  exceptions (the second is probably useless as the first should be
  treated by the spell checker as case-insensitive). They correspond to
  the syntax we accept when "ipsec" occurs in some technical terms, and
  for the rightfully-spelt "IPsec", respectively. The list of exceptions
  does not contain "IPSec" (even though it should be allowed due to
  "ipsec" being case-insensitive).

- We can remove both occurrences from the list of spelling exceptions,
  and cover them instead by a custom filter, the same way as we do for
  "wireguard" and "WireGuard". This solution, however, does not work,
  for two reasons.

- One reason is that "ipsec" sometimes appear in the middle of a
  compound-term, as in "something-ipsec-something", and a custom filter
  modeled after the one for WireGuard wouldn't "skip" it. We would have
  to make it more robust to detect "ipsec" in the middle of such
  compounds.

- The other reason is that even if we accept only "ipsec" and "IPsec" in
  the custom filter, ... it turns out that the spell checker still
  accepts "IPSec". After some investigation, it turns out that this is
  because the spell checker configuration option
  spelling_ignore_wiki_words defaults to True. This option determines
  "whether words that follow the CamelCase conventions used for page
  names in wikis should be treated as spelled properly." As it turns
  out, "IPsec" does not follow these conventions, at least in the eyes
  of the spell checker, but "IPSec" does, making it a valid word.

- Can we easily update a custom filter to make the spell checker reject
  a specific word? The answer is no, we can only skip words (and
  consider them correct) or tokenize them more:
  sphinx-contrib/spelling#120.

- As a consequence, the best way to reject "IPSec" would be to set
  spelling_ignore_wiki_words to False in the configuration file. So I
  tried that, and obtained:

    Please fix the following documentation warnings:
    WARNING: Found 1271 misspelled words

  It turns out we have a lot of words that implicitely follow camel case
  conventions and are skipped by the spell checker for that reason. It's
  not worth adding all of them to the list of exceptions.

Let's just accept that "IPSec" may come back, and clean it up from time
to time.

Signed-off-by: Quentin Monnet <[email protected]>
github-merge-queue bot pushed a commit to cilium/cilium that referenced this issue Apr 23, 2024
IPsec should not be IPSec. Fix all occurrences in the docs, and in
source files used to auto-generate documentation.

This commit does not change the occurrences of "IPSec" in the rest of
the code base.

Ideally, we would like to prevent "IPSec" to come back to the
documentation via spell checks. However, I could not find a good
solution to do that. This is based on the following considerations:

- We currently have "ipsec" and "IPsec" in the list of spelling
  exceptions (the second is probably useless as the first should be
  treated by the spell checker as case-insensitive). They correspond to
  the syntax we accept when "ipsec" occurs in some technical terms, and
  for the rightfully-spelt "IPsec", respectively. The list of exceptions
  does not contain "IPSec" (even though it should be allowed due to
  "ipsec" being case-insensitive).

- We can remove both occurrences from the list of spelling exceptions,
  and cover them instead by a custom filter, the same way as we do for
  "wireguard" and "WireGuard". This solution, however, does not work,
  for two reasons.

- One reason is that "ipsec" sometimes appear in the middle of a
  compound-term, as in "something-ipsec-something", and a custom filter
  modeled after the one for WireGuard wouldn't "skip" it. We would have
  to make it more robust to detect "ipsec" in the middle of such
  compounds.

- The other reason is that even if we accept only "ipsec" and "IPsec" in
  the custom filter, ... it turns out that the spell checker still
  accepts "IPSec". After some investigation, it turns out that this is
  because the spell checker configuration option
  spelling_ignore_wiki_words defaults to True. This option determines
  "whether words that follow the CamelCase conventions used for page
  names in wikis should be treated as spelled properly." As it turns
  out, "IPsec" does not follow these conventions, at least in the eyes
  of the spell checker, but "IPSec" does, making it a valid word.

- Can we easily update a custom filter to make the spell checker reject
  a specific word? The answer is no, we can only skip words (and
  consider them correct) or tokenize them more:
  sphinx-contrib/spelling#120.

- As a consequence, the best way to reject "IPSec" would be to set
  spelling_ignore_wiki_words to False in the configuration file. So I
  tried that, and obtained:

    Please fix the following documentation warnings:
    WARNING: Found 1271 misspelled words

  It turns out we have a lot of words that implicitely follow camel case
  conventions and are skipped by the spell checker for that reason. It's
  not worth adding all of them to the list of exceptions.

Let's just accept that "IPSec" may come back, and clean it up from time
to time.

Signed-off-by: Quentin Monnet <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants