Feature request - a better way to add support for illegal word combinations #120
Comments
I like the idea of a "jargon" detector. The jargon isn't really misspelling, so it may work better in its own extension instead of trying to add it to this one. You would need to create a builder, but you wouldn't need the directive if the jargon file is always going to be outside of the documentation.

The jargon file could probably just be a plain text file with 1 phrase per line, like the spelling word list file is. That would make it simpler to read than the YAML file. Unless maybe the input file offers suggested alternatives? In that case you would want something more structured.

To look for multi-word strings, it might be simpler to look for the entire phrase in the input text as a substring, instead of running the text through the tokenizer and building up sequences of words. A substring search would be more efficient, and would avoid problems with the tokenizer changing the input (for example, contractions come out as 2 tokens today).
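The substring idea above could be sketched roughly as follows. This is only an illustration, not code from sphinxcontrib-spelling: the names `load_phrases`, `find_phrases`, and `Hit` are hypothetical, and skipping `#` comment lines in the phrase file is an assumption.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    """One forbidden phrase found on one line (hypothetical helper type)."""
    line_no: int
    phrase: str


def load_phrases(text: str) -> List[str]:
    """Parse a phrase list with one phrase per line.

    Blank lines and lines starting with '#' are skipped (an assumption,
    not something the thread specifies).
    """
    phrases = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            phrases.append(line)
    return phrases


def find_phrases(doc_text: str, phrases: Iterable[str]) -> Iterator[Hit]:
    """Yield a Hit for every phrase that occurs as a substring of a line.

    Matching is case-insensitive and collapses runs of whitespace, so
    "dual  port" and "Dual Port" both match the phrase "dual port".
    """
    lowered = [p.lower() for p in phrases]
    for line_no, line in enumerate(doc_text.splitlines(), start=1):
        haystack = " ".join(line.lower().split())
        for phrase in lowered:
            if phrase in haystack:
                yield Hit(line_no, phrase)


if __name__ == "__main__":
    phrases = load_phrases("# forbidden combinations\ndual port\n")
    for hit in find_phrases("The card has a dual port mode.", phrases):
        print(f"line {hit.line_no}: found '{hit.phrase}'")
```

A plain substring test also matches inside longer words and misses phrases that wrap across a line break, so a real checker would probably need word-boundary handling. Case-insensitive matching would also not help with case-only rules such as "IPSec" versus "IPsec".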
@dhellmann We're also investigating https://github.com/errata-ai/vale and wondering if that's really what we need to set up rules for consistent terminology.
Vale does look useful. Thanks for the link!
Some more monkey patching. We haven't looked into Vale yet. It may be a better way to go about this than trying to force sphinxcontrib spelling into this service. I share the conf.py between all documents, and I have a .yaml file per doc where I can enable/disable forbidden words checking and exclude file names. Excluding a page from checking is useful for the .rst files that list all the forbidden words :-) From conf.py:
IPsec should not be IPSec. Fix all occurrences in the docs, and in source files used to auto-generate documentation. This commit does not change the occurrences of "IPSec" in the rest of the code base.

Ideally, we would like to prevent "IPSec" from coming back to the documentation via spell checks. However, I could not find a good solution to do that. This is based on the following considerations:

- We currently have "ipsec" and "IPsec" in the list of spelling exceptions (the second is probably useless as the first should be treated by the spell checker as case-insensitive). They correspond to the syntax we accept when "ipsec" occurs in some technical terms, and for the rightfully-spelt "IPsec", respectively. The list of exceptions does not contain "IPSec" (even though it should be allowed due to "ipsec" being case-insensitive).
- We can remove both occurrences from the list of spelling exceptions, and cover them instead by a custom filter, the same way as we do for "wireguard" and "WireGuard". This solution, however, does not work, for two reasons.
  - One reason is that "ipsec" sometimes appears in the middle of a compound term, as in "something-ipsec-something", and a custom filter modeled after the one for WireGuard wouldn't "skip" it. We would have to make it more robust to detect "ipsec" in the middle of such compounds.
  - The other reason is that even if we accept only "ipsec" and "IPsec" in the custom filter, ... it turns out that the spell checker still accepts "IPSec". After some investigation, it turns out that this is because the spell checker configuration option spelling_ignore_wiki_words defaults to True. This option determines "whether words that follow the CamelCase conventions used for page names in wikis should be treated as spelled properly." As it turns out, "IPsec" does not follow these conventions, at least in the eyes of the spell checker, but "IPSec" does, making it a valid word.
- Can we easily update a custom filter to make the spell checker reject a specific word? The answer is no, we can only skip words (and consider them correct) or tokenize them more: sphinx-contrib/spelling#120.
- As a consequence, the best way to reject "IPSec" would be to set spelling_ignore_wiki_words to False in the configuration file. So I tried that, and obtained:

  Please fix the following documentation warnings:
  WARNING: Found 1271 misspelled words

It turns out we have a lot of words that implicitly follow camel case conventions and are skipped by the spell checker for that reason. It's not worth adding all of them to the list of exceptions. Let's just accept that "IPSec" may come back, and clean it up from time to time.

Signed-off-by: Quentin Monnet <[email protected]>
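To illustrate why a CamelCase rule can accept "IPSec" while leaving "IPsec" to the dictionary, here is a small sketch. The regular expression is an assumption chosen to resemble a typical wiki-word pattern (an uppercase letter, some word characters, another uppercase run, more word characters). The filter actually used by the spell checker may differ, and `looks_like_wiki_word` is a hypothetical helper.

```python
import re

# ASSUMPTION: a wiki-word pattern in the spirit of the CamelCase rule quoted
# above. This is not confirmed to be the expression the spell checker uses.
WIKI_WORD = re.compile(r"^[A-Z]\w+[A-Z]+\w+")


def looks_like_wiki_word(word: str) -> bool:
    """Return True if the word would be skipped by the assumed CamelCase rule."""
    return WIKI_WORD.match(word) is not None


if __name__ == "__main__":
    for word in ("IPsec", "IPSec", "WireGuard", "ipsec"):
        status = "skipped as CamelCase" if looks_like_wiki_word(word) else "spell-checked"
        print(f"{word}: {status}")
```

Under this assumed pattern, "IPSec" has a second uppercase letter at the third position or later with lowercase letters after it, so it qualifies, while "IPsec" does not. Setting `spelling_ignore_wiki_words = False` in conf.py turns the rule off for every such word at once, which is consistent with the jump to 1271 reported misspellings described above.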
In our technical documentation, we want to catch combinations like "dual port", which is ambiguous, and replace them with clearer terms.
I've added the code below to conf.py to check for such illegal word combinations.
Is there a better way to do this?