Improvements in word splitting #61

Open
jonyscathe opened this issue Mar 1, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

jonyscathe commented Mar 1, 2022

Just wondering if there is any reason why camel case words aren't always split when checking spelling.

In particular, if there are camel case words in a comment, they are flagged with SC100.
Also, if there is some terribly named variable like TestData_apple, SC200 will flag 'TestData' as a spelling error, whereas I would have expected 'Test' and 'Data' to be checked separately by flake8-spellcheck.

The other word splitting choice I am curious about is "words" that have digits in them. I have an electronics library with a fairly ridiculous whitelist containing things like 100n, 10m, 10n5, 1n, 1k, 200m, 250V, 250VAC, 33R4, 470R, 6V3, etc., which is a bit cumbersome.

Edit: Another odd word splitting thing.
If I have pd.Timestamp('2017-06-01T12')) then everything is fine. However, if I have # pd.Timestamp('2017-06-01T12')) then I get a misspelt word of '2017. I understand single quotes are really apostrophes (I have argued within my team for double quoted strings with no luck), so splitting up words on single quotes isn't possible, but it is a bit annoying having to put things like '2017 in my whitelist.

MichaelAquilina (Owner) commented Mar 4, 2022

> In particular, if there are camel case words in a comment, they are flagged with SC100.
> Also, if there is some terribly named variable like TestData_apple, SC200 will flag 'TestData' as a spelling error, whereas I would have expected 'Test' and 'Data' to be checked separately by flake8-spellcheck.

The reason for this is that words are classified as either "camel case" or "snake case" with a simple set of heuristics.

So in the case of TestData_apple, the _ would classify this word as snake cased and split it into the tokens TestData and apple.

Admittedly, this could be improved, but it would mean introducing some potential future corner cases and further processing (although I am open to the idea if you want to have a go at opening a PR to try to fix this).
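
As a rough sketch of what checking 'Test' and 'Data' separately could look like: split on underscores first, then split each piece on case boundaries. This is not the plugin's current code; `split_identifier` and `_CASE_PART` are hypothetical names used only for illustration.

```python
import re

# Hypothetical helper, not part of flake8-spellcheck.
# Alternatives, in order: an acronym run (capitals not followed by a
# lowercase letter), a capitalised or lowercase word, or a run of digits.
_CASE_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name):
    """Split on underscores, then split every piece on case boundaries."""
    words = []
    for piece in name.split("_"):
        words.extend(_CASE_PART.findall(piece))
    return words


print(split_identifier("TestData_apple"))
```

With this approach a mixed name such as TestData_apple is no longer forced into a single "snake case" or "camel case" bucket. The corner cases mentioned above still apply, for example deciding where an acronym ends and the next word begins.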

> The other word splitting choice I am curious about is "words" that have digits in them. I have an electronics library with a fairly ridiculous whitelist containing things like 100n, 10m, 10n5, 1n, 1k, 200m, 250V, 250VAC, 33R4, 470R, 6V3, etc., which is a bit cumbersome.

Yep, this is not currently catered for in the plugin. Again, it is not an impossible issue to solve, but it would introduce some corner cases of its own. A few approaches could be:

  • allow anything of the form <number><character>
  • allow the user to specify regex or glob patterns in the whitelist file
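
Both approaches above could be sketched roughly as follows. This is only an illustration: `NUMBER_UNIT` and `is_allowed` are hypothetical names, the pattern is one loose reading of `<number><character>` (digits, then letters, then optional trailing digits), and the assumption that whitelist entries are stored in lower case is made here for simplicity.

```python
import re

# Assumption: digits, then one or more letters, then optional digits.
# This covers values such as 100n, 250VAC, 10n5 and 6V3.
NUMBER_UNIT = re.compile(r"\d+[A-Za-z]+\d*")


def is_allowed(word, whitelist, patterns=(NUMBER_UNIT,)):
    """Accept a word if it is whitelisted literally or fully matches a pattern."""
    if word.lower() in whitelist:
        return True
    return any(pattern.fullmatch(word) for pattern in patterns)


# User-supplied patterns (the second approach) would simply be compiled
# and passed in through the `patterns` argument.
for value in ["100n", "10n5", "250VAC", "33R4", "6V3"]:
    print(value, is_allowed(value, whitelist=set()))
```

Using `fullmatch` rather than `search` keeps the rule narrow, so an ordinary misspelt word that merely contains a digit is still reported.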

> If I have pd.Timestamp('2017-06-01T12')) then everything is fine. However, if I have # pd.Timestamp('2017-06-01T12')) then I get a misspelt word of '2017. I understand single quotes are really apostrophes (I have argued within my team for double quoted strings with no luck), so splitting up words on single quotes isn't possible, but it is a bit annoying having to put things like '2017 in my whitelist.

Agreed, that sounds annoying. Unfortunately, it's quite tough to cover these cases correctly without some human intervention, because this is essentially impossible to parse correctly without running it through an AST. I would suggest marking the line as ignored for the specific flake8 error code this plugin gives.
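
For the commented-out line in question, that might look like the snippet below. It assumes flake8's standard inline `# noqa: <code>` syntax and that SC100 is the code reported for comments, as described earlier in this thread. `START` is only a placeholder statement so the snippet runs on its own.

```python
from datetime import datetime

# Commented-out call kept for reference. The trailing marker asks flake8
# to skip SC100 on this one line only.
# pd.Timestamp('2017-06-01T12')  # noqa: SC100
START = datetime(2017, 6, 1, 12)  # placeholder, not from the thread
```

Naming the code after `noqa:` keeps other checks active on that line, which is safer than a bare `# noqa`.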

MichaelAquilina added the enhancement label Mar 4, 2022
MichaelAquilina changed the title from "Word splitting" to "Improvements in word splitting" Mar 4, 2022