Improvements in word splitting #61

jonyscathe · 2022-03-01T22:39:10Z

Just wondering if there is any reason why camel case words aren't always split when checking spelling.

In particular, if there are camel case words in a comment, they are flagged with SC100.
Also, if there is some terribly named variable like TestData_apple, SC200 flags 'TestData' as a spelling error, whereas I would have expected 'Test' and 'Data' to be checked separately by flake8-spellcheck.

The other word splitting choice I am curious about is "words" that have digits in them. I have an electronics library with a fairly ridiculous whitelist containing things like 100n, 10m, 10n5, 1n, 1k, 200m, 250V, 250VAC, 33R4, 470R, 6V3, etc, etc which is a bit cumbersome.

Edit: Another odd word splitting thing.
If I have pd.Timestamp('2017-06-01T12')) then everything is fine, however if I have # pd.Timestamp('2017-06-01T12')) then I get a misspelt word of '2017. I understand single quotes are really apostrophes (I have argued within my team for double quoted strings with no luck) so splitting up words on single quotes isn't possible, but it is a bit annoying having to put things like '2017 in my whitelist.


MichaelAquilina · 2022-03-04T12:11:06Z

In particular, if there are camel case words in a comment, they are flagged with SC100.
Also, if there is some terribly named variable like TestData_apple, SC200 flags 'TestData' as a spelling error, whereas I would have expected 'Test' and 'Data' to be checked separately by flake8-spellcheck.

The reason for this is that words are classified as either "camel case" or "snake case" with a simple set of heuristics.

So in the case of TestData_apple, the _ would classify this word as snake cased and split it into the tokens TestData and apple.

Admittedly, this could be improved, but it would mean further processing and would introduce some potential corner cases of its own (although I am open to the idea if you want to have a go at opening a PR to try to fix this).
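For illustration, here is a minimal sketch of what splitting on both conventions could look like. It is not the plugin's current code; `split_identifier` and the boundary regex are hypothetical names made up for this example, and the corner cases mentioned above (acronyms, digits) are only partly handled.

```python
import re

# Hypothetical helper, NOT part of flake8-spellcheck.
# A camel case boundary is assumed to sit either between a lowercase
# letter/digit and an uppercase letter ("tD" in "TestData"), or before the
# last capital of an acronym that is followed by a lowercase letter
# ("PR" in "HTTPResponse").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(name):
    """Split on underscores first, then on camel case boundaries."""
    tokens = []
    for part in name.split("_"):
        if part:  # skip empty pieces from leading or doubled underscores
            # Splitting on zero-width matches requires Python 3.7+.
            tokens.extend(_CAMEL_BOUNDARY.split(part))
    return tokens


print(split_identifier("TestData_apple"))
```

With this approach each snake case piece is run through the camel case split as well, so a mixed name like TestData_apple yields three tokens instead of two.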

The other word splitting choice I am curious about is "words" that have digits in them. I have an electronics library with a fairly ridiculous whitelist containing things like 100n, 10m, 10n5, 1n, 1k, 200m, 250V, 250VAC, 33R4, 470R, 6V3, etc, etc which is a bit cumbersome.

Yep, this is not currently catered for in the plugin. Again, it is not an impossible issue to solve, but it would introduce some corner cases of its own. A few approaches could be:

allow anything of the form <number><character>
allow the user to specify regex or glob patterns in the whitelist file
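As a rough sketch of how those two approaches might combine (again, not something the plugin does today): `NUMBER_UNIT`, `is_allowed` and the `patterns` argument are hypothetical names, and the exact shape of the number-plus-letters pattern is an assumption based on the examples in the report.

```python
import re

# Assumed shape for the first approach: digits followed by letters,
# optionally followed by further digit/letter groups (100n, 10n5, 6V3, 250VAC).
NUMBER_UNIT = re.compile(r"\d+[A-Za-z]+(?:\d+[A-Za-z]*)*")


def is_allowed(word, patterns=()):
    """Return True if the word should be skipped by the spell check.

    `patterns` stands in for regex entries read from a whitelist file
    (the second approach); how they would be stored is left open here.
    """
    if NUMBER_UNIT.fullmatch(word):
        return True
    return any(re.fullmatch(pattern, word) for pattern in patterns)


for word in ("100n", "10n5", "250VAC", "33R4", "6V3", "apple"):
    print(word, is_allowed(word))
```

A user-supplied pattern such as `r"'\d{4}"` would then also cover a token like '2017 without adding every year to the whitelist.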

If I have pd.Timestamp('2017-06-01T12')) then everything is fine, however if I have # pd.Timestamp('2017-06-01T12')) then I get a misspelt word of '2017. I understand single quotes are really apostrophes (I have argued within my team for double quoted strings with no luck) so splitting up words on single quotes isn't possible, but it is a bit annoying having to put things like '2017 in my whitelist.

Agreed, that sounds annoying. Unfortunately, it is quite tough to cover these cases correctly without some human intervention, because text inside a comment is essentially impossible to parse correctly without running it through an AST. I would suggest marking the line as ignored for the specific flake8 error code this plugin gives.
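To show why the two lines behave differently, the sketch below uses Python's standard `tokenize` module: the live call yields a STRING token, while the commented-out version is one opaque COMMENT token whose contents are just text. Whether the plugin sees tokens in exactly this way is an assumption; `token_kinds` is a hypothetical helper, and the call is written with balanced parentheses so that it tokenizes cleanly.

```python
import io
import tokenize


def token_kinds(line):
    """Return (token type name, text) pairs for the STRING and COMMENT tokens in a line."""
    readline = io.StringIO(line + "\n").readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.STRING, tokenize.COMMENT)
    ]


live = "pd.Timestamp('2017-06-01T12')"
# The trailing "# noqa: SC100" is the suggested per-line ignore for comments.
commented = "# pd.Timestamp('2017-06-01T12')  # noqa: SC100"

print(token_kinds(live))
print(token_kinds(commented))
```

Because the whole comment arrives as a single piece of text, the quote in front of 2017 cannot be told apart from an apostrophe, which is why a per-line `# noqa: SC100` is the pragmatic workaround.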

MichaelAquilina added the enhancement label Mar 4, 2022

MichaelAquilina changed the title from "Word splitting" to "Improvements in word splitting" Mar 4, 2022

shaleh mentioned this issue Mar 17, 2022

Words that have numbers on the end fail. #63

Open
