Use nltk for frequency analysis, refactoring, and for cleartext brute force? #95
Comments
- but also be able to select a specific corpus with a specific theme, in case we know the type of document to decipher
We currently do not really know where such data comes from. This is not easy to maintain.
Thanks for your comment! Having support for alternate languages is definitely a goal, an issue already exists for that at #31. However, I'd prefer not to add another dependency, especially one which relies on native code objects, since it limits the ways FD/CA can be used. Every additional dependency also means that the project may potentially break if the dependency breaks, or changes its API, making the code harder to maintain. I have some comments on the proposed benefits:
It's true that hard-coding frequency distributions is not ideal, but surely even nltk hard-codes that data somewhere, because the alternative is to regenerate the distribution data at runtime from a bundled corpus, which is far worse from the perspective of both storage and startup time.
There is currently a poorly-advertised script at
A fair criticism. The English language data is based on Charles Dickens' A Tale of Two Cities. I'm sure there is a better corpus that could be used, though I don't understand how this makes the project harder to maintain.
There is already a function called

To me, it seems like effort would be better placed toward generating more frequency distributions, improving the built-in tool for generating frequency data, and, for better readability, changing the frequency module to dynamically load all files from a

I'm still willing to be convinced otherwise, though.
Interesting. We should maybe just document it somewhere.
- nltk could actually be very cool for word-level analysis in different languages, not just character-level analysis.
In my opinion, the best things to do are to:
I would welcome any of these improvements if they could be made without adding nltk as a dependency.
@unicornsasfuel Understood! I will generate more hardcoded stats in more languages, without nltk. In case you want to take a look, I made a repo using nltk to compare: https://github.com/gogo2464/cryptatools/blob/master/cryptalib/frequency.py
@unicornsasfuel Hello. I am very sorry. Since last time, I have started a rewrite, a little bit inspired by featherduster but with a very different philosophy, and I no longer have time for this PR. I may finish it when I have the time, or use my own repo to generate the data and hardcode it in your repo if I find the time. Sorry for the inconvenience.
The frequency analysis values are currently hardcoded in cryptanalib/frequency.py.
I think if we use nltk instead of hardcoding, we will:
- gain in code readability
- be able to get frequencies for more languages
- but also be able to compare brute-forced decrypted text, to check whether the cleartext corresponds to an existing language
I need your opinion. If you agree, I can implement it.