use nltk for frequency analysis, refactoring and for clear text brute force? #95

Open
gogo2464 opened this issue Mar 26, 2022 · 9 comments

@gogo2464

The frequency analysis values are currently hardcoded in cryptanalib/frequency.py.

I think if we use nltk instead of hardcoding (sketched below), we will:
-gain in code readability
-be able to get frequencies for more languages
-but also be able to compare brute-forced decrypted text, to check whether the clear text corresponds to an existing language

I need your opinion. If you agree, I can implement it.
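For illustration, a minimal sketch of what deriving character frequencies from an nltk corpus could look like (FreqDist and the Gutenberg corpus are real nltk APIs; the output dict shape is only my guess at what cryptanalib's tables look like):

```python
# Minimal sketch: build a character-frequency table from an nltk corpus
# instead of hardcoding it. Assumes nltk is installed; downloads the
# Gutenberg corpus on first run.
import nltk
from nltk import FreqDist
from nltk.corpus import gutenberg

nltk.download('gutenberg', quiet=True)

text = gutenberg.raw('austen-emma.txt').lower()
dist = FreqDist(c for c in text if c.isalpha())

# Normalize raw counts into relative frequencies (assumed table shape).
frequencies = {char: count / dist.N() for char, count in dist.items()}
print(sorted(frequencies.items(), key=lambda kv: -kv[1])[:5])
```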

@gogo2464
Author

-but also to select a specific corpus with a specific theme, in case we know the type of document to decipher
-select a corpus from the publicly available books published by the victim (or the CTF maker lol!) or with OSINT

@gogo2464
Author

We currently do not really know where such data comes from. This is not easy to maintain.

@unicornsasfuel
Contributor

Thanks for your comment! Having support for alternate languages is definitely a goal; an issue already exists for that at #31.

However, I'd prefer not to add another dependency, especially one which relies on native code objects, since that limits the ways FD/CA can be used. Every additional dependency also means the project may break if the dependency breaks or changes its API, making the code harder to maintain.

I have some comments on the proposed benefits:

gain in code readability

It's true that hard-coding frequency distributions is not ideal, but surely even nltk hard-codes that data somewhere: the alternative would be to hard-code the corpus itself and re-generate the distribution data at runtime, which is far worse in terms of both storage and startup cost.

be able to get frequencies for more languages
but also to select a specific corpus with a specific theme, in case we know the type of document to decipher
select a corpus from the publicly available books published by the victim (or the CTF maker lol!) or with OSINT

There is currently a poorly-advertised script at util/generate_frequency_tables.py that consumes a file and produces a frequency distribution suitable for use with the functions in cryptanalib. It could really use some work, though.
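For reference, a stdlib-only sketch of what such a generator could look like (this is not the actual util/generate_frequency_tables.py; its real interface and output format may differ):

```python
# Sketch of a corpus-to-frequency-table generator, stdlib only.
# Interface and JSON output format are assumptions.
import json
import sys
from collections import Counter

def generate_frequency_table(path):
    with open(path, encoding='utf-8') as f:
        text = f.read().lower()
    counts = Counter(c for c in text if c.isalpha())
    total = sum(counts.values()) or 1  # guard against an empty corpus
    return {char: count / total for char, count in counts.items()}

if __name__ == '__main__':
    json.dump(generate_frequency_table(sys.argv[1]), sys.stdout,
              indent=2, sort_keys=True)
```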

We currently do not really know where such data comes from. This is not easy to maintain.

A fair criticism. The English language data is based on Charles Dickens' A Tale of Two Cities. I'm sure there is a better corpus that could be used, though I don't understand how this makes the project harder to maintain.

but also be able to compare brute-forced decrypted text, to check whether the clear text corresponds to an existing language

There is already a function called detect_plaintext() which does this; it is used in many parts of the existing code to enable many of the implemented attacks, such as the single-byte XOR cipher solver. However, it must be fed a particular distribution dict; it does not attempt to identify which of many distributions the text most closely matches. That would be a nice feature.
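To illustrate the idea (this is not the actual detect_plaintext() implementation, just a chi-squared sketch, including the "best match among many" feature suggested above):

```python
# Chi-squared scoring of a candidate plaintext against a frequency dict.
# Not featherduster's actual detect_plaintext(); a sketch of the concept.
from collections import Counter

def chi_squared_score(candidate, distribution):
    """Lower score means the candidate fits the distribution better."""
    letters = [c for c in candidate.lower() if c.isalpha()]
    if not letters:
        return float('inf')
    observed = Counter(letters)
    total = len(letters)
    score = 0.0
    for char, freq in distribution.items():
        expected = freq * total
        score += (observed.get(char, 0) - expected) ** 2 / expected
    return score

# The suggested feature: pick the closest-matching distribution of many.
def best_matching_language(candidate, distributions):
    return min(distributions,
               key=lambda lang: chi_squared_score(candidate, distributions[lang]))
```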

To me, it seems like effort would be better placed toward generating more frequency distributions, improving the built-in tool for generating frequency data, and, for better readability, changing the frequency module to dynamically load all files from a frequency_tables directory or some such, so users can generate their own frequency data and drop it in easily. It should also be better documented.
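A sketch of that dynamic loading (the directory name and JSON file format are assumptions based on the suggestion above):

```python
# Sketch: load every frequency table found in a frequency_tables/
# directory, so users can drop in their own generated data.
import json
from pathlib import Path

def load_frequency_tables(directory='frequency_tables'):
    tables = {}
    for path in sorted(Path(directory).glob('*.json')):
        with path.open(encoding='utf-8') as f:
            tables[path.stem] = json.load(f)  # e.g. 'english' -> {...}
    return tables
```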

I'm still willing to be convinced otherwise, though.

@gogo2464
Author

A fair criticism. The English language data is based on Charles Dickens' A Tale of Two Cities. I'm sure there is a better corpus that could be used, though I don't understand how this makes the project harder to maintain.

Interesting. Maybe we should just document it somewhere.

@gogo2464
Author

-nltk could actually be very useful for word-level analysis in different languages, not just characters
-select a specific category of text to get frequencies for, like novel, sci-fi, lore, etc.

@gogo2464
Author

In my opinion the best things to do are:
-set default values generated from nltk in place of the current hardcoded ones
-allow people to generate their own frequencies from nltk, to optionally replace the defaults with their own custom corpus
-also add more words from more languages to the hardcoded database

gogo2464 changed the title from "use nltk for frequency analysis, refactoring and for clear text brute force" to "use nltk for frequency analysis, refactoring and for clear text brute force?" on Mar 27, 2022
@unicornsasfuel
Contributor

I would welcome any of these improvements if they could be made without adding nltk as a dependency.

@gogo2464
Author

@unicornsasfuel understood! I will generate more hardcoded stats in more languages, without nltk.

Just in case you want to take a look, I made a repo that uses nltk, for comparison: https://github.com/gogo2464/cryptatools/blob/master/cryptalib/frequency.py

@gogo2464
Author

@unicornsasfuel Hello. I am very sorry. Since last time, I have started a rewrite, a little bit inspired by featherduster but with a very different philosophy.

I do not have time for this PR anymore. Sorry. I may finish it when I have the time.

I may use my own repo to generate the data and hardcode it in your repo if I find the time.

Sorry for the inconvenience.
