This is a fork of endgameinc/homoglyph. It requires Python 3.
This repository adds a CNN class that wraps the functionality and an Archive class that creates a zip archive of the neural network and the needed metadata.
Python 3 required!
This code was written and tested with Python 3.6.
First, install the dependencies from requirements.txt, preferably with pip inside a virtual environment such as virtualenv:
$ pip install -r requirements.txt
The sample models were trained with Arial.ttf; the original repository and the paper also used Arial.ttf.
To get Arial.ttf:
- On Debian or Ubuntu, you can install the ttf-mscorefonts-installer package.
- On Windows, look in C:\Windows\Fonts.
The script is a simple command line utility. Run it like:
$ python --help
Work with a siamese convolutional neural network
--loglevel LOGLEVEL The minimal loglevel to log. [default: warning]
--log-format LOG-FORMAT The log format or the name of the log format.
[default: [%(asctime)s] [%(levelname)-8s]
%(processName)s:%(process)d %(name)s: %(message)s]
--help Show this message and exit.
--version Show the version and exit.
archive Manage archives
predict Predict the similarity of pairs of strings
train Train or retrain a neural network
$ python train --help
Train or retrain a neural network
--font-size INTEGER RANGE The font size [default: 10]
--image-size <INTEGER INTEGER>...
The size of the image [default: 150, 12]
--text-location <INTEGER INTEGER>...
The starting location of the text [default:
0, 0]
--max-epochs INTEGER RANGE The maximum number of epochs to train
[default: 25]
--verbose / --quiet If the progress of the training steps should
be shown [default: False]
--model PATH The path of the archive of the model which
should be retrained
--test If this is a test run with very limited
dataset size [default: False]
--fast If this is a fast run with limited dataset
size [default: False]
--version INTEGER RANGE Which archive version to use
--help Show this message and exit.
To train, you need to specify the font to use (the path to the font file) and the path to the data file.
The data must be a pickled dictionary with the keys train, validate and test.
Each key holds a list of 3-tuples (str, str, float): two strings to compare and their expected similarity.
{
    'train': [('string a', 'string b', 0.1), ...],
    'validate': [('string a', 'string b', 0.1), ...],
    'test': [('string a', 'string b', 0.1), ...],
}
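The structure above can be produced with a few lines of Python. This is only a sketch: it assumes the data file is read with the standard pickle module (the text only says the dictionary is pickled), and the names build_dataset and save_dataset, the 80/10/10 split and the example pairs are made up for illustration. The similarity values are placeholders.

```python
import pickle


def build_dataset(pairs, train_share=0.8, validate_share=0.1):
    """Split (str, str, float) tuples into train, validate and test lists.

    Hypothetical helper: the split ratios are an assumption, not something
    the repository prescribes.
    """
    for first, second, similarity in pairs:
        if not (isinstance(first, str) and isinstance(second, str)):
            raise TypeError("both compared values must be strings")
        float(similarity)  # raises if the similarity is not numeric

    n_train = int(len(pairs) * train_share)
    n_validate = int(len(pairs) * validate_share)
    return {
        "train": pairs[:n_train],
        "validate": pairs[n_train:n_train + n_validate],
        "test": pairs[n_train + n_validate:],
    }


def save_dataset(dataset, path):
    """Write the dictionary as a pickle file that can be passed to train."""
    with open(path, "wb") as handle:
        pickle.dump(dataset, handle)


# Placeholder pairs; real data would hold the strings to compare.
example_pairs = [("string a", "string b", 0.1)] * 10
dataset = build_dataset(example_pairs)
```

Calling save_dataset(dataset, "my_data.pkl") would then write a file (the name is arbitrary) that can be given to the train command as the data path.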
The --test mode drastically reduces the size of each dataset and the maximum number of epochs to train. Use it to check that the program still works.
The --fast mode also reduces the size of each dataset and the maximum number of epochs, but not as much. Use it to get faster but worse results than a full training run.
For example, use the following to train on domain names for test purposes:
$ python train Arial.ttf data\domain.pkl --test
Using TensorFlow backend.
TEST, reducing data and max. epochs
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
The "Using TensorFlow backend." and "I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2" messages are only informational and might vary.
If you want to retrain a neural network, specify the path to the archive of the model to retrain with --model path/to/
Retraining currently only works with version 2 archives!
$ python predict --help
Usage: predict [OPTIONS] MODEL [DOMAINS]...
Predict the similarity of pairs of strings
--threshold THRESHOLD If given display only results where the prediction is
lower or equal this threshold
--help Show this message and exit.
To predict some similarities you need to specify the model to use and one or more string pairs as single strings.
$ python predict, ","
Using TensorFlow backend.
I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
 ~ = 0.05477425083518028
 ~ = 0.8001943230628967
The "Using TensorFlow backend." and "I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2" messages are only informational and might vary.
They come from Keras and TensorFlow and unfortunately can't be disabled.
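The --threshold option from the help output above keeps only results whose prediction is lower than or equal to the given value. The following sketch shows that filtering rule on its own. It is not the code used by the predict command: filter_by_threshold is a hypothetical helper, the string pairs are placeholders, and only the two scores are taken from the sample output above.

```python
def filter_by_threshold(predictions, threshold=None):
    """Keep predictions whose score is lower than or equal to the threshold.

    With no threshold, every prediction is kept, mirroring an optional
    command line flag.
    """
    if threshold is None:
        return list(predictions)
    return [(pair, score) for pair, score in predictions if score <= threshold]


# Placeholder pairs with the two scores from the sample run.
predictions = [
    (("string a", "string b"), 0.05477425083518028),
    (("string c", "string d"), 0.8001943230628967),
]

# A threshold of 0.5 is an arbitrary example value.
kept = filter_by_threshold(predictions, threshold=0.5)
```

With a threshold of 0.5, only the first pair remains in kept, because its score is the only one at or below the threshold.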
The archives are just ZIP files and should have the .zip extension.
They contain metadata in a meta.json file:
- the version of the archive
- a build timestamp in unix time (seconds since 1 January 1970 00:00:00 UTC)
- data about the used font
  - the name of the font file inside the archive
  - a checksum of the font file
  - some font type information
- data about the model
  - the name of the model file inside the archive
  - a checksum of the model file
- data about the used image
  - font size
  - image size
  - starting location of the text
For example:
{
    "version": 2,
    "build": 1533389537,
    "font": {
        "filename": "font.ttf",
        "checksum": "1171028651a0217165684f983cdf3a3b",
        "type": ".ttf",
        "name": "Arial",
        "family": "Arial"
    },
    "model": {
        "filename": "model.h5",
        "checksum": "1f9e8174aea018fb7a9f3f836fc688f9"
    },
    "image": {
        "font_size": 10,
        "image_size": [150, 12],
        "text_location": [0, 0]
    }
}
The archive also contains the needed files (model and font).
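Because an archive is a plain ZIP file, its metadata can be inspected with the Python standard library alone. The sketch below reads meta.json, converts the build timestamp and checks the stored checksums. It assumes the checksums are MD5 hex digests, which the README does not state (the 32-character values in the example only suggest it), and the function names are made up for illustration.

```python
import hashlib
import json
import zipfile
from datetime import datetime, timezone


def read_archive_meta(archive):
    """Return the parsed meta.json of an archive (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read("meta.json").decode("utf-8"))


def build_time(meta):
    """Convert the unix build timestamp into an aware UTC datetime."""
    return datetime.fromtimestamp(meta["build"], tz=timezone.utc)


def verify_archive(archive):
    """Check the font and model files against their stored checksums.

    ASSUMPTION: the checksums are MD5 hex digests of the file contents.
    Returns a dict such as {"font": True, "model": True}.
    """
    with zipfile.ZipFile(archive) as zf:
        meta = json.loads(zf.read("meta.json").decode("utf-8"))
        result = {}
        for key in ("font", "model"):
            entry = meta[key]
            digest = hashlib.md5(zf.read(entry["filename"])).hexdigest()
            result[key] = digest == entry["checksum"]
    return result
```

If the MD5 assumption is wrong, verify_archive will report False for intact archives, so treat a mismatch as a hint to check the hashing scheme first.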
$ python archive
Usage: archive [OPTIONS] COMMAND [ARGS]...
Manage archives
--help Show this message and exit.
pack Create an archive from the given files
unpack Extract the given archive
$ python archive pack --help
Usage: archive pack [OPTIONS] FONT MODEL [ARCHIVE]
Create an v2 archive from the given files
--font-size INTEGER RANGE The font size [default: 10]
--image-size <INTEGER INTEGER>...
The size of the image [default: 150, 12]
--text-location <INTEGER INTEGER>...
The starting location of the text [default:
0, 0]
--help Show this message and exit.
This command packs the given files and metadata into an archive with version 2.
$ python archive pack Arial.ttf model.h5
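To make the archive layout concrete, here is a rough sketch of what packing could look like with the standard library. It is not the implementation behind archive pack: the function name pack_archive is hypothetical, the checksums are assumed to be MD5 hex digests, and the font name and family are passed in by the caller because reading them from the font file would need an extra library.

```python
import hashlib
import json
import time
import zipfile
from pathlib import Path


def pack_archive(font_path, model_path, archive_path, font_name, font_family,
                 font_size=10, image_size=(150, 12), text_location=(0, 0)):
    """Write a version 2 style archive and return the metadata that was stored."""
    font_path, model_path = Path(font_path), Path(model_path)
    font_bytes = font_path.read_bytes()
    model_bytes = model_path.read_bytes()

    meta = {
        "version": 2,
        "build": int(time.time()),  # unix time in seconds
        "font": {
            "filename": "font" + font_path.suffix,
            "checksum": hashlib.md5(font_bytes).hexdigest(),  # assumed MD5
            "type": font_path.suffix,
            "name": font_name,
            "family": font_family,
        },
        "model": {
            "filename": "model" + model_path.suffix,
            "checksum": hashlib.md5(model_bytes).hexdigest(),  # assumed MD5
        },
        "image": {
            "font_size": font_size,
            "image_size": list(image_size),
            "text_location": list(text_location),
        },
    }

    # Store the metadata next to the renamed font and model files.
    with zipfile.ZipFile(str(archive_path), "w") as zf:
        zf.writestr("meta.json", json.dumps(meta, indent=4))
        zf.writestr(meta["font"]["filename"], font_bytes)
        zf.writestr(meta["model"]["filename"], model_bytes)
    return meta
```

The default values mirror the defaults shown in the help output above (font size 10, image size 150 by 12, text location 0, 0).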
$ python archive unpack --help
Usage: archive unpack [OPTIONS] ARCHIVE [OUTPUT]
Extract the given archive
--help Show this message and exit.
This command unpacks the given archive into the given directory, or into the current working directory if no output directory is given.
$ python archive unpack path/to/unpack/to/
$ ls path/to/unpack/to/
font.ttf meta.json model.h5
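The same result can be reached from Python, since the archive is an ordinary ZIP file. The sketch below mirrors the described behaviour (extract to the given directory, or to the current working directory if none is given). The name extract_archive is made up, and this is not the code behind archive unpack.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, output=None):
    """Extract all files of the archive and return their sorted names.

    If no output directory is given, the current working directory is used.
    """
    target = Path(output) if output is not None else Path.cwd()
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(str(target))
        return sorted(zf.namelist())
```

For an archive like the one above, the returned list would contain font.ttf, meta.json and model.h5.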