Korean handwriting dataset parsed from the HangulDB.
Images vary in width and height. To stay consistent with the original data, I intentionally preserve this property.
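Because the image sizes differ, you will probably need to normalize them before batching. The following is only a sketch of one common approach (scale to fit a square, then pad), not something this repo does. The function name `letterbox_geometry` and the default `target=64` are my own assumptions for illustration.

```python
def letterbox_geometry(width, height, target=64):
    """Return (new_w, new_h, pad_left, pad_top) for fitting an image of
    size width x height into a target x target square while keeping
    its aspect ratio. `target` is a hypothetical default."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale so that the longer side exactly matches the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the resized image inside the square.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

The returned values can be passed to whatever image library you use for the actual resize and paste.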
This repo contains PE92, SERI95, and HanDB.
- PE92 contains 2350 classes, each with 100 samples.
- SERI95 contains 520 classes, each with 1000 samples.
- HanDB merges SERI95 and PE92. That is, 520 classes have 1100 samples each and the remaining 1830 classes (2350 - 520) have 100 samples each.
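The sample counts implied by the list above can be checked with a few lines of arithmetic. This sketch assumes that every SERI95 class also appears in PE92, which is what the "520 classes have 1100 samples" statement suggests. The variable names are mine.

```python
PE92_CLASSES, PE92_PER_CLASS = 2350, 100
SERI95_CLASSES, SERI95_PER_CLASS = 520, 1000

pe92_total = PE92_CLASSES * PE92_PER_CLASS        # 235000
seri95_total = SERI95_CLASSES * SERI95_PER_CLASS  # 520000

# Assumption: all SERI95 classes are also PE92 classes.
shared_classes = SERI95_CLASSES
pe92_only_classes = PE92_CLASSES - SERI95_CLASSES  # 1830

handb_total = (
    shared_classes * (PE92_PER_CLASS + SERI95_PER_CLASS)
    + pe92_only_classes * PE92_PER_CLASS
)

# The merged dataset should hold exactly the samples of both sources.
assert handb_total == pe92_total + seri95_total
print(pe92_only_classes, handb_total)
```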
Architecture
All three datasets share the same structure:
<dataset_name>/<label>/<sample_index>.jpg
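Given this layout, the label of every image is simply the name of its parent directory. Here is a minimal sketch of how you might index a dataset. The helper name `list_samples` is hypothetical and not part of this repo.

```python
from pathlib import Path


def list_samples(dataset_root):
    """Return a list of (image_path, label) pairs for a directory laid
    out as <dataset_name>/<label>/<sample_index>.jpg."""
    root = Path(dataset_root)
    samples = []
    # Each subdirectory of the dataset root is one class label.
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(label_dir.glob("*.jpg")):
            samples.append((image_path, label_dir.name))
    return samples
```

For example, `list_samples("PE92")` would return one pair per jpg file, assuming the dataset was extracted into a `PE92` directory.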
Warning
PE92 contains some mislabeled samples among the last few samples of each class.
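If the mislabeled samples matter for your use case, one option is to skip the trailing samples of each class. The sketch below assumes that `<sample_index>` is an integer, and `n_drop=5` is an arbitrary placeholder because I have not measured how many samples per class are affected. The helper name `drop_trailing_samples` is hypothetical.

```python
from pathlib import Path


def drop_trailing_samples(label_dir, n_drop=5):
    """Return the jpg paths of one class directory, sorted by numeric
    sample index, without the last n_drop samples.
    n_drop is a placeholder value, not a measured one."""
    if n_drop < 0:
        raise ValueError("n_drop must be non-negative")
    # Sort numerically so that 10.jpg comes after 9.jpg, not after 1.jpg.
    paths = sorted(Path(label_dir).glob("*.jpg"), key=lambda p: int(p.stem))
    keep = max(0, len(paths) - n_drop)
    return paths[:keep]
```

Inspecting a few classes by eye first is probably the safest way to pick a sensible `n_drop`.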
parser.ipynb parses an hgu1 file into multiple jpg files.
You can test whether it correctly parses the original dataset using parser.ipynb.
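As an additional sanity check outside the notebook, you could compare the number of classes and samples on disk with the numbers listed above. This is only a sketch and not what parser.ipynb itself does. The function names `count_samples_per_class` and `check_counts` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_samples_per_class(dataset_root):
    """Count the jpg files in every label directory of a dataset."""
    counts = Counter()
    for label_dir in Path(dataset_root).iterdir():
        if label_dir.is_dir():
            counts[label_dir.name] = sum(1 for _ in label_dir.glob("*.jpg"))
    return counts


def check_counts(counts, expected_classes, expected_per_class):
    """Return a list of human-readable mismatches (empty if all is fine)."""
    problems = []
    if len(counts) != expected_classes:
        problems.append(
            f"expected {expected_classes} classes, found {len(counts)}"
        )
    for label, n in sorted(counts.items()):
        if n != expected_per_class:
            problems.append(
                f"class {label!r}: expected {expected_per_class}, found {n}"
            )
    return problems
```

For PE92, the expected values would be 2350 classes and 100 samples per class; for SERI95, 520 classes and 1000 samples per class. HanDB has two different per-class counts, so it would need a slightly different check.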