In OCR pipelines, line segmentation models split images of text into individual lines. These models are not always accurate and sometimes produce poorly cropped line images. To catch such cases, I developed a package that classifies line images as either good or bad.
Clustering is an unsupervised machine learning technique that groups similar data points. This package applies K-means clustering to OCR line images so that poorly cropped images can be detected automatically, improving the efficiency and accuracy of digital document processing without requiring labeled training data.
pip install git+https://github.com/tenzin3/line_image_classifier.git
In most cases, correctly segmented line images have similar dimensions, so clustering images by size is an efficient first check. This method works well for identifying outliers whose dimensions differ significantly from the rest.
from pathlib import Path
from line_image_classifier.pipeline import classify_with_size
images = list(Path("ocr_images").rglob("*.jpg"))  # collect line images; adjust the extension to match your files
output_path = Path('size_based_clusters.json')
classify_with_size(images, output_path)
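To make the idea concrete, here is a minimal, standard-library-only sketch of K-means over (width, height) pairs. It is not the package's implementation: the `kmeans` helper, the farthest-point initialisation, and the sample sizes are all assumptions made for illustration. In practice the sizes could be read with Pillow's `Image.open(path).size`.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain K-means over tuples of numbers; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an illustrative choice,
    # not necessarily what the package does).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels


# Hypothetical (width, height) pairs: five normal lines and one oversized crop.
sizes = [(1200, 60), (1190, 62), (1210, 58), (1205, 61), (1200, 400), (1195, 59)]
labels = kmeans(sizes, k=2)

# Treat the smallest cluster as the outlier candidates (a heuristic assumption).
smallest = min(set(labels), key=labels.count)
outliers = [size for size, label in zip(sizes, labels) if label == smallest]
```

Treating the smallest cluster as the "bad" group is only a heuristic; the clusters still need a visual check before any images are discarded.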
In some cases, bad line images have dimensions similar to good ones but are still unusable because of issues such as rotation or excessive zoom. Size alone cannot separate these cases, so classification based on image features is needed. VGG16 is used to extract image features, PCA reduces their dimensionality, and clustering then groups similar images.
from pathlib import Path
from line_image_classifier.pipeline import classify_with_feature
images = list(Path("ocr_images").rglob("*.jpg"))  # collect line images; adjust the extension to match your files
output_path = Path('feature_based_clusters.json')
classify_with_feature(images, output_path)
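The reduce-then-cluster part of this pipeline can be sketched with NumPy alone. This is a rough illustration under stated assumptions, not the package's code: `pca_reduce` and `kmeans` are hypothetical helpers, and random vectors stand in for the CNN features. With Keras, one possible way to obtain real features is `VGG16(weights="imagenet", include_top=False, pooling="avg")`, which yields one 512-dimensional vector per image.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row vectors onto their top principal components (PCA via SVD)."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids)
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels


# Synthetic stand-ins for CNN features: 20 "typical" images and 5 that differ.
rng = np.random.default_rng(0)
typical = rng.normal(0.0, 0.1, size=(20, 512))
unusual = rng.normal(3.0, 0.1, size=(5, 512))
features = np.vstack([typical, unusual])

reduced = pca_reduce(features, n_components=5)
labels = kmeans(reduced, k=2)
```

Reducing to a handful of components before clustering keeps the distance computations cheap and discards low-variance directions; the number of components and clusters used here is arbitrary.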
The output consists of a JSON file, with cluster numbers as keys and lists of the image paths belonging to each cluster as values, and a set of PDF files. Each PDF contains the line images from one cluster (10 images per page), which makes the results easier to inspect visually.
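As a small sketch of how the JSON result might be inspected, the helpers below load the file and rank clusters by size. `load_clusters` and `rank_clusters_by_size` are hypothetical names, not part of the package, and the code assumes only the structure described above (cluster number mapped to a list of image paths). Note that JSON object keys are always strings, so cluster numbers come back as `"0"`, `"1"`, and so on.

```python
import json
from pathlib import Path


def load_clusters(json_path):
    """Read a cluster-result file into a dict of {cluster_key: [image paths]}."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def rank_clusters_by_size(clusters):
    """Return (cluster_key, image_count) pairs, smallest cluster first."""
    counts = ((key, len(paths)) for key, paths in clusters.items())
    return sorted(counts, key=lambda item: item[1])
```

For example, `rank_clusters_by_size(load_clusters("size_based_clusters.json"))` lists the smallest clusters first, which are often the ones worth opening in the corresponding PDF before deciding which images are bad.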