tenzin3/line_image_classifier

Line Image Classifier

Introduction

In OCR pipelines, line segmentation models split an image of text into individual line images. These models are not always accurate, so I developed this package to classify line images as good or bad based on their quality.

Description

Clustering is an unsupervised machine learning technique that groups similar data points. This package applies K-means clustering to the line images produced during OCR. Because clustering needs no labeled training data, poorly cropped images can be detected automatically, which improves the efficiency and accuracy of digital document processing.

Input: Pecha image

[Image: I1KG812750008]

Line image detection model outputs:

[Images: eight line images]

Installation

pip install git+https://github.com/tenzin3/line_image_classifier.git

Classify based on Image Size

In most cases, good line images have similar dimensions, so clustering images by size is an efficient first check. This method works well for identifying outliers whose dimensions differ significantly from the rest.

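from pathlib import Path
from line_image_classifier.pipeline import classify_with_size

# Collect the input files recursively; adjust the glob pattern to match your files.
images = list(Path("ocr_images").rglob("*.json"))
# The cluster assignments are written to this JSON file.
output_path = Path("size_based_clusters.json")
classify_with_size(images, output_path)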
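To illustrate the idea behind size-based clustering, here is a minimal sketch of K-means on (width, height) pairs in plain Python. It is not the package's implementation: the function name `kmeans_sizes`, the farthest-point initialization, and the sample sizes are all assumptions made for this example.

```python
def kmeans_sizes(sizes, k=2, iterations=20):
    """Group (width, height) pairs into k clusters with a plain K-means loop."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [sizes[0]]
    while len(centroids) < k:
        centroids.append(
            max(sizes, key=lambda s: min(dist2(s, c) for c in centroids))
        )

    labels = [0] * len(sizes)
    for _ in range(iterations):
        # Assignment step: each size goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist2(s, centroids[j])) for s in sizes
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(sizes, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Hypothetical line image sizes: four similar lines and one badly cropped outlier.
sizes = [(1000, 60), (1010, 62), (990, 58), (1005, 61), (400, 200)]
labels = kmeans_sizes(sizes, k=2)
```

With these sample sizes, the four similar images share one cluster and the outlier ends up alone in the other, which is the pattern that makes size-based clustering useful for spotting bad crops.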

Classify based on Image Features

In some cases, bad line images have dimensions similar to good ones but are still incorrect because of issues such as rotation or excessive zoom. These cases call for classification based on image features: VGG16 extracts the features, PCA reduces their dimensionality, and clustering then groups similar images.

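from pathlib import Path
from line_image_classifier.pipeline import classify_with_feature

# Collect the input files recursively; adjust the glob pattern to match your files.
images = list(Path("ocr_images").rglob("*.json"))
# The cluster assignments are written to this JSON file.
output_path = Path("feature_based_clusters.json")
classify_with_feature(images, output_path)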
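The dimensionality reduction step can be sketched with NumPy alone. The example below is only an illustration under stated assumptions: `pca_reduce` is a hypothetical helper, the random vectors stand in for VGG16 features (feature extraction is omitted here), and the package itself may implement PCA differently.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project feature vectors onto their top principal components."""
    # Center the data so the components describe variance around the mean.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in data: 12 random 64-dimensional vectors instead of real VGG16 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 64))
reduced = pca_reduce(features, n_components=3)
```

The reduced vectors can then be passed to a clustering step in the same way as the image sizes, with far fewer dimensions to compare.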

Output

A JSON file with cluster numbers as keys and, as values, the lists of image paths belonging to each cluster. In addition, one PDF file is produced per cluster, containing that cluster's line images (10 images per page), so the results are easier to inspect visually.
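As a rough sketch of how the JSON output might be inspected, the snippet below counts the images in each cluster and lists the smallest clusters first. The helper `summarize_clusters` and the sample paths are hypothetical, and treating small clusters as likely bad crops is an assumption that should be checked against the PDFs.

```python
import json


def summarize_clusters(json_text):
    """Return (cluster_id, image_count) pairs, smallest cluster first."""
    clusters = json.loads(json_text)
    return sorted(
        ((cluster_id, len(paths)) for cluster_id, paths in clusters.items()),
        key=lambda item: item[1],
    )


# Hypothetical output in the documented shape: cluster number -> image paths.
example = json.dumps({"0": ["a.jpg", "b.jpg", "c.jpg"], "1": ["d.jpg"]})
summary = summarize_clusters(example)
```

For a real run, the text would come from the output file, for example `Path("size_based_clusters.json").read_text()`.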
