GitHub - tenzin3/line_image_classifier: Classifying bad line images from good line images.

Line Image Classifier

Introduction

In OCR, line segmentation models divide images containing text into individual line images. These models are not always accurate, and some of the resulting line images are poorly cropped. To address this, I developed a package that classifies line images as either good or bad based on their quality.

Description

Clustering is an unsupervised machine learning technique used to group similar data points. In this package, I apply K-means clustering to the line images produced by OCR line segmentation. This enables automated detection of poorly cropped images without requiring labeled training data, improving the efficiency and accuracy of digital document processing.

Input: Pecha image:

Line image detection model outputs:

Installation

pip install git+https://github.com/tenzin3/line_image_classifier.git

Classify based on Image Size

In most cases, desired line image outputs have similar dimensions, making it efficient to cluster images based on their size. This method works well for identifying outliers with significantly different dimensions.

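from pathlib import Path

from line_image_classifier.pipeline import classify_with_size

# Collect the line images to classify (adjust the pattern to your image format).
images = list(Path("ocr_images").rglob("*.jpg"))
# Cluster assignments are written to this JSON file.
output_path = Path("size_based_clusters.json")
classify_with_size(images, output_path)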
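To make the idea concrete, here is a minimal sketch of clustering images by their dimensions. It is not the package's implementation: `kmeans` and `cluster_by_size` are hypothetical helpers, the K-means is a tiny pure-Python version with a deterministic initialisation, and the file names and sizes are made up for illustration.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Tiny K-means over tuples of numbers; returns one cluster label per point."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def cluster_by_size(sizes, k=2):
    """Group image names by cluster, given a {name: (width, height)} mapping."""
    names = list(sizes)
    labels = kmeans([sizes[name] for name in names], k)
    clusters = {}
    for name, label in zip(names, labels):
        clusters.setdefault(label, []).append(name)
    return clusters


# Hypothetical sizes: three typical line crops and one suspiciously tall crop.
sizes = {
    "line_01.jpg": (1200, 80),
    "line_02.jpg": (1180, 85),
    "line_03.jpg": (1210, 78),
    "line_04.jpg": (1190, 400),
}
clusters = cluster_by_size(sizes, k=2)
print(clusters)
```

With these made-up sizes, the tall crop ends up alone in its own cluster, which is the kind of outlier this method is meant to surface.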

Classify based on Image Features

In some cases, bad line images have dimensions similar to good ones but are still incorrect because of issues like rotation or excessive zoom. For these scenarios, classification based on image features is essential. The pipeline extracts image features with VGG16, reduces their dimensionality with PCA, and then clusters the reduced features to group similar images.

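from pathlib import Path

from line_image_classifier.pipeline import classify_with_feature

# Collect the line images to classify (adjust the pattern to your image format).
images = list(Path("ocr_images").rglob("*.jpg"))
# Cluster assignments are written to this JSON file.
output_path = Path("feature_based_clusters.json")
classify_with_feature(images, output_path)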
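The dimensionality-reduction step can be illustrated on its own with NumPy. This is a sketch under stated assumptions, not the package's code: `pca_reduce` is a hypothetical helper, and the feature matrix is random data standing in for real VGG16 outputs, so only the shapes are meaningful.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row-vector features onto their top principal components."""
    # Centre each feature dimension so the components capture variance.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in for extracted features: 6 images with 512 values each
# (random and illustrative only; real features would come from VGG16).
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(6, 512))

reduced = pca_reduce(features, n_components=2)
print(reduced.shape)  # (6, 2)
```

The reduced vectors would then be passed to a clustering step such as K-means, in the same way image sizes are clustered in the size-based method.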

Output

Each classification run produces two kinds of output. The first is a JSON file with cluster numbers as keys and lists of the image paths belonging to each cluster as values. The second is a set of PDF files, one per cluster, each containing that cluster's line images (10 images per page) so the results are easier to inspect visually.
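As a rough illustration of working with the JSON output, the sketch below parses a cluster mapping and picks out the smallest cluster. The sample contents and file names are hypothetical, and treating the smallest cluster as the likely "bad" one is only a heuristic assumption, not something the package guarantees.

```python
import json

# Hypothetical contents in the format described above (cluster number -> image paths).
sample = '{"0": ["a.jpg", "b.jpg", "c.jpg"], "1": ["d.jpg"]}'
clusters = json.loads(sample)

# JSON object keys are strings, so cluster numbers come back as "0", "1", ...
# Heuristic: the smallest cluster is a reasonable first place to look for bad crops.
smallest = min(clusters, key=lambda key: len(clusters[key]))
print(smallest, clusters[smallest])
```

For a real run, the same logic would apply to the result of `json.load` on the output file, with the per-cluster PDFs used to confirm which cluster actually holds the bad images.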
