Commit

Pushed PyTesseract OCR outputs
AlaoSUL committed Aug 12, 2024
1 parent 3f098b8 commit 97e037b
Showing 2 changed files with 30 additions and 2 deletions.
7 changes: 5 additions & 2 deletions _quarto.yml
@@ -51,8 +51,11 @@ website:
contents:
- href: content/Text_Difference_Checker.ipynb
text: difflib Transkribus Output Text Checker Notebook
- href: content/PyTesseract_OCR.ipynb
text: PyTesseract_OCR - Improving OCR'd Papers
- href: contents/
text: PyTesseract - Improving OCR'd Papers
contents:
- href: content/PyTesseract_OCR.ipynb
text: PyTesseract_OCR - Improving OCR'd Papers

format:
html:
25 changes: 25 additions & 0 deletions content/pytesseract.qmd
@@ -0,0 +1,25 @@
---
title: "PyTesseract"
subtitle: "Improving OCR'd Student Papers to enhance citation detection for removal"
page-layout: full

---

## Motivation for PyTesseract Approach

The “Citation variations” approach is yielding a failure rate of about 20%, mostly due to OCR errors that throw off the matching of text against citation terms.
We discussed starting from the page images and optimizing the OCR results, rather than reverse-engineering fixes for poor OCR output.
As a result, we decided to use OpenCV to enhance the images of the student papers before OCR, in the hope of improving detection of citation pages so that they can be removed before the ingestion process.
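
To make the failure mode concrete, here is a small sketch of how a single misread character defeats exact matching against citation terms. The term list and function names are hypothetical stand-ins, not the project's actual matching code, and the fuzzy matcher is shown only as an example of the kind of downstream patch discussed above.

```python
import difflib

# Hypothetical heading terms; the project's real citation-term list is not shown here.
CITATION_TERMS = ["references", "bibliography", "works cited"]


def exact_match(line):
    """Return True if the line is exactly one of the citation terms."""
    return line.strip().lower() in CITATION_TERMS


def fuzzy_match(line, cutoff=0.8):
    """Return True if the line is close to a citation term despite OCR noise."""
    candidate = line.strip().lower()
    matches = difflib.get_close_matches(candidate, CITATION_TERMS, n=1, cutoff=cutoff)
    return bool(matches)


clean = "Bibliography"
garbled = "Bib1iography"  # OCR read the letter "l" as the digit "1"

print(exact_match(clean), exact_match(garbled))
print(fuzzy_match(garbled))
```

Fuzzy matching rescues this one heading, but every new kind of OCR error needs its own tolerance or rule, which is why improving the OCR output itself is attractive.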

OpenCV (Open Source Computer Vision Library) is an open source library of programming functions aimed at real-time computer vision. Computer vision tasks include acquiring, processing, and analyzing digital images, and extracting data from them to produce numerical or symbolic information. Alex had not used OpenCV before, so he read the documentation and worked through the OpenCV Bootcamp, a 3-hour course on how to manipulate images and videos and detect objects and faces.

Optical Character Recognition (OCR) is the technology that converts images of typed, handwritten, or printed text into machine-encoded text. The OCR process generally consists of several sub-processes:

1. Pre-processing of the image
2. Text localization
3. Character segmentation
4. Character recognition
5. Post-processing
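
The first and last sub-processes are the ones a script controls most directly, so they can be sketched as below. This is a minimal sketch, not the notebook's actual code: the function names, the median-blur kernel size, and the cleanup rules are assumptions chosen for illustration. It uses OpenCV's `cv2.imread`, `cv2.cvtColor`, `cv2.medianBlur`, and `cv2.threshold`.

```python
import re


def preprocess(image_path):
    """Pre-processing: load a page image, denoise it, and binarize it for OCR."""
    import cv2  # imported here so postprocess() works without OpenCV installed

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)  # assumed kernel size; tune per scan quality
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def postprocess(raw_text):
    """Post-processing: rejoin words split across lines and tidy whitespace."""
    # Note: this also joins genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `postprocess("cita-\ntion   page")` returns `"citation page"`. Which pre-processing steps actually help depends on the scans, so the sequence above is a starting point to experiment with rather than a fixed recipe.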

Many OCR engines are available, and one of the most popular is Tesseract. Python Tesseract (Pytesseract) is a Python library that serves as a wrapper for Google’s Tesseract-OCR engine. Essentially, it allows developers to call Tesseract’s OCR engine from Python.
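
As a rough illustration of the wrapper, the sketch below passes an image to `pytesseract.image_to_string`. The helper names `build_config` and `image_to_text` are hypothetical; `--oem` and `--psm` are Tesseract's own command-line options for the engine mode and page segmentation mode, and the defaults used here are assumptions rather than settings taken from the notebook. It also assumes that the Tesseract binary and the Pillow package are installed.

```python
def build_config(psm=3, oem=3):
    """Build the option string that pytesseract hands to the Tesseract CLI."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path, lang="eng", psm=3):
    """Run Tesseract on one page image and return the recognized text."""
    # Imported here so build_config() can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        return pytesseract.image_to_string(page, lang=lang, config=build_config(psm))
```

A call such as `image_to_text("page_001.png", psm=6)` (with a hypothetical file name) would treat the page as a single uniform block of text, which may or may not suit a given paper layout.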
