Pushed PyTesseract OCR outputs

Miller-Library · Aug 12, 2024 · 97e037b
1 parent 3f098b8
commit 97e037b

Showing 2 changed files with 30 additions and 2 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -51,8 +51,11 @@ website:
         contents:
           - href: content/Text_Difference_Checker.ipynb
             text: difflib Transkribus Output Text Checker Notebook
-      - href: content/PyTesseract_OCR.ipynb
-        text: PyTesseract_OCR - Improving OCR'd Papers
+      - href: contents/
+        text: PyTesseract - Improving OCR'd Papers
+        contents:
+          - href: content/PyTesseract_OCR.ipynb
+            text: PyTesseract_OCR - Improving OCR'd Papers
 
 format:
   html:

diff --git a/content/pytesseract.qmd b/content/pytesseract.qmd
@@ -0,0 +1,25 @@
+---
+title: "PyTesseract"
+subtitle: "Improving OCR'd Student Papers to enhance citation detection for removal"
+page-layout: full
+
+---
+
+## Motivation for PyTesseract Approach
+
+The “Citation variations” approach is yielding a failure rate of about 20%, mostly due to OCR errors that throw off the matching of text against citation terms.
+We discussed starting from the page images and optimizing the OCR results, rather than reverse-engineering fixes for poor OCR output.
+As a result, we decided to use OpenCV to enhance the images of the student papers before OCR, in the hope of improving the detection of citation pages so that they can be removed prior to the ingestion process.
+
+OpenCV (Open Source Computer Vision Library) is an open-source library of programming functions aimed at real-time computer vision. Computer vision tasks include acquiring, processing, and analyzing digital images, and extracting data from them to produce numerical or symbolic information. Alex had not used OpenCV before, so he read the documentation and went through the OpenCV Bootcamp,
+a 3-hour course on how to manipulate images and videos and detect objects and faces.
+
+Optical Character Recognition (OCR) is the technology that converts typed, handwritten, or printed text in images into machine-encoded text. The OCR process generally consists of several sub-processes:
+
+1. Pre-processing of the image
+2. Text localization
+3. Character segmentation
+4. Character recognition
+5. Post-processing
+
+Many OCR engines are available, but one of the most popular is Tesseract. Python Tesseract (pytesseract) is a Python library that serves as a wrapper for Google’s Tesseract-OCR engine. Essentially, it allows developers to call Tesseract’s OCR engine from Python.
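
Reviewer note (not part of the commit): the page describes pre-processing with OpenCV, recognition through pytesseract, and a post-processing step, but shows no code. The sketch below is one possible shape for that pipeline, not the contents of `PyTesseract_OCR.ipynb`. The function names, the file name `page.png`, the choice of median blur plus Otsu thresholding, and the default `--psm 6` / `--oem 3` settings are all assumptions made for illustration. It also assumes the `opencv-python` and `pytesseract` packages and the Tesseract binary are installed.

```python
import re


def tesseract_config(psm=6, oem=3):
    """Build a Tesseract CLI config string (hypothetical helper).

    psm is the page segmentation mode (0-13); oem is the engine mode (0-3).
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def clean_text(raw):
    """Post-process OCR output (hypothetical helper)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def ocr_page(image_path, lang="eng"):
    """Pre-process one page image with OpenCV, then OCR it with pytesseract."""
    # Imported here so the helpers above work without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Pre-processing (assumed steps): grayscale, light denoise, Otsu binarization.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Localization, segmentation, and recognition are handled inside Tesseract.
    raw = pytesseract.image_to_string(binary, lang=lang, config=tesseract_config())
    return clean_text(raw)
```

Calling `ocr_page("page.png")` would return the cleaned text for a single scanned page. Whether these particular pre-processing steps reduce the citation-matching failures would need to be checked against the actual student papers.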