generated from nmfs-opensci/NOAA-quarto-simple
Showing 2 changed files with 30 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
--- | ||
title: "PyTesseract" | ||
subtitle: "Improving OCR'd Student Papers to enhance citation detection for removal" | ||
page-layout: full | ||
|
||
--- | ||
|
||
## Motivation for PyTesseract Approach | ||
|
||
The “Citation variations” approach is yielding about a 20% failure rate, mostly due to OCR errors that throw off the text-matching to citation terms. | ||
We discussed the idea of starting with images and trying to optimize the OCR results, rather than reverse engineer fixes to deal with poor OCR. | ||
As a result, we decided to use OpenCV to enhance the page images of the student papers, in the hope of improving detection of citation pages so that they can be removed before the ingestion process. | ||
|
||
OpenCV (Open Source Computer Vision Library) is an open source library of programming functions aimed at real-time computer vision. Computer vision tasks include acquiring, processing, and analyzing digital images and extracting data from them to produce numerical or symbolic information. Alex had not used OpenCV before, so he read the documentation and went through the OpenCV Bootcamp, a 3-hour course on manipulating images and videos and detecting objects and faces. | ||
|
||
Optical Character Recognition (OCR) is the technology that converts images of typed, handwritten, or printed text into machine-encoded text. The OCR process generally consists of several sub-processes: | ||
|
||
1. Pre-processing of the image
2. Text localization
3. Character segmentation
4. Character recognition
5. Post-processing
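
As an illustration of the first sub-process, the pre-processing step might be sketched with OpenCV as below. This assumes the `opencv-python` package is installed. The function name `preprocess_for_ocr`, the 3x3 median blur, and the choice of Otsu binarization are assumptions made for the example, not the settings used in this project.

```python
from pathlib import Path


def preprocess_for_ocr(image_path):
    """Return a denoised, binarized copy of a scanned page (illustrative sketch)."""
    path = Path(image_path)
    # cv2.imread returns None instead of raising for a missing file, so check first.
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")

    # Imported here so this module can still be imported where OpenCV is absent.
    import cv2

    image = cv2.imread(str(path))
    if image is None:
        raise ValueError(f"OpenCV could not decode {path}")

    # Drop color information; OCR engines generally work on intensity only.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumed 3x3 median filter to suppress speckle noise from scanning.
    denoised = cv2.medianBlur(gray, 3)
    # Otsu's method picks the black/white threshold from the image histogram.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

The returned array can be written back to disk with `cv2.imwrite` for visual inspection, or handed directly to the recognition step.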
|
||
Many OCR tools are available, and one of the most popular is Tesseract. Python Tesseract (Pytesseract) is a Python library that serves as a wrapper for Google’s Tesseract-OCR engine, allowing developers to call the engine from Python. |
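
A minimal sketch of calling Tesseract through Pytesseract is shown below. It assumes the `pytesseract` package is installed and the Tesseract binary is on the system path. The helper names `build_tesseract_config` and `ocr_page`, and the default page segmentation mode (`--psm 6`, a single uniform block of text) and engine mode (`--oem 3`, the default engine), are assumptions chosen for illustration rather than settings taken from this project.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build the command-line option string passed through to Tesseract."""
    # Tesseract accepts page segmentation modes 0-13 and engine modes 0-3.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


def ocr_page(image, lang="eng", psm=6, oem=3):
    """Run OCR on one page; `image` may be a file path, PIL image, or NumPy array."""
    # Imported here so the config helper works even where pytesseract is absent.
    import pytesseract

    config = build_tesseract_config(psm=psm, oem=oem)
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

For example, `ocr_page("page_001.png")` would return the recognized text of a hypothetical page image as a string, which could then be searched for citation terms.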