Welcome to the repository for our research paper on automating data extraction from macular cube spectral domain optical coherence tomography (SD-OCT) reports using optical character recognition (OCR) and deep learning. The algorithm we developed, named OCTess (a portmanteau of OCT and Tesseract), is a highly accurate, efficient, and time-saving alternative to manual data extraction.
In this study, we focused on developing an OCR algorithm, OCTess, to automatically extract clinical and demographic data from Cirrus SD-OCT macular cube reports. Our algorithm uses multiple models from Tesseract, an open-source OCR library, and relies on pixel-based bounding box coordinates for each field of interest in the macular cube report. Each field's region is passed through a series of image processing operations and then converted to text.
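The bounding-box approach described above can be illustrated with a minimal sketch. It uses plain Python lists in place of a real image so that it runs without extra dependencies. The field names, coordinates, and threshold are hypothetical and are not the values OCTess uses. In practice, each prepared region would be handed to Tesseract, for example through `pytesseract.image_to_string`, which is an assumption about the tooling rather than a description of the repository's code.

```python
# Hypothetical field map: name -> (left, top, right, bottom) in pixels.
# These coordinates are illustrative only, not the ones OCTess uses.
FIELD_BOXES = {
    "patient_id": (2, 0, 6, 2),
    "central_thickness": (0, 2, 4, 4),
}


def crop(image, box):
    """Return the sub-image inside box; right and bottom edges are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def binarize(image, threshold=128):
    """Map each grayscale pixel to 0 (dark) or 255 (light)."""
    return [[0 if px < threshold else 255 for px in row] for row in image]


def field_regions(image, boxes=FIELD_BOXES):
    """Crop and binarize every field region, ready to pass to an OCR engine."""
    return {name: binarize(crop(image, box)) for name, box in boxes.items()}


# A tiny 4x6 grayscale grid stands in for a real report page.
page = [
    [250, 250, 30, 40, 250, 250],
    [250, 250, 35, 200, 250, 250],
    [20, 240, 240, 10, 250, 250],
    [25, 240, 240, 15, 250, 250],
]
regions = field_regions(page)
```

Because every report shares the same layout, a fixed map of boxes lets each field be recognized in isolation, which is also what makes it possible to apply a different model or preprocessing step per field.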
OCTess extracts SD-OCT macular cube data with near-perfect accuracy, equivalent to that of a human, while being significantly more efficient.
To use OCTess, please follow these steps:
- Clone this repository
- Ensure you have the required dependencies installed, as listed in requirements.txt
- Move your Cirrus SD-OCT PDF/PNG files into the Input/ directory. Alternatively, you can use the 5 example files that are already provided
- Run the bash script ./run.sh to execute the OCR algorithm and validate the results using the provided dataset
- Input/: Input your raw SD-OCT macular cube reports in this directory. Delete the example files if you do not need them
- tessdata/: Directory of saved Tesseract deep learning and legacy models
- patterns/: Regex pattern rules used for data extraction
- pdf_to_img.py: Python script to convert PDF files to PNG format (if they are not already PNG)
- extract_OCT.py: Python script to extract data from each PNG file, organize it into a table and generate OCTess.xlsx
- verify_OCT.py: Python script that performs a series of verifications and highlights regions of OCTess.xlsx that may be erroneous
- requirements.txt: Lists the necessary dependencies for this project
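To illustrate how regex pattern rules and a verification pass can work together, here is a minimal sketch. The patterns, field names, and the 100 to 1000 plausibility range are hypothetical. The actual rules live in patterns/ and the actual checks in verify_OCT.py, and both may differ from what is shown.

```python
import re

# Hypothetical patterns; the real rules live in patterns/ and may differ.
PATTERNS = {
    "exam_date": re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})"),
    "central_thickness_um": re.compile(r"\d{2,4}"),
}


def extract(field, raw_text):
    """Return the first match of the field's pattern in OCR output, or None."""
    match = PATTERNS[field].search(raw_text)
    return match.group(0) if match else None


def needs_review(field, value, low=100, high=1000):
    """Flag values that are missing or outside an assumed plausible range."""
    if value is None:
        return True
    if field == "central_thickness_um":
        return not (low <= int(value) <= high)
    return False


# Simulated OCR output: the thickness contains a misread character ("B").
raw = {"exam_date": "Exam Date: 3/14/2022", "central_thickness_um": "2B7"}
row = {field: extract(field, text) for field, text in raw.items()}
flags = {field: needs_review(field, value) for field, value in row.items()}
```

Here the date is extracted cleanly, while the misread thickness fails its pattern and is flagged, which mirrors the idea of highlighting cells for manual review instead of silently accepting them.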
We welcome contributions to improve the algorithm or expand its applicability. Please feel free to submit issues, pull requests, or contact the authors directly.
Michael Balas: [email protected]
Rajeev H. Muni: [email protected]
This project is licensed under the GNU GPLv3 License. See the LICENSE file for details.
If you use this code or the results from our research paper, please cite our work:
Balas, M., Herman, J., Bhambra, N., Longwell, J., Popovic, M., Melo, I., & Muni, R. (2023). OCTess: An Optical Character Recognition Algorithm for Automated Data Extraction of Spectral Domain Optical Coherence Tomography Reports. RETINA. https://doi.org/10.0000/00000