Data extraction from a PDF file

A script that extracts text data from a PDF file, converts it to a pandas DataFrame, and saves it to a CSV file.
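The three steps above (extract text, build a DataFrame, write a CSV) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the `label`/`value` column names, and the rule of treating the last token on each line as the value are all assumptions made for the example. Only `pdfplumber.open`, `page.extract_text`, `pandas.DataFrame`, and `DataFrame.to_csv` are real library calls.

```python
def extract_lines(pdf_path):
    """Return the non-empty text lines found on every page of the PDF."""
    import pdfplumber  # imported lazily so the pure helper below has no dependencies

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # extract_text() may return None
            for line in text.splitlines():
                if line.strip():
                    lines.append(line.strip())
    return lines


def split_label_value(line):
    """Hypothetical parsing rule: the last space-separated token is the value."""
    label, _, value = line.rpartition(" ")
    return {"label": label, "value": value}


def save_as_csv(lines, csv_path):
    """Convert parsed lines to a pandas DataFrame and write it to a CSV file."""
    import pandas as pd

    frame = pd.DataFrame([split_label_value(line) for line in lines])
    frame.to_csv(csv_path, index=False)  # index=False omits the row-number column
    return frame
```

A possible call would be `save_as_csv(extract_lines("script/ITRCAnnualReportPdf2019.pdf"), "output.csv")`, where the output file name is again only an example. The real parsing rule depends on the layout of the report pages.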

Installation

Clone the repository:
git clone https://github.com/serhatci/data-extraction-from-pdf.git

Install the requirements:
pip install -r requirements.txt

Make sure the following PDF files are in the script folder:
ITRCAnnualReportPdf2019.pdf
ITRCAnnualReportPdf2018.pdf

Then run the application:
python script/pdf_data_extractor.py
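
Before running the script, it may help to confirm that both PDF files are in place. The sketch below is not part of the repository. It assumes the command is run from the repository root, so the folder is `script`, and the helper name `missing_pdfs` is made up for the example.

```python
from pathlib import Path

# File names taken from the list of required PDF files.
REQUIRED_PDFS = (
    "ITRCAnnualReportPdf2019.pdf",
    "ITRCAnnualReportPdf2018.pdf",
)


def missing_pdfs(folder):
    """Return the required PDF names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_PDFS if not (folder / name).is_file()]


if __name__ == "__main__":
    # "script" is an assumed relative path to the script folder.
    missing = missing_pdfs("script")
    if missing:
        print("Missing PDF files:", ", ".join(missing))
    else:
        print("All required PDF files are present.")
```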

The script works with Python 3.7 or higher.
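
If a clear error on older interpreters is wanted, a version guard along these lines could be placed at the top of a script. This is only a sketch: the repository may not contain such a check, and `MIN_PYTHON` and `is_supported` are names chosen for the example.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in this README


def is_supported(version_info=sys.version_info):
    """Return True when the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not is_supported():
    raise SystemExit("Python 3.7 or higher is required.")
```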

The following libraries should be installed:

pip install pdfplumber~=0.5.25
pip install pandas~=0.25.1

The image below shows the format of the PDF file and the extracted data in the CSV file.
