This Python script extracts the text content of every PDF file in a designated folder and saves each result as a separate TXT file named after the original PDF (without the ".pdf" extension).
- Processes multiple PDFs in a single run.
- Preserves the original file names so each TXT file is easy to match to its source PDF.
- Utilizes the well-established PyPDF2 library for robust PDF handling.
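The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual main.py: the function names txt_path_for, extract_pdf_text, and extract_folder are hypothetical, and writing each TXT file next to its PDF is an assumption. It relies on PdfReader and page.extract_text() from PyPDF2 (version 2.0 or later).

```python
from pathlib import Path
from typing import List


def txt_path_for(pdf_path: Path) -> Path:
    """Return the output path: same name as the PDF, with a .txt extension."""
    return pdf_path.with_suffix(".txt")


def extract_pdf_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in one PDF."""
    # Imported here so the path helper above works even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_folder(pdf_folder: str) -> List[Path]:
    """Extract every *.pdf in pdf_folder and return the TXT paths written."""
    written = []
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        out_path = txt_path_for(pdf_path)
        # Assumption: output is written alongside the source PDF.
        out_path.write_text(extract_pdf_text(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Scanned PDFs that contain only images will generally produce empty TXT files with this approach, since no OCR is involved.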
- Python 3.x (https://www.python.org/downloads/)
- PyPDF2 library (installation: pip install PyPDF2)
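To confirm that these requirements are met before running the script, a small check along the following lines may help. It is only a sketch: missing_requirements is a hypothetical helper and is not part of the project.

```python
import importlib.util
import sys


def missing_requirements(modules=("PyPDF2",), min_python=(3, 0)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append("missing module: " + name)
    return problems
```

Calling missing_requirements() with no arguments checks for PyPDF2 under Python 3; any problems reported can be fixed with the pip command listed above.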
- Clone the Repository:
  git clone https://github.com/akumathedyn123/python-pdf-extractor-pdf2txt.git
- Navigate to the Project Directory (git clone names the directory after the repository):
  cd python-pdf-extractor-pdf2txt
- Set the PDF Folder Path:
  - Open the main.py file in a text editor.
  - Locate the line that defines the pdf_folder variable (usually near the beginning).
  - Replace "path/to/folder" with the absolute path to the directory containing your PDF files.
  Example: If your PDFs are in a folder named my_pdfs on your desktop, you would change the line to:
  pdf_folder = os.path.join(os.path.expanduser('~'), 'Desktop', 'my_pdfs')
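A mistyped folder path is an easy mistake at this step. One possible way to catch it early is sketched below; resolve_pdf_folder is a hypothetical helper, not something defined in main.py.

```python
import os


def resolve_pdf_folder(path):
    """Expand '~', make the path absolute, and check that it is a directory."""
    folder = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(folder):
        raise NotADirectoryError("PDF folder not found: " + folder)
    return folder


# Hypothetical usage, mirroring the desktop example:
# pdf_folder = resolve_pdf_folder(os.path.join("~", "Desktop", "my_pdfs"))
```

Failing immediately with a clear message is usually easier to diagnose than a script that runs and silently finds no PDFs.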
- Run the Script:
  - From the project directory (where main.py is located), execute the script using the following command:
    python main.py
  Note: If you're using Python 3, you might need to replace python with python3, depending on your system setup.
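After a run, it can be useful to check that every PDF produced a TXT file. The sketch below assumes the TXT files are written to the same folder as the PDFs, which this README does not state explicitly; pdfs_without_txt is a hypothetical helper.

```python
from pathlib import Path


def pdfs_without_txt(pdf_folder):
    """Return names of PDFs in pdf_folder that have no matching .txt file."""
    folder = Path(pdf_folder)
    return sorted(
        pdf.name
        for pdf in folder.glob("*.pdf")
        if not pdf.with_suffix(".txt").exists()
    )
```

An empty list means every PDF in the folder has a TXT file with the same base name.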
This project is licensed under the MIT License (see LICENSE file for details).
We encourage contributions to this project. Feel free to submit pull requests for bug fixes, enhancements, or new features.
For any questions or feedback, please feel free to create an issue on the project's GitHub repository.