Commit

Change pdf extractor for using with HF
Maclenn77 committed Dec 8, 2023
1 parent 438a5f9 commit 17cda42
Showing 4 changed files with 13 additions and 7 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -24,5 +24,5 @@ Run streamlit run app.py
 
 - Streamlit
 - HuggingFace
-- Tika: For extracting pdf text
-- Java Runtime
+- pymupdf for pdf extraction
+- An open ai openapi key
12 changes: 9 additions & 3 deletions app.py
@@ -1,12 +1,18 @@
 """ A simple example of Streamlit. """
 import streamlit as st
-from tika import parser
+import fitz
+
+# from tika import parser
+# from openai import OpenAI
+
+# client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
 
 pdf = st.file_uploader("Upload a file", type="pdf")
 
 if st.button("Extract text"):
     if pdf is not None:
-        extracted_text = parser.from_file(pdf)
-        st.write(extracted_text["content"])
+        with fitz.open(stream=pdf.read(), filetype="pdf") as doc:  # open document
+            text = chr(12).join([page.get_text() for page in doc])
+        st.write(text)
     else:
         st.write("Please upload a file of type: pdf")
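
The new extraction path joins the text of each page with `chr(12)` (the form feed character), so page boundaries survive in the single string passed to `st.write`. The sketch below is one possible way to factor that step out of the Streamlit callback and to recover the pages afterwards. The names `FORM_FEED`, `extract_text`, and `split_pages` are hypothetical and do not appear in this commit, and the split assumes the extracted page text contains no form feeds of its own.

```python
from typing import List

FORM_FEED = chr(12)  # "\f", the page separator used in app.py


def extract_text(pdf_bytes: bytes) -> str:
    """Return the text of every page, joined by form feeds (requires pymupdf)."""
    import fitz  # imported lazily so split_pages works without pymupdf installed

    # Same call shape as in app.py: open the PDF from an in-memory byte string.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return FORM_FEED.join(page.get_text() for page in doc)


def split_pages(text: str) -> List[str]:
    """Recover per-page text from a form-feed-joined string (hypothetical helper)."""
    return text.split(FORM_FEED) if text else []


# Usage without a real PDF: join two fake pages, then split them again.
joined = FORM_FEED.join(["page one", "page two"])
pages = split_pages(joined)
```

In the app, `extract_text(pdf.read())` would stand in for the inline `with` block, and `split_pages` would only be needed if later steps process one page at a time.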
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,6 +1,6 @@
 openai
 langchain
-tika
+pymupdf
 chromadb
 sentence_transformers
 streamlit
2 changes: 1 addition & 1 deletion wk_flow_requirements.txt
@@ -1,3 +1,3 @@
 streamlit
-tika
+pymupdf
 pylint
