PDF Searcher is a Python project that provides a gRPC API for uploading, searching, and summarizing documents. It includes a gRPC server that handles document operations and can be deployed with Docker. Data is stored in ChromaDB, a vector database. After a document is uploaded through the gRPC API, the project converts the PDF file to text and embeds the text using all-MiniLM-L6-v2, a sentence-transformers model.
Besides uploading documents and adding them to the database, you can call the API to search the database for a query and to summarize your texts.
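To make the search step concrete, here is a minimal sketch of similarity search over embedded text. It is not the project's implementation: the bag-of-words `embed` function is a toy stand-in for all-MiniLM-L6-v2, the in-memory dictionary stands in for ChromaDB, and the names `embed`, `cosine_similarity`, and `search` are hypothetical. It only illustrates the idea of ranking stored texts by their vector similarity to a query.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=1):
    """Return the top_k (document_id, score) pairs, best match first."""
    query_vector = embed(query)
    scored = [
        (doc_id, cosine_similarity(query_vector, embed(text)))
        for doc_id, text in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical sample data; a real deployment would store embedded PDF text.
documents = {
    "doc-1": "Invoices are due within thirty days of receipt.",
    "doc-2": "The vector database stores one embedding per text chunk.",
}
results = search("where are embeddings stored in the database", documents)
print(results[0][0])  # prints doc-2
```

A real sentence-embedding model captures meaning rather than exact word overlap, so it can match a query to text that shares no words with it. The ranking step by similarity score is the part this sketch is meant to show.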
Make sure you have the following prerequisites installed:
- Python 3.8
- Docker
- Other dependencies (specified in requirements.txt)
- Clone the repository:
git clone https://github.com/kian79/PDF-searcher.git
- Navigate to the project directory:
cd PDF-searcher
- Install dependencies:
pip install -r requirements.txt
- Run the gRPC server:
PYTHONPATH=.:.. python grpc_api/server.py
The server should be running on localhost:50051.
- Build the Docker image:
docker build -t pdf_searcher .
- Run the Docker container:
docker run -p 50051:50051 pdf_searcher
The gRPC server should be accessible on localhost:50051.
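Whether the server was started directly or through Docker, a quick way to check that something is listening on the port is a plain TCP connection attempt. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper name, and a successful connection only shows that the port accepts connections, not that the gRPC service itself is healthy.

```python
import socket


def is_port_open(host="localhost", port=50051, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


state = "reachable" if is_port_open() else "not reachable"
print(f"localhost:50051 is {state}")
```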
To interact with the Document Service, you can use the provided gRPC client script or integrate the service into your own Python applications.
Example usage in Python client:
import grpc
import document_service_pb2 as pb2
import document_service_pb2_grpc as pb2_grpc


def upload_document(file_content, document_name):
    # Open an insecure (plaintext) channel to the local gRPC server.
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = pb2_grpc.DocumentServiceStub(channel)
        request = pb2.UploadRequest(file_content=file_content, document_name=document_name)
        response = stub.UploadDocument(request)
        return response.document_id


# Other client functions...

# Example usage:
with open("path/to/your/document.pdf", "rb") as file:
    pdf_content = file.read()
# upload_document requires a document name as its second argument.
document_id = upload_document(pdf_content, "document.pdf")
print(f"Uploaded document with ID: {document_id}")
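Since `upload_document` takes both the raw bytes and a document name, it can be convenient to derive both from a file path and reject obvious non-PDF input before sending anything to the server. The following is a minimal sketch under that assumption; `load_pdf` is a hypothetical helper that is not part of the project, and the check only looks for the `%PDF-` header at the start of the file.

```python
from pathlib import Path


def load_pdf(path):
    """Read a file and return (file_content, document_name) for upload.

    Raises ValueError if the file does not start with the PDF header.
    """
    pdf_path = Path(path)
    file_content = pdf_path.read_bytes()
    # PDF files begin with the marker "%PDF-" followed by a version number.
    if not file_content.startswith(b"%PDF-"):
        raise ValueError(f"{pdf_path} does not look like a PDF file")
    return file_content, pdf_path.name
```

The returned pair can be passed straight through, for example `upload_document(*load_pdf("path/to/your/document.pdf"))`.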