Handle single files, pdfs, errors from missing loader dependencies in `/learn` (#733)

* Handle single files, pdfs, errors

(1) Enables handling single files, not just directories.
(2) Learns PDFs with langchain's PyPDFLoader.
(3) Gives a clean error, without a traceback, when the file type being handled needs additional packages.
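
As a rough illustration of points (1) and (3), the sketch below accepts either a single file or a directory, and turns a missing optional package into a short, readable error. The names `collect_files` and `require_package` are hypothetical and are not taken from `learn.py`; only the standard library is used.

```python
import importlib
import os


def collect_files(path):
    """Return the files under `path`, which may be a file or a directory."""
    if os.path.isfile(path):
        # A single file is treated as a one-element listing.
        return [path]
    found = []
    for root, _dirs, names in os.walk(path):
        for name in names:
            found.append(os.path.join(root, name))
    return sorted(found)


def require_package(name, file_type):
    """Import `name`, or raise a concise error naming the missing package."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        # Hypothetical wording; the real message comes from the loader.
        raise RuntimeError(
            f"Handling {file_type} files requires the `{name}` package."
        ) from e
```

A caller could show `str(error)` to the user instead of the full traceback, which is the behavior item (3) describes.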

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* error handling for missing packages in learn.py

Removed the extra attribute and additional response comments based on feedback from Piyush Jain and Andrii Ieroshenko

* Amend error message for failure in learn.py

Made the error message more generic as there are many different failure types.
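
The generic message relies on a try/except/else shape: the success reply is built only when learning raised nothing. A minimal sketch of that control flow follows, where `run_learn` and the injected `learn` callable are hypothetical stand-ins for the chat handler and `learn_dir`, and the reply strings are shortened.

```python
import asyncio


async def run_learn(learn, load_path):
    """Run `learn(load_path)` and return a user-facing reply string."""
    try:
        await learn(load_path)
    except Exception as e:
        # Any failure becomes a short message instead of a traceback.
        return f"Learn documents in **{load_path}** failed. {e}."
    else:
        # Reached only when `learn` completed without raising.
        return f"I have learned documents at **{load_path}**."


async def ok(path):
    # Stand-in for a successful learn step.
    return None


async def broken(path):
    # Stand-in for a loader that lacks an optional dependency.
    raise ImportError("pypdf package not found")


print(asyncio.run(run_learn(ok, "docs")))
print(asyncio.run(run_learn(broken, "docs/report.pdf")))
```

Keeping the success path in `else` rather than at the end of `try` means an exception raised while saving or formatting the reply would not be reported as a learning failure.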

* Fixed build error.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piyush Jain <[email protected]>
3 people authored Apr 16, 2024
1 parent 10657eb commit e27293c
Showing 3 changed files with 13 additions and 12 deletions.
18 changes: 10 additions & 8 deletions packages/jupyter-ai/jupyter_ai/chat_handlers/learn.py
@@ -118,13 +118,16 @@ async def process_message(self, message: HumanChatMessage):
         if args.verbose:
             self.reply(f"Loading and splitting files for {load_path}", message)

-        await self.learn_dir(
-            load_path, args.chunk_size, args.chunk_overlap, args.all_files
-        )
-        self.save()
-
-        response = f"""🎉 I have learned documents at **{load_path}** and I am ready to answer questions about them.
-        You can ask questions about these docs by prefixing your message with **/ask**."""
+        try:
+            await self.learn_dir(
+                load_path, args.chunk_size, args.chunk_overlap, args.all_files
+            )
+        except Exception as e:
+            response = f"""Learn documents in **{load_path}** failed. {str(e)}."""
+        else:
+            self.save()
+            response = f"""🎉 I have learned documents at **{load_path}** and I am ready to answer questions about them.
+            You can ask questions about these docs by prefixing your message with **/ask**."""
         self.reply(response, message)

     def _build_list_response(self):
@@ -155,7 +158,6 @@ async def learn_dir(

         delayed = split(path, all_files, splitter=splitter)
         doc_chunks = await dask_client.compute(delayed)
-
         em_provider_cls, em_provider_args = self.get_embedding_provider()
         delayed = get_embeddings(doc_chunks, em_provider_cls, em_provider_args)
         embedding_records = await dask_client.compute(delayed)
5 changes: 2 additions & 3 deletions packages/jupyter-ai/jupyter_ai/document_loaders/directory.py
@@ -8,13 +8,12 @@
 from langchain.document_loaders import PyPDFLoader
 from langchain.schema import Document
 from langchain.text_splitter import TextSplitter
-from pypdf import PdfReader


 # Uses pypdf which is used by PyPDFLoader from langchain
 def pdf_to_text(path):
-    reader = PdfReader(path)
-    text = "\n \n".join([page.extract_text() for page in reader.pages])
+    pages = PyPDFLoader(path)
+    text = "\n \n".join([page.page_content for page in pages.load_and_split()])
     return text


2 changes: 1 addition & 1 deletion packages/jupyter-ai/pyproject.toml
@@ -54,7 +54,7 @@ test = [

 dev = ["jupyter_ai_magics[dev]"]

-all = ["jupyter_ai_magics[all]"]
+all = ["jupyter_ai_magics[all]", "pypdf"]

 [tool.hatch.version]
 source = "nodejs"