Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@file throws error for PDF files #1044

Closed
srdas opened this issue Oct 21, 2024 · 6 comments · Fixed by #1106
Closed

@file throws error for PDF files #1044

srdas opened this issue Oct 21, 2024 · 6 comments · Fixed by #1106
Labels
bug Something isn't working

Comments

@srdas
Copy link
Collaborator

srdas commented Oct 21, 2024

The new feature @file throws an error when a PDF file is passed as context.

@file:GitHub/RAG_Docs/MertonHBR.pdf What does this file pertain to?

The error arises as the @file command does not handle PDF files (as the encoding needs special handling).

Traceback (most recent call last):
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/base.py", line 226, in on_message
    await self.process_message(message)
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 64, in process_message
    context_prompt = await self.make_context_prompt(message)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 75, in make_context_prompt
    await asyncio.gather(
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/base.py", line 159, in make_context_prompt
    return await self._make_context_prompt(message, commands)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 61, in _make_context_prompt
    if (context := self._make_command_context(i))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 90, in _make_command_context
    content = f.read()
              ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Suggested fixes:

  1. Provide a graceful error message that PDF files are not handled.
  2. Extend the command to also handle PDF files by processing them into text and then passing them as context.
@srdas srdas added the bug Something isn't working label Oct 21, 2024
@dlqqq
Copy link
Member

dlqqq commented Oct 21, 2024

More generally, we need a way to not allow @file to be called on binary blob files.

@michaelchia
Copy link
Collaborator

michaelchia commented Oct 21, 2024

I was a bit lazy with this and thought i was being conservative by only supporting what was in jupyter_ai.document_loaders.directory.SUPPORTED_EXTS.

if os.path.splitext(filepath)[1] not in SUPPORTED_EXTS:
raise ContextProviderException(
f"Cannot read unsupported file type '{filepath}' triggered by `{command}`. "
f"Supported file extensions are: {', '.join(SUPPORTED_EXTS)}."
)

I kind of assumed it was only some subset of text-based files and didn't notice .pdf was part of the list. So binary blobs in general should already be blocked.

If were to have a more comprehensive list, should it cover all text-based files or only code related ones? Like .log or .csv files may be very long and may accidentally blowup a token budget. Should it be up to the user to manage this risk themselves? or should we do a size check?

These were some questions I left to be solved in a future PR.

@srdas
Copy link
Collaborator Author

srdas commented Oct 21, 2024

@michaelchia - Thanks for responding so quickly!

  1. I think it should cover all text files, not just code related ones. In fact, I have been using plain text files with the @file command as much of the /learn I do is for single files. As LLM context windows have grown, users are exploiting the longer context windows and @file wonderfully makes this seamless; I'd say users are pretty aware of the cost issues now. However, the idea of a size check is a good one, so if the number of input tokens crosses a limit, say 2K, then pop up a warning and ask to proceed.
  2. As of now, CSV files are not supported.
  3. Maybe update the code to check for files that are not encoded as plain text? This would trap PDF files. But if we want to handle PDFs, it would need a pdf2txt conversion to be added, which can be done separately.

@michaelchia
Copy link
Collaborator

Personally, I don't have any strong opinions whichever way on this. I'll leave it up to you guys to decide what should be supported.

@dlqqq
Copy link
Member

dlqqq commented Oct 21, 2024

Relying on file extensions is not a very reliable method of determining a file's type; see #1030.

I can help offer guidance on a plan for improving file compatibility in @file and /learn more generally, while still allowing extra enhancements for special files on a best-effort basis.

  1. Clearly define what files can be included as context via @file or embedded via /learn, without relying on the extension in the filename. How else can we rigorously, programmatically define what a plaintext file / how we determine a file to be plaintext?
  2. (optional) Determine if it's possible to distinguish between "readable" plaintext (e.g. a documentation page in Markdown) and "unreadable" plaintext (e.g. a PGP private key) files.
  3. Modify @file and /learn to behave like this: if a file is not readable plaintext, try to coerce it to a readable plaintext file on a best-effort basis, based on the file extension / MIME type.
  4. Modify /learn to ignore files that are not readable plaintext and cannot be coerced to readable plaintext, instead of relying on a file extension allowlist.

@srdas
Copy link
Collaborator Author

srdas commented Nov 13, 2024

Further to this, we need to :

  1. Add documentation for @file

  2. Note that /learn does not work with @file as shown here:
    image
    Do we want to mix these commands as it makes using /learn user friendly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants