`@file` throws error for PDF files #1044

srdas · 2024-10-21T17:12:37Z

The new feature @file throws an error when a PDF file is passed as context.

@file:GitHub/RAG_Docs/MertonHBR.pdf What does this file pertain to?

The error arises as the @file command does not handle PDF files (as the encoding needs special handling).

Traceback (most recent call last):
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/base.py", line 226, in on_message
    await self.process_message(message)
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 64, in process_message
    context_prompt = await self.make_context_prompt(message)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 75, in make_context_prompt
    await asyncio.gather(
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/base.py", line 159, in make_context_prompt
    return await self._make_context_prompt(message, commands)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 61, in _make_context_prompt
    if (context := self._make_command_context(i))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 90, in _make_command_context
    content = f.read()
              ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Suggested fixes:

Provide a graceful error message that PDF files are not handled.
Extend the command to also handle PDF files by processing them into text and then passing them as context.
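
A minimal sketch of the first suggestion, assuming the file is read as UTF-8 text. The helper name `read_text_context` is hypothetical, and `ContextProviderException` is redefined here only as a stand-in so the snippet runs on its own; this is not the code that was merged.

```python
from pathlib import Path


class ContextProviderException(Exception):
    """Stand-in for the exception used in file.py (assumption: defined locally)."""


def read_text_context(filepath: str, command: str) -> str:
    """Return the file's text, or raise a readable error if it is not UTF-8 text."""
    try:
        return Path(filepath).read_text(encoding="utf-8")
    except UnicodeDecodeError as e:
        # Binary files such as PDFs end up here instead of surfacing a raw traceback.
        raise ContextProviderException(
            f"Cannot read '{filepath}' triggered by `{command}`: "
            "the file is not UTF-8 encoded plain text "
            "(binary files such as PDFs are not supported)."
        ) from e
```

The chat handler could then catch `ContextProviderException` and show its message in the chat rather than the traceback.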


dlqqq · 2024-10-21T18:13:46Z

More generally, we need a way to not allow @file to be called on binary blob files.

michaelchia · 2024-10-21T18:41:43Z

I was a bit lazy with this and thought I was being conservative by only supporting what was in jupyter_ai.document_loaders.directory.SUPPORTED_EXTS.

jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py

Lines 83 to 87 in 9630742

    
    if os.path.splitext(filepath)[1] not in SUPPORTED_EXTS:
        raise ContextProviderException(
            f"Cannot read unsupported file type '{filepath}' triggered by `{command}`. "
            f"Supported file extensions are: {', '.join(SUPPORTED_EXTS)}."
        )

I kind of assumed it was only some subset of text-based files and didn't notice .pdf was part of the list. So binary blobs in general should already be blocked.

If we were to have a more comprehensive list, should it cover all text-based files or only code-related ones? Files like .log or .csv may be very long and may accidentally blow up a token budget. Should it be up to the user to manage this risk themselves? Or should we do a size check?

These were some questions I left to be solved in a future PR.

srdas · 2024-10-21T18:58:49Z

@michaelchia - Thanks for responding so quickly!

I think it should cover all text files, not just code-related ones. In fact, I have been using plain text files with the @file command, as much of the /learn I do is for single files. As LLM context windows have grown, users are exploiting the longer context windows and @file wonderfully makes this seamless; I'd say users are pretty aware of the cost issues now. However, the idea of a size check is a good one, so if the number of input tokens crosses a limit, say 2K, then pop up a warning and ask to proceed.
As of now, CSV files are not supported.
Maybe update the code to check for files that are not encoded as plain text? This would trap PDF files. But if we want to handle PDFs, it would need a pdf2txt conversion to be added, which can be done separately.
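
The size check mentioned above could look roughly like this. It is only a sketch: the function names are hypothetical, the 4-characters-per-token ratio is a crude assumption (real tokenizers differ by model), and the 2K default simply mirrors the figure suggested in this comment.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; assumes about `chars_per_token` characters per token."""
    return -(-len(text) // chars_per_token)  # ceiling division


def exceeds_token_limit(text: str, limit: int = 2000) -> bool:
    """True if the estimated token count is above `limit`, i.e. a warning is due."""
    return estimate_tokens(text) > limit
```

A caller would run `exceeds_token_limit(content)` after reading the file and, if it returns `True`, ask the user to confirm before sending the context to the model.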

michaelchia · 2024-10-21T19:02:30Z

Personally, I don't have any strong opinions whichever way on this. I'll leave it up to you guys to decide what should be supported.

dlqqq · 2024-10-21T22:35:23Z

Relying on file extensions is not a very reliable method of determining a file's type; see #1030.

I can help offer guidance on a plan for improving file compatibility in @file and /learn more generally, while still allowing extra enhancements for special files on a best-effort basis.

Clearly define what files can be included as context via @file or embedded via /learn, without relying on the extension in the filename. How else can we rigorously, programmatically define what a plaintext file is, and how do we determine whether a file is plaintext?
(optional) Determine if it's possible to distinguish between "readable" plaintext (e.g. a documentation page in Markdown) and "unreadable" plaintext (e.g. a PGP private key) files.
Modify @file and /learn to behave like this: if a file is not readable plaintext, try to coerce it to a readable plaintext file on a best-effort basis, based on the file extension / MIME type.
Modify /learn to ignore files that are not readable plaintext and cannot be coerced to readable plaintext, instead of relying on a file extension allowlist.
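
One possible way to decide "is this plaintext?" from content rather than extension is sketched below. This is an illustration of the idea, not the approach adopted in the project: the function name `looks_like_plaintext` and the 8192-byte sample size are assumptions, and the heuristic treats only NUL-free, valid UTF-8 as plaintext (so UTF-16 text files would be rejected).

```python
import codecs


def looks_like_plaintext(path: str, sample_size: int = 8192) -> bool:
    """Heuristic: a file is plaintext if its first bytes are NUL-free, valid UTF-8."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # NUL bytes almost never occur in UTF-8 text but are common in binary formats.
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # If the sample was cut at sample_size, a multi-byte character may be split
        # at the end; final=False tolerates that incomplete trailing sequence.
        decoder.decode(sample, final=len(sample) < sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

Files that fail this check could then be passed to a best-effort converter chosen by extension or MIME type, as the plan above describes, or skipped.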

srdas · 2024-11-13T14:45:09Z

Further to this, we need to:

Add documentation for @file
Note that /learn does not work with @file as shown here:

Do we want to allow mixing these commands, since it would make /learn more user-friendly?

srdas added the bug label Oct 21, 2024

srdas mentioned this issue Nov 13, 2024

Catch error on non plaintext files in @file and reply gracefully in chat #1106

Merged

dlqqq closed this as completed in #1106 Nov 15, 2024
