Headline Splitter Exceptions problems #1598

Open
joaorura wants to merge 1 commit into main

joaorura
Contributor

When running with a Portuguese prompt on documents read with LlamaIndex, I noticed that several generated nodes came out empty, which raised an exception when they were passed to the embedding step.
To address this, I added extra handling so that the text search avoids the comparisons that were failing.
I was using PDF documents laid out in well-distributed text blocks, so the reader picked up stray \n characters and spurious spaces, which caused failures in the splitter.
GPT-4o mini would also often follow the example in the prompt and add index numbers to headlines that had none, since the example prompt uses numbered headlines. Many texts have headlines without this numbering, so the numbering generated by the LLM made the headlines not match the text.
I added code to handle these cases.
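
To make the two failure modes above concrete, here is a minimal sketch of the kind of handling described. It is not the code in this PR and does not use the ragas API: the helper names `normalize`, `strip_numbering`, `find_headline`, and `split_by_headlines` are hypothetical, and the numbering pattern only covers simple numeric prefixes such as `1.`, `2)`, or `1.2`.

```python
import re

# Hypothetical helpers for illustration only; not the ragas API
# and not the implementation in this PR.

# Matches a leading numeric index such as "1. ", "2) " or "1.2 ".
_NUMBERING = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def normalize(text: str) -> str:
    """Collapse every run of whitespace (including \\n) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def strip_numbering(headline: str) -> str:
    """Drop a leading numeric index that the LLM may have added."""
    return _NUMBERING.sub("", headline, count=1)


def find_headline(norm_text: str, headline: str) -> int:
    """Return the position of the headline in normalized text, or -1.

    The headline is tried as given first, then with its numbering removed.
    """
    for candidate in (normalize(headline), normalize(strip_numbering(headline))):
        if candidate:
            pos = norm_text.find(candidate)
            if pos != -1:
                return pos
    return -1


def split_by_headlines(text: str, headlines: list[str]) -> list[str]:
    """Split text at the headlines that can be located, skipping empty chunks."""
    norm_text = normalize(text)
    cuts = {0, len(norm_text)}
    for headline in headlines:
        pos = find_headline(norm_text, headline)
        if pos != -1:
            cuts.add(pos)
    bounds = sorted(cuts)
    chunks = [norm_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Empty chunks are dropped so they never reach the embedding step.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    pdf_text = "Introdução\nEste   texto tem\nquebras.\n\nMétodos\nDescrição  dos métodos."
    llm_headlines = ["1. Introdução", "2. Métodos", "3. Conclusão"]
    for chunk in split_by_headlines(pdf_text, llm_headlines):
        print(repr(chunk))
```

Headlines that cannot be found (here `3. Conclusão`) are simply ignored rather than producing an empty node. The plain substring search is a simplification: a headline word that also appears earlier in the body text would be matched at the wrong position.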

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 29, 2024
@jjmachan jjmachan requested a review from shahules786 November 8, 2024 07:05
@jjmachan
Member

jjmachan commented Nov 8, 2024

@shahules786 should we merge these changes? I don't know if @joaorura will be working on this anymore

@joaorura
Contributor Author

joaorura commented Nov 8, 2024

I will take a look at the issues and fix the PR.

@shahules786
Member

Hey @joaorura, I have made several changes to the headline splitter and extraction over the last few weeks. Do any of those fix your issue? Additionally, can you explain the issue with an example?

@joaorura
Contributor Author

@shahules786

I'll take a look at the changes you mentioned and see if they solve the problem.

I need some time to run the project again and collect some examples. Unfortunately, I made the mistake of not saving them to a file to present here.
