Headline Splitter Exceptions problems #1598
Open
When running with a Portuguese prompt on documents loaded with LlamaIndex, I noticed that several of the generated nodes came out empty, which raised an exception when they were passed to the embedding step.
To address this, I added extra handling so that the text search avoids the problematic comparisons.
I was using PDF documents whose layout has well-separated text blocks, so the reader produced stray \n characters and extra spaces in the extracted text, and these caused the splitter to fail.
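To illustrate the kind of whitespace-tolerant search described above, here is a minimal sketch. It is not the code in this PR; the function name `find_headline` is hypothetical, and it assumes the splitter only needs the character offset where a headline starts in the extracted text.

```python
import re


def find_headline(text: str, headline: str) -> int:
    """Return the offset in `text` where `headline` starts, or -1 if absent.

    Hypothetical helper: any run of whitespace in the text (spaces, tabs,
    newlines left by the PDF reader) is treated as a single separator.
    """
    words = headline.split()
    if not words:
        # An empty or whitespace-only headline can never be located.
        return -1
    # Escape each word literally and allow flexible whitespace between them.
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, text)
    return match.start() if match else -1
```

With this approach, a headline such as "Visão geral do projeto" can still be found when the reader has split it across lines or padded it with extra spaces.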
GPT 4o mini would often imitate the example prompt, which uses numbered headlines, and add index numbers to headlines that had none in the document. Many texts have headlines without this numbering, so the numbering generated by the LLM made the headlines incompatible with the text.
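One way to tolerate numbering that the LLM invented is to try the headline both as returned and with any leading index removed. The sketch below is an assumption about how that could look, not the exact change in this PR; `strip_index` and `candidate_headlines` are hypothetical names.

```python
import re

# Leading index such as "1 ", "1. ", "2) " or "1.2.3 " followed by whitespace.
_INDEX_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def strip_index(headline: str) -> str:
    """Remove a leading numeric index from a headline, if there is one."""
    return _INDEX_PREFIX.sub("", headline, count=1).strip()


def candidate_headlines(headline: str) -> list[str]:
    """Return the headline variants to search for, original first.

    The original comes first so that headlines that really start with a
    number in the document are still matched as written.
    """
    original = headline.strip()
    stripped = strip_index(headline)
    if not stripped or stripped == original:
        return [original]
    return [original, stripped]
```

Searching for each candidate in order means a headline is only reported as missing when neither the numbered nor the unnumbered form appears in the text.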
I added code to handle these cases.