text_splitters: Add HTMLSemanticPreservingSplitter #25911
…g Capabilities (#1)
Remove stopword testcase for now due to nltk requirement
Hey @munday-tech this looks very cool, thanks for the contribution! The main thing I'm wondering about is how we can make it clearer to end users which HTML splitter to use and when. I think a new user would be pretty confused right now about how the different HTML splitters work. Maybe it would make sense to have a single "How to split HTML" page that shows how the results of the three splitters would differ on the same HTML file? Very open to other ideas as well!
Thanks @baskaryan. Totally agree, having a single page to go to and see the differences in action is a great idea. I've put together a draft of this page and added it to this PR. Would be great to get some feedback on whether this is the direction we want to head with it. It may need some cleaning up and rewording in places. If not, I can do a pure ipynb, let me know! Few notes:
Happy to build on this and flesh it out a bit more. :)
Hey @baskaryan, sorry for the delay. Been flat out these past two weeks. I've added a new doc for choosing a splitter now!
Any updates on this @baskaryan?
@munday-tech thanks for this. I've marked the new class as `beta` for now in case we want to update its behavior as it starts getting used. Can de-beta when needed.
Really nice work on the guides. I've consolidated them into a single "How to split HTML" guide for now based on your comparison guide. Very much appreciate the thought you put into that.
Description:
The current HTML splitters rely on a secondary pass through `RecursiveCharacterTextSplitter` to break the document into manageable chunks. The issue is that this pass fails to maintain important structures within the HTML, such as tables and lists.
This implementation of an HTML splitter allows the user to define a maximum chunk size, HTML elements to preserve in full, an option to preserve `<a>` href links in the output, and custom handlers.
The core splitting begins with headers, similar to `HTMLHeaderTextSplitter`. If a section exceeds `max_chunk_size`, further recursive splitting is triggered. During this splitting, elements listed for preservation are excluded from the splitting process. This can cause chunks to be slightly larger than the max size, depending on the length of the preserved content. However, the contextual relevance of the preserved item remains intact.
Custom Handlers: Some companies, such as Atlassian, have custom HTML elements that are not parsed by default with `BeautifulSoup`. Custom handlers allow a user to provide a function to be run whenever a specific HTML tag is encountered. This allows the user to preserve and gather information within custom HTML tags that `bs4` would potentially miss during extraction.
Dependencies: Users will need to install `bs4` in their project to utilise this class.
I have also added a `how_to` and unit tests, which require `bs4` to run; otherwise they will be skipped.
Flowchart of process:
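As a rough illustration of the preservation step described above, here is a minimal stdlib-only sketch. It is not the code from this PR: the function name `split_preserving`, the placeholder scheme, and the regex-based tag matching are all assumptions made for the example, and header-based splitting is left out. Preserved elements are swapped for placeholder tokens, the remaining text is chunked, and the elements are restored afterwards, which is why a chunk can end up larger than the requested size.

```python
import re


def split_preserving(html, max_chunk_size, preserve=("table", "ul", "ol")):
    """Toy sketch (hypothetical helper): chunk text, keeping listed elements whole."""
    placeholders = {}

    def stash(match):
        # Replace a preserved element with an opaque token so it cannot be split.
        key = f"__PRESERVED_{len(placeholders)}__"
        placeholders[key] = match.group(0)
        return f" {key} "

    # Assumption: preserved elements are not nested inside the same tag name;
    # a real implementation would use an HTML parser instead of a regex.
    pattern = re.compile(
        r"<(%s)\b[^>]*>.*?</\1>" % "|".join(map(re.escape, preserve)),
        flags=re.DOTALL | re.IGNORECASE,
    )
    text = pattern.sub(stash, html)

    # Drop the remaining tags and split the plain text into words.
    words = re.sub(r"<[^>]+>", " ", text).split()

    # Greedily pack words into chunks of at most max_chunk_size characters.
    chunks, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Put the preserved elements back; this may push a chunk past the limit.
    restored = []
    for chunk in chunks:
        for key, original in placeholders.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


html = (
    "<h1>Title</h1><p>one two three four five six</p>"
    "<table><tr><td>a</td><td>b</td></tr></table><p>seven eight</p>"
)
for chunk in split_preserving(html, max_chunk_size=20):
    print(repr(chunk))
```

In this sketch the table comes back as a single chunk even though it is longer than `max_chunk_size`, mirroring the trade-off noted in the description: size limits are relaxed so that the preserved structure stays intact.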