text_splitters: Add HTMLSemanticPreservingSplitter #25911

Merged
merged 26 commits into from
Dec 19, 2024

munday-tech
Contributor

Description:

Current HTML splitters rely on a secondary pass with RecursiveCharacterTextSplitter to break the document into manageable chunks. The problem is that this pass fails to keep important structures within the HTML, such as tables and lists, intact.

This implementation of an HTML splitter lets the user define a maximum chunk size, HTML elements to preserve in full, an option to preserve <a> href links in the output, and custom handlers.

The core splitting begins with headers, similar to HTMLHeaderTextSplitter. If a section exceeds max_chunk_size, further recursive splitting is triggered. During this splitting, elements listed for preservation are excluded from the splitting process. This can make chunks slightly larger than the maximum size, depending on the length of the preserved element. However, the full context of the preserved item remains intact.
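
To make the preservation step above concrete, here is a minimal, dependency-free sketch of the general idea: swap each preserved element for a placeholder, split the remaining text, then restore the originals. This is not the code from this PR. The function name `split_preserving`, the placeholder format, and the regex-based matching are all assumptions for illustration (the real class parses with bs4, and a regex like this does not handle nested elements of the same tag).

```python
import re


def split_preserving(html_text, preserve_tags, max_chunk_size):
    """Hypothetical sketch: pack words into chunks of roughly max_chunk_size
    characters without ever cutting inside an element in preserve_tags."""
    placeholders = {}

    def stash(match):
        # Replace the whole element with a single whitespace-free token.
        key = f"__PRESERVED_{len(placeholders)}__"
        placeholders[key] = match.group(0)
        return key

    for tag in preserve_tags:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL | re.IGNORECASE
        )
        html_text = pattern.sub(stash, html_text)

    # Greedy word packing; a placeholder counts as one indivisible word.
    chunks, current = [], ""
    for word in html_text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Restore the preserved elements. A restored chunk may exceed
    # max_chunk_size, because the size check only saw the placeholder.
    restored = []
    for chunk in chunks:
        for key, original in placeholders.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


sample = (
    "Intro words here. "
    "<table><tr><td>a b c d e f g</td></tr></table> "
    "Closing words follow after."
)
for piece in split_preserving(sample, ["table"], 20):
    print(repr(piece))
```

The trade-off matches the one described above: the table always lands whole in a single chunk, at the cost of that chunk possibly being longer than the requested maximum.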

Custom handlers: Some companies, such as Atlassian, use custom HTML elements that BeautifulSoup does not parse by default. Custom handlers let a user provide a function to be run whenever a specific HTML tag is encountered. This lets the user preserve and gather information within custom HTML tags that bs4 might otherwise miss during extraction.
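
As a rough illustration of the custom handler idea, the sketch below maps tag names to functions and calls the matching function whenever that tag is closed, using only the standard library's `html.parser`. It is not the PR's implementation: the class `HandlerExtractor`, the helper `extract_text`, the `(attrs, inner_text)` handler signature, and the `status-badge` tag are all hypothetical, chosen to keep the example self-contained.

```python
from html.parser import HTMLParser


class HandlerExtractor(HTMLParser):
    """Hypothetical sketch: extract text, routing selected tags
    through user-supplied handler functions."""

    def __init__(self, custom_handlers):
        super().__init__()
        self.custom_handlers = custom_handlers
        self.output = []   # text fragments for the final result
        self._stack = []   # open handled tags: (tag, attrs, buffer)

    def handle_starttag(self, tag, attrs):
        if tag in self.custom_handlers:
            self._stack.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Text inside a handled tag is buffered for its handler.
        target = self._stack[-1][2] if self._stack else self.output
        target.append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, attrs, buffer = self._stack.pop()
            rendered = self.custom_handlers[tag](attrs, "".join(buffer))
            target = self._stack[-1][2] if self._stack else self.output
            target.append(rendered)


def extract_text(html_text, custom_handlers):
    parser = HandlerExtractor(custom_handlers)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.output)


# Assumed example: render a made-up <status-badge> element as bracketed text.
handlers = {
    "status-badge": lambda attrs, text: f"[{attrs.get('color')}: {text}]",
}
print(extract_text(
    '<p>Build: <status-badge color="green">passing</status-badge></p>',
    handlers,
))
```

Keeping handlers in a plain dictionary keyed by tag name means support for a new vendor-specific element is one extra entry, with no change to the extraction loop.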

Dependencies: Users will need to install bs4 in their project to use this class.

I have also added a how-to guide and unit tests. The tests require bs4 to run; otherwise they will be skipped.

Flowchart of process:

HTMLSemanticPreservingSplitter

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder labels Sep 1, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 2, 2024
Remove stopword testcase for now due to nltk requirement
@baskaryan
Collaborator

Hey @munday-tech this looks very cool, thanks for the contribution!

The main thing I'm wondering about is how we can make it more clear to end users which html splitter to use and when. I think a new user would be pretty confused right now about how the different html splitters work.

Maybe it would make sense to have a single "How to split HTML" page that shows how the results of the three splitters would be different on the same html file? very open to other ideas as well!

@munday-tech
Contributor Author

Thanks @baskaryan.

Totally agree, having a single page to go to and see the difference in action is a great idea. I've put together a draft of this page and added it to this PR.

Would be great to get some feedback on whether this is the direction we want to head with it. It may need some cleaning up and rewording in places. If not, I can do a pure ipynb instead, let me know!

Few notes:

  • Used mdx to try and reduce clutter on the page with tabs for each splitter.
  • Added an in-page HTML example so it's visual for the reader; the raw HTML is still accessible via an expandable code block.
  • At the moment, it does not include max_chunk_size for the Semantic splitter. We would likely need to add a bigger HTML doc, plus examples of recursively splitting it with the other splitters, to highlight the preservation of elements within chunks.

Happy to build on this and flesh it out a bit more. :)

@munday-tech
Contributor Author

Hey @baskaryan sorry for the delay. Been flat out these past two weeks.

I've added a new doc for choosing a splitter now!

@munday-tech
Contributor Author

Any updates on this @baskaryan?

Collaborator

@ccurme ccurme left a comment

@munday-tech thanks for this. I've marked the new class as beta for now in case we want to update its behavior as it starts getting used. Can de-beta when needed.

Really nice work on the guides. I've consolidated them into a single "How to split HTML" guide for now based on your comparison guide. Very much appreciate the thought you put into that.

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 19, 2024
@ccurme ccurme merged commit f696950 into langchain-ai:master Dec 19, 2024
57 checks passed
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder lgtm PR looks good. Use to confirm that a PR is ready for merging. size:XXL This PR changes 1000+ lines, ignoring generated files. Ɑ: text splitters Related to text splitters package
Projects
Status: Done
3 participants