text_splitters: Add HTMLSemanticPreservingSplitter #25911

munday-tech · 2024-09-01T03:46:12Z

Description:

Current HTML splitters rely on a secondary pass through RecursiveCharacterTextSplitter to break the document into manageable chunks. The problem is that this pass fails to preserve important structures within the HTML, such as tables and lists.

This implementation of an HTML splitter lets the user define a maximum chunk size, HTML elements to preserve in full, an option to preserve <a> href links in the output, and custom handlers.
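To illustrate what preserving <a> href links means in practice, here is a minimal standard-library sketch. It is not the code in this PR: the helper name `html_to_text`, the regex approach, and the Markdown-style `[text](url)` output format are assumptions made for the example.

```python
import re

# Hypothetical helper, not the PR's code: rewrite <a href="...">text</a>
# as "[text](url)" so the URL survives when the remaining tags are stripped.
A_TAG = re.compile(
    r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL
)
ANY_TAG = re.compile(r"<[^>]+>")


def html_to_text(html: str, preserve_links: bool = False) -> str:
    """Strip tags from html, optionally keeping link targets inline."""
    if preserve_links:
        html = A_TAG.sub(lambda m: f"[{m.group(2)}]({m.group(1)})", html)
    return ANY_TAG.sub("", html).strip()


sample = '<p>See the <a href="https://example.com/docs">docs</a> for details.</p>'
plain = html_to_text(sample)
linked = html_to_text(sample, preserve_links=True)
print(plain)   # link target is lost
print(linked)  # link target is kept next to its anchor text
```

Without the option, the chunk only contains the anchor text; with it, the URL stays attached to the text it belongs to.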

The core splitting begins with headers, similar to HTMLHeaderTextSplitter. If a section exceeds max_chunk_size, further recursive splitting is triggered. During this splitting, the elements listed for preservation are excluded from the splitting process. This can make chunks slightly larger than the max size, depending on the length of the preserved element. However, the preserved item stays intact, so its contextual relevance is retained.
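The preservation step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: the function name `split_preserving`, the placeholder scheme, and the greedy word packing are all hypothetical, the header-splitting stage is omitted, and the regex does not handle nested elements of the same tag.

```python
import re


# Illustrative only: names and logic are assumptions, not the PR's code.
def split_preserving(html: str, preserve: list[str], max_chunk_size: int) -> list[str]:
    placeholders: dict[str, str] = {}

    def stash(match: re.Match) -> str:
        key = f"__PRESERVED_{len(placeholders)}__"
        placeholders[key] = match.group(0)
        return key

    # 1. Swap each preserved element for a short opaque token.
    for tag in preserve:
        pattern = re.compile(rf"<{tag}\b.*?</{tag}>", re.IGNORECASE | re.DOTALL)
        html = pattern.sub(stash, html)

    # 2. Strip the remaining tags and greedily pack words into chunks.
    text = re.sub(r"<[^>]+>", " ", html)
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # 3. Restore preserved elements; a chunk may now exceed max_chunk_size.
    restored = []
    for chunk in chunks:
        for key, original in placeholders.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


html = (
    "<h1>Pricing</h1><p>Our plans are listed in the table below for reference.</p>"
    "<table><tr><td>Basic</td><td>$5</td></tr><tr><td>Pro</td><td>$20</td></tr></table>"
    "<p>Contact sales for more.</p>"
)
chunks = split_preserving(html, ["table"], max_chunk_size=40)
for chunk in chunks:
    print(len(chunk), chunk)
```

The table is never cut in half: it lands whole in a single chunk, and that chunk is longer than the 40-character limit, which is the trade-off the description mentions.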

Custom Handlers: Sometimes companies such as Atlassian use custom HTML elements that BeautifulSoup does not parse by default. Custom handlers allow a user to provide a function to be run whenever a specific HTML tag is encountered. This lets the user preserve and gather information within custom HTML tags that bs4 could otherwise miss during extraction.
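The idea of dispatching a user-supplied function per tag can be sketched with the standard library's `html.parser`. This is a hedged illustration, not the class added in this PR: the `HandlerDispatchParser` class, the `(attrs, text)` handler signature, and the `<x-note>` tag are hypothetical, and nested occurrences of a handled tag are not supported.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

# Assumed signature: a handler receives the tag's attributes and inner text
# and returns the text that should appear in the output.
Handler = Callable[[Dict[str, Optional[str]], str], str]


class HandlerDispatchParser(HTMLParser):
    """Collects text, routing registered tags through user-supplied handlers."""

    def __init__(self, custom_handlers: Dict[str, Handler]) -> None:
        super().__init__()
        self.custom_handlers = custom_handlers
        self.output: List[str] = []
        self._active = None  # (tag, attrs, buffered text) while inside a handled tag

    def handle_starttag(self, tag, attrs):
        if self._active is None and tag in self.custom_handlers:
            self._active = (tag, dict(attrs), [])

    def handle_data(self, data):
        if self._active is not None:
            self._active[2].append(data)
        elif data.strip():
            self.output.append(data.strip())

    def handle_endtag(self, tag):
        if self._active is not None and tag == self._active[0]:
            _, attrs, parts = self._active
            self._active = None
            self.output.append(self.custom_handlers[tag](attrs, "".join(parts)))


def note_handler(attrs, text):
    # Fold the custom tag's attribute into the extracted text.
    return f"[{(attrs.get('level') or 'note').upper()}] {text}"


html = (
    "<p>Deploy steps</p>"
    '<x-note level="warning">Back up the database first</x-note>'
    "<p>Then run the migration.</p>"
)
parser = HandlerDispatchParser({"x-note": note_handler})
parser.feed(html)
parser.close()
result = " ".join(parser.output)
print(result)
```

Here the handler keeps the `level` attribute, which plain text extraction would drop.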

Dependencies: Users will need to install bs4 in their project to utilise this class.

I have also added a how_to guide and unit tests. The tests require bs4 to run; otherwise they are skipped.

Flowchart of process:

vercel · 2024-09-01T03:46:16Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Dec 19, 2024 5:05pm

baskaryan · 2024-09-03T00:46:56Z

Hey @munday-tech this looks very cool, thanks for the contribution!

The main thing I'm wondering about is how we can make it more clear to end users which html splitter to use and when. I think a new user would be pretty confused right now about how the different html splitters work.

Maybe it would make sense to have a single "How to split HTML" page that shows how the results of the three splitters would be different on the same HTML file? Very open to other ideas as well!

munday-tech · 2024-09-03T10:46:59Z

Thanks @baskaryan.

Totally agree, having a single page to go to and see the difference in action is a great idea. I've put together a draft of this page and added it to this PR.

Would be great to get some feedback on whether this is the direction we want to head with it. It may need some cleaning up and re-wording in places. If not, I can do a pure ipynb, let me know!

Few notes:

Used mdx to try to reduce clutter on the page, with tabs for each splitter.
Added an in-page HTML example so it's visual to the reader; the raw HTML is still accessible via an expandable code block.
At the moment, it does not include max_chunk_size for the Semantic splitter. We would likely need to add a bigger HTML doc, plus examples of recursively splitting it with the other splitters, to highlight the preservation of elements within chunks.

Happy to build on this and flesh it out a bit more. :)

munday-tech · 2024-09-17T14:08:39Z

Hey @baskaryan sorry for the delay. Been flat out these past two weeks.

I've added a new doc for choosing a splitter now!

munday-tech · 2024-09-30T14:31:20Z

Any updates on this @baskaryan?

ccurme

@munday-tech thanks for this. I've marked the new class as beta for now in case we want to update its behavior as it starts getting used. Can de-beta when needed.

Really nice work on the guides. I've consolidated them into a single "How to split HTML" guide for now based on your comparison guide. Very much appreciate the thought you put into that.

munday-tech added 2 commits September 1, 2024 02:39

Init: HTMLSemanticPreservingSplitter

4f537a5

Add docs, fix <html> issue

22f4c26

dosubot bot added the labels size:XL (This PR changes 500-999 lines, ignoring generated files.), Ɑ: text splitters (Related to text splitters package), and 🤖:docs (Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder) Sep 1, 2024

munday-tech added 2 commits September 1, 2024 13:57

fix mypy linting

796cfe3

fix doc linting errors

c4212e3

vercel bot deployed to Preview September 1, 2024 04:28 View deployment

munday-tech added 2 commits September 1, 2024 16:36

Docs: Add <a> override example

68058d9

fix: minor ruff error

fc2ba8f

vercel bot deployed to Preview September 1, 2024 06:57 View deployment

Merge branch 'master' into master

9da0e0e

vercel bot deployed to Preview September 1, 2024 07:27 View deployment

Enhanced HTMLSemanticPreservingSplitter with Media and Text Processing Capabilities (#1)

bf3ff8c

dosubot bot added the label size:XXL (This PR changes 1000+ lines, ignoring generated files.) and removed the label size:XL (This PR changes 500-999 lines, ignoring generated files.) Sep 2, 2024

remove test case

08afd65

Remove stopword testcase for now due to nltk requirement

vercel bot deployed to Preview September 2, 2024 06:37 View deployment

baskaryan added 4 commits September 2, 2024 16:32

Merge branch 'master' into munday-tech/master

54d26fa

fmt

59aec44

nit

731a1e4

lint

0254437

vercel bot deployed to Preview September 3, 2024 00:17 View deployment

munday-tech added 2 commits September 3, 2024 20:35

doc example

dc58c1b

fix: Prevent empty docs

97dfbc9

vercel bot had a problem deploying to Preview September 3, 2024 10:44 Failure

remove links; update index

6afdf2b

Merge branch 'master' into master

c28765c

vercel bot had a problem deploying to Preview September 6, 2024 01:34 Failure

Docs: Add new How to guide

8b3fce3

vercel bot deployed to Preview September 17, 2024 13:57 View deployment

efriis assigned baskaryan Dec 16, 2024

ccurme added 4 commits December 19, 2024 10:36

Merge branch 'master' into munday-tech/master

b740e75

expand html guide

626471c

delete old guides and add redirects

b84afa2

lint

bb03d92

vercel bot deployed to Preview December 19, 2024 16:26 View deployment

ccurme added 2 commits December 19, 2024 11:32

fixes

3b13dba

mark as beta

1943f3f

vercel bot deployed to Preview December 19, 2024 16:51 View deployment

ccurme added 2 commits December 19, 2024 11:54

fix

44d6d33

update cassettes

eebbf57

vercel bot deployed to Preview December 19, 2024 17:05 View deployment

ccurme approved these changes Dec 19, 2024

dosubot bot added the label lgtm (PR looks good. Use to confirm that a PR is ready for merging.) Dec 19, 2024

ccurme merged commit f696950 into langchain-ai:master Dec 19, 2024
57 checks passed
