Strategy: html #22
Comments
Created a PR: #23. Please let me know how I can improve it.
Just saw the metadata branch (https://github.com/revelrylabs/text_chunker_ex/tree/14-metadata-to-chunk). I think this could work well with it, following this sort of strategy: https://blog.langchain.dev/a-chunk-by-any-other-name#Q+A-with-Structured-Chunking
This is absolutely the kind of thing we want to support; thank you for your contribution! That article is very interesting. I let that metadata branch go, because it seemed like the data I was adding during the splitting wasn't actually relevant to the splitting itself. However, adding the name of any given HTML or markdown section to metadata per chunk might be a worthy cause. Hell, if we want to split on functions and modules, having that information in the chunk itself just sounds like more context for the chunk, which sounds great. In the meantime, HTML splitters to split a given document according to its own explicit informational structure are much appreciated ❤️
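To make the "section name as per-chunk metadata" idea concrete, here is a rough sketch. It is not code from text_chunker_ex or from PR #23; it is written in Python purely for illustration, and `HeadingChunker` and `chunk_html` are hypothetical names. It assumes that a new chunk starts at every `h1` to `h6` heading and that the heading text is the only metadata worth keeping.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Tags treated as block-level, so their text does not run into the neighbours.
BLOCK_TAGS = {"p", "div", "li", "tr", "br", "section", "article"}


class HeadingChunker(HTMLParser):
    """Hypothetical splitter: starts a new chunk at every heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []           # finished chunks
        self._section = None       # heading text of the section being read
        self._in_heading = False
        self._heading_parts = []
        self._text_parts = []

    def flush(self):
        # Collapse whitespace and emit the pending text as one chunk.
        text = " ".join("".join(self._text_parts).split())
        if text:
            self.chunks.append(
                {"text": text, "metadata": {"section": self._section}}
            )
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.flush()           # close the previous section's chunk
            self._in_heading = True
            self._heading_parts = []
        elif tag in BLOCK_TAGS:
            self._text_parts.append(" ")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._section = " ".join("".join(self._heading_parts).split())
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._text_parts.append(data)


def chunk_html(html):
    """Return a list of {"text", "metadata"} dicts, one per heading section."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()                 # emit whatever follows the last heading
    return parser.chunks
```

Text that appears before the first heading ends up in a chunk whose `section` is `None`. A real strategy would probably also want to cap chunk size and track nested headings, which this sketch leaves out.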
Are there any plans or interest in an html chunking strategy?
There are some ideas here: https://medium.com/unstructured-io/easy-web-scraping-and-chunking-by-document-elements-for-large-language-models-c45d13aca8dd