langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678

Open
wants to merge 45 commits into base: master
45 commits
0771f8e
Update html.py
AhmedTammaa Oct 28, 2024
c52667a
Merge branch 'master' into patch-1
AhmedTammaa Oct 29, 2024
8dc8e46
Merge branch 'master' into patch-1
AhmedTammaa Nov 8, 2024
d4efd97
Update html.py
AhmedTammaa Nov 8, 2024
73c001c
Update html.py
AhmedTammaa Nov 8, 2024
9119fe9
Update html.py
AhmedTammaa Nov 8, 2024
7e0ce8e
Update html.py
AhmedTammaa Nov 8, 2024
d604fd1
Merge branch 'master' into patch-1
AhmedTammaa Nov 8, 2024
dfe4ee4
Merge branch 'master' into patch-1
eyurtsev Dec 13, 2024
6bfc158
Update html.py
AhmedTammaa Dec 16, 2024
b84f13c
Update test_text_splitters.py
AhmedTammaa Dec 16, 2024
6a2f1e9
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
17ae8b9
added import Tuple
AhmedTammaa Dec 17, 2024
be9de90
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
851ba7e
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
0306951
added beautifulsoup4 to poetry depedencies
AhmedTammaa Dec 17, 2024
09e7852
Merge branch 'master' into patch-1
AhmedTammaa Dec 18, 2024
ae50b32
discarded bs4 dependency
AhmedTammaa Dec 18, 2024
f9a93d0
Removed uncessary module docstring, updated docstring of HTMLHeaderTe…
AhmedTammaa Dec 18, 2024
438aedd
improved docstring for the class `HTMLHeaderTextSplitter`
AhmedTammaa Dec 18, 2024
d573723
removed typing from docstring when type is hinted.
AhmedTammaa Dec 18, 2024
405ea70
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
f6e45e2
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
617e04a
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
b82bfc9
added pytest mark require bs4
AhmedTammaa Dec 19, 2024
4297787
added requirement bs4 marker for the test cases
AhmedTammaa Dec 19, 2024
c2107b1
all test function involving HTMLHeaderTextSplitter has bs4 requirment…
AhmedTammaa Dec 19, 2024
4261885
added bs4 import in the split_file_function and removed it from top l…
AhmedTammaa Dec 19, 2024
567318a
fixing linting errors and improved documentation for HTMLHeaderTextSp…
AhmedTammaa Dec 19, 2024
53685eb
fixed docstring issue and sorted imports
AhmedTammaa Dec 19, 2024
9ff0bfa
sorted imports and defined `nodes` in `_generate_documents` docstring
AhmedTammaa Dec 19, 2024
aeae28c
updated import order
AhmedTammaa Dec 19, 2024
e67f6bd
fixed all linting issues with Ruff
AhmedTammaa Dec 20, 2024
3b8a547
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
cdd62b7
removed extra blank space from `_finalize_chunk`
AhmedTammaa Dec 20, 2024
b4d4e57
added types for untyped function paramters. Typed `stack` variable as…
AhmedTammaa Dec 20, 2024
d7ea998
fixed "line too long" in test_text_splitters
AhmedTammaa Dec 20, 2024
2bf3726
fixed linter issues in test_text_splitter.py
AhmedTammaa Dec 20, 2024
7dd9f15
fixed mypy issues
AhmedTammaa Dec 20, 2024
456c36a
fixed all formatting issues and checked with pre-commit
AhmedTammaa Dec 20, 2024
533bc90
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
f31e4b7
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
bbe5616
simplified HTMLHeaderSplitter Logic
AhmedTammaa Dec 21, 2024
5637dc7
improved documentation and formatting
AhmedTammaa Dec 21, 2024
4aaa912
Merge branch 'master' into patch-1
AhmedTammaa Dec 23, 2024