feat: unify NLTKDocumentSplitter and DocumentSplitter #8617

Merged
30 commits merged into main from unify-DocumentSplitter-NLTKDocumentSplitter on Dec 12, 2024
Changes from 16 commits

Commits (30)
ffbb5f6
wip: initial import
davidsbatista Dec 5, 2024
c467ce1
wip: refactoring
davidsbatista Dec 5, 2024
d5207e1
wip: refactoring tests
davidsbatista Dec 6, 2024
a6f4c26
wip: refactoring tests
davidsbatista Dec 7, 2024
a17be34
making all NLTKSplitter related tests work
davidsbatista Dec 9, 2024
b32657f
refactoring
davidsbatista Dec 9, 2024
5b30672
docstrings
davidsbatista Dec 9, 2024
1d21e62
refactoring and removing NLTKDocumentSplitter
davidsbatista Dec 9, 2024
4238b63
fixing tests for custom sentence tokenizer
davidsbatista Dec 9, 2024
3cdc2df
fixing tests for custom sentence tokenizer
davidsbatista Dec 9, 2024
d834b39
cleaning up
davidsbatista Dec 9, 2024
3769013
adding release notes
davidsbatista Dec 9, 2024
fdf8f92
Merge branch 'main' into unify-DocumentSplitter-NLTKDocumentSplitter
davidsbatista Dec 9, 2024
0ee7395
reverting some changes
davidsbatista Dec 9, 2024
9080743
Merge branch 'main' into unify-DocumentSplitter-NLTKDocumentSplitter
davidsbatista Dec 9, 2024
9d682f5
cleaning up tests
davidsbatista Dec 10, 2024
09e67fa
fixing serialisation and adding tests
davidsbatista Dec 11, 2024
5653dba
cleaning up
davidsbatista Dec 11, 2024
0802774
wip
davidsbatista Dec 12, 2024
f1745c7
renaming and cleaning
davidsbatista Dec 12, 2024
06803d9
adding NLTK files
davidsbatista Dec 12, 2024
73a0e68
Merge branch 'main' into unify-DocumentSplitter-NLTKDocumentSplitter
davidsbatista Dec 12, 2024
25ab42e
updating docstring
davidsbatista Dec 12, 2024
aca82e4
adding import to init
davidsbatista Dec 12, 2024
47ab319
Update haystack/components/preprocessors/document_splitter.py
davidsbatista Dec 12, 2024
0ecc817
updating tests
davidsbatista Dec 12, 2024
75952d4
wip
davidsbatista Dec 12, 2024
ec03550
adding sentence/period change warning
davidsbatista Dec 12, 2024
ed47997
fixing LICENSE header
davidsbatista Dec 12, 2024
a82614e
Update haystack/components/preprocessors/document_splitter.py
davidsbatista Dec 12, 2024
3 changes: 1 addition & 2 deletions haystack/components/preprocessors/__init__.py
@@ -4,7 +4,6 @@
 
 from .document_cleaner import DocumentCleaner
 from .document_splitter import DocumentSplitter
-from .nltk_document_splitter import NLTKDocumentSplitter
 from .text_cleaner import TextCleaner
 
-__all__ = ["DocumentSplitter", "DocumentCleaner", "TextCleaner", "NLTKDocumentSplitter"]
+__all__ = ["DocumentSplitter", "DocumentCleaner", "TextCleaner"]
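
For orientation, here is a minimal sketch (not part of the diff) of how downstream imports are affected by this `__init__.py` change:

```python
# Before this PR, NLTKDocumentSplitter was exported from the preprocessors package:
# from haystack.components.preprocessors import NLTKDocumentSplitter

# After this PR, DocumentSplitter is the single splitter component exported here:
from haystack.components.preprocessors import DocumentSplitter
```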
256 changes: 224 additions & 32 deletions haystack/components/preprocessors/document_splitter.py

Large diffs are not rendered by default.

279 changes: 0 additions & 279 deletions haystack/components/preprocessors/nltk_document_splitter.py

This file was deleted.

19 changes: 12 additions & 7 deletions haystack/components/preprocessors/sentence_tokenizer.py
@@ -186,11 +186,16 @@ def _needs_join(
         """
         Checks if the spans need to be joined as parts of one sentence.
 
+        This method determines whether two adjacent sentence spans should be joined back together as a single sentence.
+        It's used to prevent incorrect sentence splitting in specific cases like quotations, numbered lists,
+        and parenthetical expressions.
+
         :param text: The text containing the spans.
-        :param span: The current sentence span within text.
-        :param next_span: The next sentence span within text.
+        :param span: Tuple of (start, end) positions for the current sentence span.
+        :param next_span: Tuple of (start, end) positions for the next sentence span.
         :param quote_spans: All quoted spans within text.
-        :returns: True if the spans needs to be joined.
+        :returns:
+            True if the spans needs to be joined.
         """
         start, end = span
         next_start, next_end = next_span
@@ -216,16 +221,16 @@
         return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None
 
     @staticmethod
-    def _read_abbreviations(language: Language) -> List[str]:
+    def _read_abbreviations(lang: Language) -> List[str]:
         """
         Reads the abbreviations for a given language from the abbreviations file.
 
-        :param language: The language to read the abbreviations for.
+        :param lang: The language to read the abbreviations for.
         :returns: List of abbreviations.
         """
-        abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{language}.txt"
+        abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{lang}.txt"
         if not abbreviations_file.exists():
-            logger.warning("No abbreviations file found for {language}.Using default abbreviations.", language=language)
+            logger.warning("No abbreviations file found for {language}. Using default abbreviations.", language=lang)
             return []
 
         abbreviations = abbreviations_file.read_text().split("\n")
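
To make the joining behaviour described in the new docstring concrete, here is a small self-contained sketch. It is not the actual `SentenceSplitter` implementation; the helper names (`quote_spans`, `needs_join`) and the exact joining conditions are simplified assumptions, with only the parenthetical check mirroring the `re.search(r"^\s*[\(\[]", ...)` line visible in the diff above:

```python
import re
from typing import List, Tuple


def quote_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) positions of double-quoted regions in the text."""
    return [(m.start(), m.end()) for m in re.finditer(r'"[^"]*"', text)]


def needs_join(text: str, span: Tuple[int, int], next_span: Tuple[int, int]) -> bool:
    """Decide whether two adjacent sentence spans should be re-joined (simplified)."""
    # Join if the boundary between the spans falls inside a quoted region,
    # i.e. a naive splitter cut a quotation in half.
    for q_start, q_end in quote_spans(text):
        if q_start < span[1] < q_end and q_start < next_span[0] < q_end:
            return True
    # Join if the next span is only a parenthetical continuation, e.g. "(see above)".
    next_start, next_end = next_span
    return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None


text = 'She said "Wait. I am coming." and left.'
spans = [(0, 15), (15, 29), (29, 39)]  # naive sentence spans: the quotation is split in two
print(needs_join(text, spans[0], spans[1]))  # True  -> keep the quoted sentence together
print(needs_join(text, spans[1], spans[2]))  # False -> genuine sentence boundary
```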
New release note file (4 additions):
@@ -0,0 +1,4 @@
+---
+enhancements:
+  - |
+    The NLTKDocumentSplitter was merged into the DocumentSplitter. You can now make use of more robust sentence boundary detection by initializing the DocumentSplitter with `split_by="nltk_sentence"`.
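
Based on the release note above, a hedged usage sketch of the unified component. The value `split_by="nltk_sentence"` is taken from this snapshot of the PR ("Changes from 16 commits"); later commits such as "adding sentence/period change warning" suggest the final option names may differ, so treat the parameter values as assumptions:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Assumed parameters: the split_by value comes from the release note in this PR snapshot;
# split_length and split_overlap are standard DocumentSplitter options.
splitter = DocumentSplitter(split_by="nltk_sentence", split_length=2, split_overlap=0)

# Assumption: NLTK resources may need to be loaded before running (e.g. via a warm-up step).
docs = [Document(content="Dr. Smith arrived at 9 a.m. He gave a talk. Then he left.")]
result = splitter.run(documents=docs)
print([d.content for d in result["documents"]])
```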