!feat: unify NLTKDocumentSplitter and DocumentSplitter #8617

Merged
davidsbatista merged 30 commits into main from unify-DocumentSplitter-NLTKDocumentSplitter
Dec 12, 2024

Conversation

Contributor

@davidsbatista davidsbatista commented Dec 9, 2024

Related Issues

Proposed Changes:

  • Refactor the DocumentSplitter to have 3 main splitting cases: by character, by function and by the new custom sentence tokenizer based on NLTK
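
To make the three cases concrete, here is a minimal sketch of how such a dispatch could look. It is not the code from this PR: `split_units`, `naive_sentence_split`, and the `DELIMITERS` table are hypothetical names, and a simple regex stands in for the NLTK-based sentence tokenizer so the example runs without extra dependencies.

```python
import re
from typing import Callable, List, Optional


def naive_sentence_split(text: str) -> List[str]:
    """Hypothetical stand-in for an NLTK-based sentence tokenizer."""
    # Split on whitespace that follows ., ! or ?, keeping the punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Illustrative delimiter table for the "split by character" case (assumed values).
DELIMITERS = {"page": "\f", "passage": "\n\n", "line": "\n", "word": " "}


def split_units(
    text: str,
    split_by: str,
    splitting_function: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Dispatch to one of three splitting strategies."""
    if split_by == "function":
        # Case 1: the caller supplies the splitting logic.
        if splitting_function is None:
            raise ValueError("split_by='function' requires a splitting_function")
        return splitting_function(text)
    if split_by == "sentence":
        # Case 2: sentence tokenizer (regex stand-in here, NLTK in the PR).
        return naive_sentence_split(text)
    if split_by in DELIMITERS:
        # Case 3: plain split on a character delimiter.
        return text.split(DELIMITERS[split_by])
    raise ValueError(f"unknown split_by: {split_by!r}")
```

Keeping the dispatch in one place means the chunking logic that follows (length, overlap) only has to deal with a list of units, whichever case produced them.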

How did you test it?

  • Merged the tests from NLTKDocumentSplitter, removed duplicates, and added new ones; also moved some tests and created new ones for the SentenceSplitter, which is called by the DocumentSplitter

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Dec 9, 2024
@coveralls
Collaborator

coveralls commented Dec 9, 2024

Pull Request Test Coverage Report for Build 12298154884

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 90.477%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/sentence_tokenizer.py | 1 | 94.05%

Totals
Change from base Build 12292536037: 0.1%
Covered Lines: 8104
Relevant Lines: 8957

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review December 10, 2024 10:04
@davidsbatista davidsbatista requested review from a team as code owners December 10, 2024 10:04
@davidsbatista davidsbatista requested review from dfokina, mpangrazzi and vblagoje and removed request for a team December 10, 2024 10:04
@davidsbatista davidsbatista changed the title from "Unify document splitter nltk document splitter" to "feat: unify NLTKDocumentSplitter and DocumentSplitter" Dec 10, 2024
@vblagoje
Member

@davidsbatista do we have to update (de)serialization with these new init params?
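
For context on what this question is about, here is a rough sketch of a serialization round trip that includes new init parameters. `SplitterSketch` is a hypothetical class, not the real DocumentSplitter, and the parameter names `language` and `use_split_rules` are assumed for illustration only.

```python
from typing import Any, Dict


class SplitterSketch:
    """Hypothetical component used only to illustrate (de)serialization."""

    def __init__(
        self,
        split_by: str = "word",
        split_length: int = 200,
        language: str = "en",          # assumed new init parameter
        use_split_rules: bool = True,  # assumed new init parameter
    ):
        self.split_by = split_by
        self.split_length = split_length
        self.language = language
        self.use_split_rules = use_split_rules

    def to_dict(self) -> Dict[str, Any]:
        # Every init parameter must appear here, or it is lost on reload.
        return {
            "type": type(self).__name__,
            "init_parameters": {
                "split_by": self.split_by,
                "split_length": self.split_length,
                "language": self.language,
                "use_split_rules": self.use_split_rules,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SplitterSketch":
        # Rebuild the component from the stored init parameters.
        return cls(**data["init_parameters"])
```

If a new parameter is left out of `to_dict`, a reloaded component silently falls back to the default value, which is the kind of regression a round-trip test catches.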

@davidsbatista
Contributor Author

@anakin87 thanks for the reminder, I had forgotten about it. Here's the PR for deprecating NLTKDocumentSplitter: #8628

Member

@anakin87 anakin87 left a comment

Let me explain better what needs to be done to comply with our deprecation policy:

  • in this PR, do not remove NLTKDocumentSplitter - leave the file untouched
  • deprecate NLTKDocumentSplitter - it can be done in the other PR that you created
  • these 2 PRs will be incorporated in 2.9.0
  • only after the 2.9.0 release, remove NLTKDocumentSplitter

Ping me if anything is unclear
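
The "deprecate first, remove later" step described above is commonly done by emitting a `DeprecationWarning` from the old class while it keeps working. The sketch below assumes the old class can simply delegate to the new one; `NewSplitter` and `OldSplitter` are hypothetical names, and this is not necessarily how the follow-up PR implements it.

```python
import warnings


class NewSplitter:
    """Hypothetical replacement component."""

    def __init__(self, split_by: str = "word"):
        self.split_by = split_by


class OldSplitter(NewSplitter):
    """Hypothetical deprecated component kept for one more release."""

    def __init__(self, *args, **kwargs):
        # Warn at construction time; stacklevel=2 points at the caller's line.
        warnings.warn(
            "OldSplitter is deprecated and will be removed in a future "
            "release; use NewSplitter instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Existing code keeps running during the deprecation window, and users see the warning before the class is removed.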

@@ -494,3 +498,301 @@ def test_run_document_only_whitespaces(self):
        doc = Document(content=" ")
        results = splitter.run([doc])
        assert results["documents"][0].content == " "


class TestSplittingNLTKSentenceSplitter:
Contributor

Just to confirm, did all tests from the test_nltk_document_splitter.py file make it into here?

Contributor Author

No, only the ones that cover NLTK-specific code, since NLTKDocumentSplitter also had tests covering the same functionality as DocumentSplitter.

Contributor Author

I might have missed something; I will double-check.

Contributor

@sjrl sjrl left a comment

This looks good to me!

Member

@anakin87 anakin87 left a comment

Please incorporate the suggested change and then feel free to merge.

haystack/components/preprocessors/document_splitter.py (outdated, resolved)
@davidsbatista
Contributor Author

can someone approve this one: #8628

@davidsbatista davidsbatista enabled auto-merge (squash) December 12, 2024 14:18
@davidsbatista davidsbatista merged commit 3f77d3a into main Dec 12, 2024
18 checks passed
@davidsbatista davidsbatista deleted the unify-DocumentSplitter-NLTKDocumentSplitter branch December 12, 2024 14:22
Labels
topic:tests, type:documentation (Improvements on the docs)
Development

Successfully merging this pull request may close these issues.

Unify DocumentSplitter and NLTKDocumentSplitter
6 participants