fix: Move potential nltk download to warm_up #8646

Merged (8 commits) on Dec 20, 2024

@sjrl sjrl commented Dec 16, 2024

Related Issues

  • fixes #issue-number

Proposed Changes:

Move the download of the NLTK files to the warm_up method. In general, we try to use warm_up to house expensive operations and calls to the internet.

More specifically, the validation service we use in dC needs to initialize a pipeline, which we do in multiple threads to speed things up. This means that components' init methods need to be thread safe. The download of the punkt files isn't thread safe, which causes thread locking in the validation service.
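
A minimal sketch of the pattern described above, not the actual Haystack code. `LazySplitter`, `download_fn`, and `fake_download` are hypothetical names used for illustration only: the constructor just stores configuration, and the expensive, non-thread-safe step is deferred to an idempotent `warm_up` guarded by a lock.

```python
import threading
from typing import Callable, List, Optional


class LazySplitter:
    """Hypothetical stand-in for a component; not the Haystack class."""

    def __init__(self, download_fn: Callable[[], str]) -> None:
        # __init__ only stores configuration: no network access here,
        # so constructing components from several threads is cheap and safe.
        self._download_fn = download_fn
        self._resource: Optional[str] = None
        self._lock = threading.Lock()

    def warm_up(self) -> None:
        # Idempotent: the expensive call runs at most once per instance,
        # even when warm_up() is called concurrently.
        with self._lock:
            if self._resource is None:
                self._resource = self._download_fn()

    def run(self, text: str) -> List[str]:
        if self._resource is None:
            raise RuntimeError("Call warm_up() before run().")
        # Naive sentence split, only to give run() something to do.
        return [part.strip() for part in text.split(".") if part.strip()]


download_calls: List[int] = []


def fake_download() -> str:
    # Stand-in for the real download; records how often it is invoked.
    download_calls.append(1)
    return "punkt"


splitter = LazySplitter(fake_download)  # nothing is downloaded yet

threads = [threading.Thread(target=splitter.warm_up) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

sentences = splitter.run("One. Two.")
```

With this shape, callers have to invoke `warm_up()` explicitly before `run()`, which is the trade-off this change accepts in exchange for a cheap, thread-safe constructor.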

How did you test it?

updated tests

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@sjrl sjrl requested a review from a team as a code owner December 16, 2024 16:54
@sjrl sjrl requested review from julian-risch and removed request for a team December 16, 2024 16:54
@github-actions github-actions bot added the type:documentation Improvements on the docs label Dec 16, 2024
@sjrl sjrl requested a review from a team as a code owner December 16, 2024 17:13
@sjrl sjrl requested review from dfokina and removed request for a team December 16, 2024 17:13

coveralls commented Dec 16, 2024

Pull Request Test Coverage Report for Build 12398909766

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 10 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.2%) to 90.644%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/generators/openai_utils.py | 1 | 83.33% |
| components/preprocessors/document_splitter.py | 1 | 99.51% |
| components/generators/chat/hugging_face_api.py | 2 | 97.67% |
| dataclasses/chat_message.py | 2 | 98.69% |
| components/preprocessors/nltk_document_splitter.py | 4 | 96.72% |

Totals

  • Change from base Build 12353137414: 0.2%
  • Covered Lines: 8303
  • Relevant Lines: 9160

💛 - Coveralls

@sjrl sjrl requested a review from wochinge December 17, 2024 07:11

@wochinge wochinge left a comment

🙌🏻

@julian-risch julian-risch left a comment

LGTM! 👍

@bilgeyucel tutorial 42 will break when this PR gets merged. We just need to add a splitter.warm_up() call. In the cookbooks, I expect only notebooks/rag_fastembed.ipynb to fail.

@sjrl I had a look at the deserialization methods and noticed that DocumentSplitter and NLTKDocumentSplitter differ in how they treat the splitting function. I will open a separate issue about adding a to_dict to NLTKDocumentSplitter, something like:

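    from typing import Any, Dict

    from haystack import default_from_dict
    from haystack.utils import deserialize_callable

    # Method of NLTKDocumentSplitter (rest of the class body omitted).
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "NLTKDocumentSplitter":
        """
        Deserializes the component from a dictionary.
        """
        init_params = data.get("init_parameters", {})

        # The splitting function is stored as a string; turn it back into a callable.
        splitting_function = init_params.get("splitting_function", None)
        if splitting_function:
            init_params["splitting_function"] = deserialize_callable(splitting_function)

        return default_from_dict(cls, data)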
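
For context, a rough, self-contained sketch of the round trip that such a method relies on: a callable is stored as an importable dotted path and looked up again on load. `callable_to_path` and `path_to_callable` are hypothetical helpers written for this illustration; Haystack's own serialization helpers may differ in detail.

```python
import importlib
import textwrap
from typing import Any, Callable, Dict


def callable_to_path(fn: Callable[..., Any]) -> str:
    """Hypothetical helper: encode a module-level function as 'module.name'."""
    return f"{fn.__module__}.{fn.__name__}"


def path_to_callable(path: str) -> Callable[..., Any]:
    """Hypothetical helper: import the module and look the function up again."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# Serialize: the callable becomes a plain string that can be stored as YAML or JSON.
data: Dict[str, Any] = {
    "init_parameters": {"splitting_function": callable_to_path(textwrap.wrap)},
}

# Deserialize: replace the string with the callable before building the component.
init_params = data["init_parameters"]
if init_params.get("splitting_function"):
    init_params["splitting_function"] = path_to_callable(init_params["splitting_function"])
```

This simple scheme only works for functions importable by module and name; lambdas and nested functions cannot be restored this way.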

@sjrl sjrl merged commit 286061f into main Dec 20, 2024
18 checks passed
@sjrl sjrl deleted the nltk-download-warm-up branch December 20, 2024 09:41