fix: Move potential nltk download to warm_up #8646
Pull Request Test Coverage Report for Build 12398909766
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
🙌🏻
LGTM! 👍
@bilgeyucel tutorial 42 will break when this PR gets merged. We just need to add a splitter.warm_up() call. In the cookbooks, I expect only notebooks/rag_fastembed.ipynb to fail.
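For illustration, the call-site change described above could look roughly like the sketch below. ToySplitter is a hypothetical stand-in, not the Haystack component, and the sketch assumes that run() refuses to work until warm_up() has been called.

```python
from typing import List


class ToySplitter:
    """Hypothetical stand-in for a splitter with a warm_up step; not Haystack code."""

    def __init__(self) -> None:
        # Nothing expensive happens at construction time.
        self._tokenizer = None

    def warm_up(self) -> None:
        # Stands in for loading the NLTK resources; idempotent.
        if self._tokenizer is None:
            self._tokenizer = str.split

    def run(self, text: str) -> List[str]:
        if self._tokenizer is None:
            raise RuntimeError("ToySplitter wasn't warmed up. Call warm_up() before run().")
        return self._tokenizer(text)


splitter = ToySplitter()
splitter.warm_up()  # the one-line addition a notebook would need
chunks = splitter.run("split me please")
```

Components that are run inside a pipeline are usually warmed up by the pipeline itself, so the extra call mainly matters when a component is used standalone, as in a notebook.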
@sjrl I had a look at the deserialization methods and noticed that DocumentSplitter and NLTKDocumentSplitter differ in how they treat the splitting function. I will open a separate issue about adding a from_dict to NLTKDocumentSplitter, something like:
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "NLTKDocumentSplitter":
    """
    Deserializes the component from a dictionary.
    """
    init_params = data.get("init_parameters", {})
    splitting_function = init_params.get("splitting_function", None)
    if splitting_function:
        init_params["splitting_function"] = deserialize_callable(splitting_function)
    return default_from_dict(cls, data)
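A self-contained sketch of the same round trip, for readers without Haystack at hand. load_callable and ToySplitter are hypothetical stand-ins for deserialize_callable and the real component, and the assumption is that the callable is serialized as a dotted import path.

```python
import importlib
from typing import Any, Callable, Dict, List, Optional


def load_callable(path: str) -> Callable:
    """Hypothetical helper: resolve a dotted path such as 'shlex.split' to an object."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class ToySplitter:
    """Hypothetical component; not the real NLTKDocumentSplitter."""

    def __init__(
        self,
        split_length: int = 2,
        splitting_function: Optional[Callable[[str], List[str]]] = None,
    ) -> None:
        self.split_length = split_length
        self.splitting_function = splitting_function

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToySplitter":
        # Copy so the caller's dictionary is not mutated.
        init_params = dict(data.get("init_parameters", {}))
        fn_path = init_params.get("splitting_function")
        if fn_path:
            # Swap the serialized path for the actual callable.
            init_params["splitting_function"] = load_callable(fn_path)
        return cls(**init_params)


splitter = ToySplitter.from_dict(
    {
        "type": "ToySplitter",
        "init_parameters": {"split_length": 3, "splitting_function": "shlex.split"},
    }
)
```

The sketch copies init_parameters before changing it, whereas the snippet above edits the dictionary in place and relies on it being the same object that is stored in data.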
Related Issues
Proposed Changes:
Move the download of the NLTK files to the warm_up method. In general, we use warm_up to house expensive operations and calls to the internet.
More specifically, the validation service we use in dC has to initialize a pipeline, which we do in a threaded way to speed things up. This means that components' init methods need to be thread-safe. The download of the punkt files isn't thread-safe, which causes thread locking in the validation service.
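The pattern can be sketched as follows. This is a minimal illustration, not the code in this PR: LazyResourceComponent is hypothetical, the download is simulated with a counter, and the lock around the one-time load is an assumption of the sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyResourceComponent:
    """Hypothetical component; the download is simulated, no network access."""

    _lock = threading.Lock()
    _resource_ready = False
    download_count = 0  # lets the sketch observe how often the load ran

    def __init__(self, language: str = "en") -> None:
        # Only stores configuration, so it is safe to call from many threads.
        self.language = language
        self._warmed_up = False

    def warm_up(self) -> None:
        if self._warmed_up:
            return
        # Serialize the one-time load so concurrent callers cannot race.
        with LazyResourceComponent._lock:
            if not LazyResourceComponent._resource_ready:
                LazyResourceComponent.download_count += 1  # simulated download
                LazyResourceComponent._resource_ready = True
        self._warmed_up = True


# Construct components from several threads, as a validation service might.
with ThreadPoolExecutor(max_workers=8) as pool:
    components = list(pool.map(lambda _: LazyResourceComponent(), range(32)))

downloads_after_init = LazyResourceComponent.download_count

for component in components:
    component.warm_up()

downloads_after_warm_up = LazyResourceComponent.download_count
```

Construction touches no shared state, so threaded initialization cannot contend on the download. The expensive step runs only when warm_up is called.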
How did you test it?
updated tests
Notes for the reviewer
Checklist
Conventional commit types for the PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:, with ! added in case the PR includes breaking changes.