fix: 1.x - nltk upgrade, use nltk.download('punkt_tab')
#8256
Conversation
vblagoje commented on Aug 20, 2024 (edited)
- We needed to update a few more dependencies to get a green CI
- We needed to skip the nltk preprocessing tests that load pickled models (loading them appears to be forbidden in nltk 3.9)
- fixes Upgrade Haystack 1.x to NLTK 3.9 #8238

I've managed to get the CI to pass. Note the changes in dependencies: it couldn't be done without them, and we need to pin a few more dependencies, which is OK. The nltk tests that were failing are related to the inability to load old models from pickle files, which I think is forbidden in nltk 3.9.x. I'll upgrade this draft into a full PR.
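For reference, a minimal sketch of the tokenizer data this upgrade relies on with nltk >= 3.9; the exact call site inside the PreProcessor may differ:

```python
import nltk
from nltk.tokenize import sent_tokenize

# nltk 3.9 ships sentence tokenizer data as "punkt_tab" instead of the old
# pickled "punkt" models, which it now refuses to load.
nltk.download("punkt_tab")

print(sent_tokenize("Haystack 1.x now downloads punkt_tab. The pickled punkt models are gone."))
```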
nltk.download('punkt_tab')
Disabling custom tokenizers is a bigger limitation, but for now it's our best option, in my opinion. We don't want to rewrite how the PreProcessor loads custom models right now. Users can still choose not to upgrade to the next Haystack 1.26.x release.
I would make this limitation a bit more evident.
> I opted for always None-ing `tokenizer_model_folder` and logging the warning with a resolution path.
I agree with your approach.
I left some comments to better understand...
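For context, a rough sketch of what "always None-ing and warning" could look like; the helper name `_drop_custom_tokenizer_folder` and the warning text here are illustrative, not the actual diff:

```python
import logging

logger = logging.getLogger(__name__)

def _drop_custom_tokenizer_folder(tokenizer_model_folder):
    # Hypothetical helper: nltk >= 3.9 no longer loads the old pickled punkt
    # models, so any custom tokenizer folder is ignored with a warning that
    # points users at the built-in punkt_tab models.
    if tokenizer_model_folder is not None:
        logger.warning(
            "Custom punkt models in '%s' cannot be loaded with nltk >= 3.9. "
            "Ignoring tokenizer_model_folder and falling back to the built-in "
            "'punkt_tab' tokenizer. Stay on an earlier Haystack 1.x release "
            "if you need custom tokenizer models.",
            tokenizer_model_folder,
        )
    return None
```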
Co-authored-by: Stefano Fiorucci <[email protected]>
OK for me.
I would prefer that @julian-risch also take a look.
Makes sense 🙏
We should change https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L932 and https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L939 to use the following instead:
from nltk.tokenize.punkt import PunktTokenizer
tokenizer = PunktTokenizer(language_name)
Just like it is done in nltk/nltk@496515e.
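For illustration, a small usage sketch of that replacement, assuming `language_name` is the language string the PreProcessor already resolves (e.g. "english"):

```python
from nltk.tokenize.punkt import PunktTokenizer

language_name = "english"
# PunktTokenizer loads the punkt_tab data for the given language instead of
# unpickling a model file, so it works with nltk >= 3.9.
tokenizer = PunktTokenizer(language_name)
print(tokenizer.tokenize("PunktTokenizer reads punkt_tab data. No pickle files are involved."))
```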
This is also how I understand the first part of the comment by @sagarneeldubey #8238 (comment)
You could reach out to them directly to understand what changes they made in their custom preprocessor component, and whether this PR can replace their custom preprocessor.