
fix: 1.x - nltk upgrade, use nltk.download('punkt_tab') #8256

Merged · 15 commits · Aug 29, 2024

Conversation

@vblagoje (Member) commented Aug 20, 2024:

  • We needed to update a few more dependencies to get a green CI.
  • We needed to skip the nltk preprocessing tests that load pickled models (this seems to be forbidden in nltk 3.9).
  • Fixes: Upgrade Haystack 1.x to NLTK 3.9 #8238
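The fix named in the PR title can be sketched as follows. This is a minimal illustration, not the actual Haystack code: `ensure_punkt_tab` is a hypothetical helper name, while `nltk.data.find` and `nltk.download` are real nltk APIs. nltk 3.9 ships the Punkt sentence-tokenizer parameters as plain-text `punkt_tab` files instead of pickled models, so code that used to call `nltk.download('punkt')` should download `punkt_tab` instead:

```python
import nltk


def ensure_punkt_tab() -> None:
    """Hypothetical helper: make sure the 'punkt_tab' resource is available.

    nltk 3.9 replaced the pickled 'punkt' models with plain-text 'punkt_tab'
    parameter files, so we look that resource up and download it if missing.
    """
    try:
        # Raises LookupError when the resource has not been downloaded yet
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab")
```

Calling `ensure_punkt_tab()` once at startup avoids repeated download attempts on every tokenizer construction.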

@vblagoje (Member, Author) commented:

I've managed to get the CI to pass. Note the changes in dependencies: the upgrade couldn't be done without them, and we need to pin a few more dependencies, which is ok.

The failing nltk tests are related to the inability to load old models from pickle files, which I believe is forbidden as of nltk 3.9.x.

I'll promote this draft into a ready-for-review PR.

@vblagoje vblagoje changed the title draft: Nltk update exp1 fix: 1.x - nltk upgrade, use nltk.download('punkt_tab') Aug 20, 2024
@vblagoje vblagoje marked this pull request as ready for review August 20, 2024 11:56
@vblagoje vblagoje requested review from a team as code owners August 20, 2024 11:56
@vblagoje vblagoje requested review from dfokina, Amnah199, anakin87, julian-risch and silvanocerza and removed request for a team and Amnah199 August 20, 2024 11:56
@julian-risch (Member) left a comment:

Disabling custom tokenizers is a bigger limitation, but for now it's our best option, in my opinion. We don't want to rewrite how the PreProcessor loads custom models right now. Users can still choose not to upgrade to the next Haystack 1.26.x release.

@anakin87 (Member) commented:

I would make this limitation a bit more evident.

  • If we don't want to suppress the tokenizer_model_folder parameter, we can log a clear warning.
  • Let's also add an upgrade entry in the release notes.

@silvanocerza silvanocerza removed their request for review August 22, 2024 07:58
@vblagoje (Member, Author) commented:

> I would make this limitation a bit more evident.
>
>   • if we don't want to suppress the parameter tokenizer_model_folder, we can log a clear warning.
>   • let's also add an upgrade entry in the release note.

I opted for always setting tokenizer_model_folder to None and logging a warning that includes a resolution path. This way we don't have to touch the codebase much and risk unintended consequences. Let me know if you have a better proposal, @julian-risch @anakin87.
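As a rough illustration of that approach (a sketch with hypothetical names, not the actual Haystack PreProcessor code), the parameter could always be ignored while a warning points users at their options:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_tokenizer_model_folder(tokenizer_model_folder: Optional[str]) -> None:
    """Hypothetical sketch: always ignore a custom tokenizer model folder.

    nltk 3.9 no longer loads pickled Punkt models, so a user-supplied model
    folder cannot be honored; warn and fall back to the built-in tokenizer.
    """
    if tokenizer_model_folder is not None:
        logger.warning(
            "Custom NLTK tokenizer models are not supported with nltk>=3.9 "
            "because pickled models can no longer be loaded. Ignoring "
            "tokenizer_model_folder=%r and using the built-in 'punkt_tab' "
            "tokenizer instead. To keep using custom models, stay on an "
            "earlier Haystack 1.x release with nltk<3.9.",
            tokenizer_model_folder,
        )
    # Always fall back to the default tokenizer
    return None
```

The advantage of this shape is that callers don't need to change: the parameter still exists, it just no longer has any effect.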

@vblagoje vblagoje requested a review from julian-risch August 29, 2024 08:16
@anakin87 (Member) left a comment:

> I opted for always None-ing tokenizer_model_folder and logging the warning with resolution path.

I agree with your approach.

I left some comments to better understand...

(Several review threads on pyproject.toml, since resolved.)
vblagoje and others added 2 commits August 29, 2024 10:57
@anakin87 anakin87 self-requested a review August 29, 2024 09:25
@anakin87 (Member) left a comment:

OK for me.

I would prefer that @julian-risch also take a look.

@vblagoje (Member, Author) commented:

> OK for me.
>
> I would prefer that @julian-risch also take a look.

Makes sense 🙏

@julian-risch (Member) left a comment:

We should change
https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L932 and https://github.com/deepset-ai/haystack/blob/nltk_update_exp1/haystack/nodes/preprocessor/preprocessor.py#L939
to use the following instead:

```python
from nltk.tokenize.punkt import PunktTokenizer

tokenizer = PunktTokenizer(language_name)
```

Just like it is done in nltk/nltk@496515e.

This is also how I understand the first part of the comment by @sagarneeldubey in #8238 (comment).
You could reach out to them directly to understand what changes they made in their custom preprocessor component, and whether this PR can replace it.

@vblagoje vblagoje merged commit 8c95fab into v1.26.x Aug 29, 2024
57 checks passed
@vblagoje vblagoje deleted the nltk_update_exp1 branch August 29, 2024 13:31