Named entity extractor private models #8658

mpangrazzi · 2024-12-18T16:08:43Z

Related Issues

fixes NamedEntityExtractor not usable with private models #8633

Proposed Changes:

I've added token support to NamedEntityExtractor and updated the HF backend to pass the token down to the tokenizer and model from_pretrained methods.
I've duplicated the public dslim/bert-base-NER model to a private deepset/bert-base-NER model and added the required tests.
I've also fixed formatting of an existing error message about failed backend initialization.
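
As a rough illustration of the first change, the sketch below shows one way a token could be resolved and forwarded to both from_pretrained calls. It is not the code from this PR: the helper names (resolve_hf_token, pretrained_kwargs, load_ner_model) and the HF_TOKEN fallback are assumptions made for the example. Only the `token` keyword of the transformers from_pretrained methods is taken as given.

```python
import os
from typing import Any, Dict, Optional


def resolve_hf_token(explicit_token: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if given, otherwise look it up in the environment.

    HF_API_TOKEN is the variable named in this PR; the HF_TOKEN fallback is an assumption.
    """
    if explicit_token:
        return explicit_token
    return os.environ.get("HF_API_TOKEN") or os.environ.get("HF_TOKEN")


def pretrained_kwargs(token: Optional[str]) -> Dict[str, Any]:
    """Build the extra keyword arguments shared by the tokenizer and the model."""
    # Omit the key entirely when there is no token, so public models load as before.
    return {"token": token} if token else {}


def load_ner_model(model_name: str, token: Optional[str] = None):
    """Hypothetical loader: forwards the same token to the tokenizer and the model."""
    # Imported lazily so the helpers above can be used without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    kwargs = pretrained_kwargs(resolve_hf_token(token))
    tokenizer = AutoTokenizer.from_pretrained(model_name, **kwargs)
    model = AutoModelForTokenClassification.from_pretrained(model_name, **kwargs)
    return tokenizer, model
```

Calling load_ner_model("deepset/bert-base-NER") with HF_API_TOKEN exported would then authenticate both downloads with the same credential.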

How did you test it?

Local unit testing / e2e

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-12-18T16:14:17Z

Pull Request Test Coverage Report for Build 12429166430

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
122 unchanged lines in 8 files lost coverage.
Overall coverage increased (+0.03%) to 90.687%

Files with Coverage Reduction	New Missed Lines	%
components/preprocessors/document_splitter.py	1	99.51%
components/generators/chat/hugging_face_api.py	4	96.4%
components/preprocessors/nltk_document_splitter.py	4	96.72%
utils/hf.py	7	86.29%
utils/type_serialization.py	7	89.04%
components/generators/chat/hugging_face_local.py	12	85.32%
core/pipeline/base.py	32	88.64%
components/extractors/named_entity_extractor.py	55	68.28%

Totals
Change from base Build 12392158409:	0.03%
Covered Lines:	8374
Relevant Lines:	9234

💛 - Coveralls

julian-risch

The code changes already look quite good to me. We only need to update the from_dict/to_dict methods. I also have a question about one test case and its comment, which I find confusing: the test won't and shouldn't fail if HF_API_TOKEN is not set, so let's simply remove the comment. Otherwise, that would mean we are trying to download the model, and in that case the test should be marked with:

    @pytest.mark.integration
    @pytest.mark.skipif(
        not os.environ.get("HF_API_TOKEN", None),
        reason="Export an env var called HF_API_TOKEN containing the Hugging Face token to run this test.",
    )

For how to update from_dict/to_dict you can have a look at other components, for example HuggingFaceLocalGenerator. We should also update the tests to have a test_to_dict_default and test_to_dict_with_parameters. The tests with ...no_default_parameters_hf should use monkeypatch.delenv("HF_API_TOKEN", raising=False)
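
To make the serialization request concrete, here is a minimal, self-contained sketch of a component whose token is stored as a reference to an environment variable, so that to_dict never writes the secret value itself. EnvVarToken and ExtractorSketch are hypothetical stand-ins invented for this example. They are not Haystack classes and do not reproduce the implementation in HuggingFaceLocalGenerator.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class EnvVarToken:
    """Hypothetical secret that only remembers which environment variable to read."""

    env_var: str = "HF_API_TOKEN"

    def resolve(self) -> Optional[str]:
        # The value is looked up lazily, so it is never stored on the object.
        return os.environ.get(self.env_var)

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "env_var", "env_var": self.env_var}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "EnvVarToken":
        return cls(env_var=data["env_var"])


class ExtractorSketch:
    """Hypothetical component illustrating token-aware to_dict/from_dict."""

    def __init__(self, model: str, token: Optional[EnvVarToken] = EnvVarToken()):
        self.model = model
        self.token = token

    def to_dict(self) -> Dict[str, Any]:
        # Serialize the reference to the secret, never the resolved value.
        return {"model": self.model, "token": self.token.to_dict() if self.token else None}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExtractorSketch":
        token_data = data.get("token")
        token = EnvVarToken.from_dict(token_data) if token_data else None
        return cls(model=data["model"], token=token)
```

Because the token is resolved from the environment at call time, a test that asserts on default parameters depends on whether HF_API_TOKEN happens to be set on the machine running it, which is why removing the variable first (as with monkeypatch.delenv above) keeps such tests deterministic.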

test/components/extractors/test_named_entity_extractor.py

haystack/components/extractors/named_entity_extractor.py

e2e/pipelines/test_named_entity_extractor.py

julian-risch

One more question and then we're ready to merge.

julian-risch · 2024-12-19T17:43:25Z

test/components/generators/test_hugging_face_local_generator.py

-            "device": "cuda:0",
-            "token": "another-test-token",
-        }
+        huggingface_pipeline_kwargs = {"model": "gpt2", "device": "cuda:0", "token": "another-test-token"}


Could you please briefly explain this change? Previously, we explicitly set task to "text-generation" in pipeline_kwargs and initialized the component with task="text2text-generation" to check that the pipeline_kwargs override the init parameter, right?

reverted as discussed

julian-risch

LGTM! 👍

mpangrazzi added 2 commits December 18, 2024 17:05

add 'token' support to NamedEntityExtractor to enable using private m…

26d6b9a

…odels on HF backend

fix existing error message format

6b043da

github-actions bot added topic:tests type:documentation Improvements on the docs labels Dec 18, 2024

mpangrazzi added 2 commits December 18, 2024 17:42

add release note

adb7627

add HF_API_TOKEN to e2e workflow

9ccdd0e

github-actions bot added the topic:CI label Dec 19, 2024

add informative comment

505b9d4

mpangrazzi marked this pull request as ready for review December 19, 2024 08:44

mpangrazzi requested review from a team as code owners December 19, 2024 08:44

mpangrazzi requested review from dfokina, davidsbatista and julian-risch and removed request for a team December 19, 2024 08:44

julian-risch removed the request for review from davidsbatista December 19, 2024 10:10

julian-risch requested changes Dec 19, 2024

test/components/extractors/test_named_entity_extractor.py

haystack/components/extractors/named_entity_extractor.py

davidsbatista reviewed Dec 19, 2024

e2e/pipelines/test_named_entity_extractor.py

mpangrazzi added 2 commits December 19, 2024 16:47

Updated to_dict / from_dict to handle 'token' correctly ; Added tests

35edc33

Fix lint

806b56a

julian-risch requested changes Dec 20, 2024

Revert unwanted change

3d9ed8f

julian-risch self-requested a review December 20, 2024 10:12

julian-risch approved these changes Dec 20, 2024

mpangrazzi merged commit c192488 into main Dec 20, 2024
20 checks passed

mpangrazzi deleted the named-entity-extractor-private-models branch December 20, 2024 10:15
