feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

AnushreeBannadabhavi · 2024-03-07T04:11:11Z

Related Issues

fixes Check document_store and embedding_model dimensions before calculating embeddings #5188

Proposed Changes:

Currently, in haystack v1.x, when document_store.update_embeddings() gets called, the embeddings are calculated for all the documents and then saved to the document store. If the embedding dimension set in the document store differs from that of the retriever model, a runtime error is raised.
However, we don't need to calculate embeddings for all the documents in the document store to raise this error. This is especially an issue when using paid APIs (openai) or when the embedding calculation is time consuming.
A more efficient alternative is to calculate the embedding of a single document and verify that the embedding dimensions match between the document store and the retriever.

How did you test it?

Added two unit tests in test_faiss.py
Modified the basic_faq_pipeline.py to test the change.
- Used FAISSDocumentStore instead of ElasticsearchDocumentStore. I instantiated the docustore without the embedding_dim parameter. Therefore the default value 768 is set in the docustore making the embedding dimension of the retriever and docustore different.

Notes for the reviewer

The same feature can be added in other document stores: memory.py, mongodb_atlas.py, pinecone.py, search_engine.py and weaviate.py. Once this change is reviewed and looks fine, I can add it in the other document stores too.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

…#6713) * first draft for ranker * same for the reader * consider also bnb_4bit_compute_dtype * dtype serialization in hugging_face_local_generator * add release note * address dtype defined in huggingface_pipeline_kwargs * test quantization options in reader * fix * serialize quantization_config * test quantization_config serialization * address feedback * fix typo

…ators (#6715) * renamed model_name or model_name_or_path to model * added release notes * Update releasenotes/notes/renamed-model_name-or-model_name_or_path-to-model-184490cbb66c4d7c.yaml --------- Co-authored-by: ZanSara <[email protected]>

…Transcriber classes (#6731) * rename model parameter in local transcriber * fix tests for local transcriber * rename model parameter in remote transcriber * fix tests for remote transcriber * reno --------- Co-authored-by: Stefano Fiorucci <[email protected]>

…Embedder classes (#6733) * rename model parameter in the openai doc embedder * fix tests for openai doc embedder * rename model parameter in the openai text embedder * fix tests for openai text embedder * rename model parameter in the st doc embedder * fix tests for st doc embedder * rename model parameter in the st backend * fix tests for st backend * rename model parameter in the st text embedder * fix tests for st text embedder * fix docstring * fix pipeline utils * fix e2e * reno * fix the indexing pipeline _create_embedder function * fix e2e eval rag pipeline * pytest

) * Add method to set a Component input type with default value * Add release notes * Fix linting * Stick to old set_input_types for now

…ityRanker` (#6734) * rename model parameter in transformers ranker * fix tests for transformers ranker * reno * reno * typo

…rameters correctly (#6701) * fix: `ComponentMeta.__call__` handles keyword- and positional-only parameters correctly * Update release note

… score of 0.0 (#6717) * fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0 Also update tests to accommodate the new behavior. * Remove superfluous code

* highlight optional connections in Pipeline.draw() * reno

* fix hybrid pipeline e2e test * warmup * write to the right docstore

…6736) * rename model parameter and internam model attribute in ExtractiveReader * fix tests for ExtractiveReader * fix e2e * reno * another fix * review feedback * Update releasenotes/notes/rename-model-param-reader-b8cbb0d638e3b8c2.yaml

* Move StopWordsCriteria to hf_utils.py * Raise ValueError for invalid StopWordsCriteria tokenizer * StopWordsCriteria, make sure padding token exists * Use proper torch types * Update unit tests

* Add weight and ranking_mode as params to run for easier experimentation * renaming of metadata to meta * User logger.warning instead of warnings * Add another unit test * Add support for sort_order and fix formatting of error messages * Make MetaFieldRanker more robust. Doesn't crash pipeline if some Documents are missing keys. * Don't print same warning message twice * Add another test * Making MetaFieldRanker more robust * Move up if return statement to earlier in the function * Setting up infer_type * Remove infer_type for now * Release notes * Add init file * Update releasenotes/notes/metafieldranker_sort-order_refactor-2000d89dc40dc15a.yaml Co-authored-by: Stefano Fiorucci <[email protected]> --------- Co-authored-by: Stefano Fiorucci <[email protected]>

Prior to this change, this broke `pytest` workflows in VSCode due to identical test names in this file and the integration/unit test file.

…r` (#6744) * rename model_name_or_path to simply model * fix tests * reno

* feat: Framework-agnostic device management * Add release note * Linting * Fix test * Add `first_device` property, expand release notes, validate `ComponentDevice` state

* Fix lazy import in HuggingFaceLocalGenerator * Fix pylint * Import fix after merge

* feat-added-split-by-page-to-DocumentSplitter * added test case and the suggested changes * Update document_splitter.py * Update haystack/components/preprocessors/document_splitter.py * Update test_document_splitter.py --------- Co-authored-by: Sebastian Husch Lee <[email protected]>

* refactor: Move HF-specific model serde code to a new submodule. * Remove unused import

* Speedup tests for PyPDFToDocument * Added unit test and removed skipping of empty pages * add release note * Add back some integration marks

Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](actions/cache@v3...v4) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

CLAassistant · 2024-03-14T03:26:46Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
7 out of 9 committers have signed the CLA.

✅ silvanocerza
✅ dfokina
✅ shadeMe
✅ davidsbatista
✅ dcbark01
✅ Lord-Haji
✅ awinml
❌ bilgeyucel
❌ wochinge
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

AnushreeBannadabhavi · 2024-03-14T04:05:31Z

I accidentally did a force push. I'll just close this PR and open a new one instead of reverting the push

anakin87 and others added 30 commits January 12, 2024 12:22

rename model_name_or_path to model in test (#6732)

24c71bd

fix leftover model_name_or_path param (#6737)

3156343

feat: Add method to set a Component input type with default value (#6728

e0afe27

) * Add method to set a Component input type with default value * Add release notes * Fix linting * Stick to old set_input_types for now

feat!: Rename model_name_or_path to model in `TransformersSimilar…

0616197

…ityRanker` (#6734) * rename model parameter in transformers ranker * fix tests for transformers ranker * reno * reno * typo

fix: ComponentMeta.__call__ handles keyword- and positional-only pa…

4647f2a

…rameters correctly (#6701) * fix: `ComponentMeta.__call__` handles keyword- and positional-only parameters correctly * Update release note

fix!: InMemoryBM25Retriever no longer returns documents that have a…

a5189dd

… score of 0.0 (#6717) * fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0 Also update tests to accommodate the new behavior. * Remove superfluous code

feat: Highlight optional connections in Pipeline.draw() (#6724)

24afc2a

* highlight optional connections in Pipeline.draw() * reno

fix pipeline test (#6741)

8eba053

fix: hybrid pipeline e2e test (#6740)

b236ea4

* fix hybrid pipeline e2e test * warmup * write to the right docstore

refactor: Extract HF stop words handling in hf_utils.py (#6745)

8cafff0

* Move StopWordsCriteria to hf_utils.py * Raise ValueError for invalid StopWordsCriteria tokenizer * StopWordsCriteria, make sure padding token exists * Use proper torch types * Update unit tests

test: Rename RAG E2E test file (#6750)

d6cafea

Prior to this change, this broke `pytest` workflows in VSCode due to identical test names in this file and the integration/unit test file.

fix a few docstrings (#6743)

909c1eb

feat!: rename model_name_or_path to model in `NamedEntityExtracto…

b8b8b5d

…r` (#6744) * rename model_name_or_path to simply model * fix tests * reno

feat!: Framework-agnostic device management (#6748)

7376838

* feat: Framework-agnostic device management * Add release note * Linting * Fix test * Add `first_device` property, expand release notes, validate `ComponentDevice` state

chore: Fix lazy import in HuggingFaceLocalGenerator (#6752)

88191e7

* Fix lazy import in HuggingFaceLocalGenerator * Fix pylint * Import fix after merge

test: Update E2E tests to use Pipeline.dump/load (#6756)

6a15145

chore: bump Haystack version to beta5 (#6757)

d1bdb8c

refactor: Move HF-specific model serde code to a new submodule. (#6754)

eaec5bf

* refactor: Move HF-specific model serde code to a new submodule. * Remove unused import

feat: Add page breaks to default PDF to Document converter (#6755)

c0b6743

* Speedup tests for PyPDFToDocument * Added unit test and removed skipping of empty pages * add release note * Add back some integration marks

feat: Add serde methods to HTMLToDocument (#6758)

5d66d04

Fix __version__ handling (#6765)

1c76aa0

Speed up Document dataclass import (#6767)

8079501

AnushreeBannadabhavi closed this Mar 14, 2024

AnushreeBannadabhavi mentioned this pull request Mar 14, 2024

feat: check document store and retriever dimensions before calculating embeddings for all documents #7357

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

AnushreeBannadabhavi commented Mar 7, 2024 •

edited

Loading

CLAassistant commented Mar 14, 2024 •

edited

Loading

AnushreeBannadabhavi commented Mar 14, 2024

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

Conversation

AnushreeBannadabhavi commented Mar 7, 2024 • edited Loading

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

CLAassistant commented Mar 14, 2024 • edited Loading

AnushreeBannadabhavi commented Mar 14, 2024

AnushreeBannadabhavi commented Mar 7, 2024 •

edited

Loading

CLAassistant commented Mar 14, 2024 •

edited

Loading