Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

Closed
wants to merge 397 commits into from
Closed

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

wants to merge 397 commits into from

Conversation

AnushreeBannadabhavi
Copy link
Contributor

@AnushreeBannadabhavi AnushreeBannadabhavi commented Mar 7, 2024

Related Issues

Proposed Changes:

  • Currently, in haystack v1.x, when document_store.update_embeddings() gets called, the embeddings are calculated for all the documents and then saved to the document store. If the embedding dimension set in the document store differs from that of the retriever model, a runtime error is raised.
  • However, we don't need to calculate embeddings for all the documents in the document store to raise this error. This is especially an issue when using paid APIs (openai) or when the embedding calculation is time consuming.
  • A more efficient alternative is to calculate the embedding of a single document and verify that the embedding dimensions match between the document store and the retriever.

How did you test it?

  • Added two unit tests in test_faiss.py
  • Modified the basic_faq_pipeline.py to test the change.
    - Used FAISSDocumentStore instead of ElasticsearchDocumentStore. I instantiated the docustore without the embedding_dim parameter. Therefore the default value 768 is set in the docustore making the embedding dimension of the retriever and docustore different.

Notes for the reviewer

The same feature can be added in other document stores: memory.py, mongodb_atlas.py, pinecone.py, search_engine.py and weaviate.py. Once this change is reviewed and looks fine, I can add it in the other document stores too.

Checklist

anakin87 and others added 30 commits January 12, 2024 12:22
…#6713)

* first draft for ranker

* same for the reader

* consider also bnb_4bit_compute_dtype

* dtype serialization in hugging_face_local_generator

* add release note

* address dtype defined in huggingface_pipeline_kwargs

* test quantization options in reader

* fix

* serialize quantization_config

* test quantization_config serialization

* address feedback

* fix typo
…ators (#6715)

* renamed model_name or model_name_or_path to model

* added release notes

* Update releasenotes/notes/renamed-model_name-or-model_name_or_path-to-model-184490cbb66c4d7c.yaml

---------

Co-authored-by: ZanSara <[email protected]>
…Transcriber classes (#6731)

* rename model parameter in local transcriber

* fix tests for local transcriber

* rename model parameter in remote transcriber

* fix tests for remote transcriber

* reno

---------

Co-authored-by: Stefano Fiorucci <[email protected]>
…Embedder classes (#6733)

* rename model parameter in the openai doc embedder

* fix tests for openai doc embedder

* rename model parameter in the openai text embedder

* fix tests for openai text embedder

* rename model parameter in the st doc embedder

* fix tests for st doc embedder

* rename model parameter in the st backend

* fix tests for st backend

* rename model parameter in the st text embedder

* fix tests for st text embedder

* fix docstring

* fix pipeline utils

* fix e2e

* reno

* fix the indexing pipeline _create_embedder function

* fix e2e eval rag pipeline

* pytest
)

* Add method to set a Component input type with default value

* Add release notes

* Fix linting

* Stick to old set_input_types for now
…ityRanker` (#6734)

* rename model parameter in transformers ranker

* fix tests for transformers ranker

* reno

* reno

* typo
…rameters correctly (#6701)

* fix: `ComponentMeta.__call__` handles keyword- and positional-only parameters correctly

* Update release note
… score of 0.0 (#6717)

* fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0

Also update tests to accommodate the new behavior.

* Remove superfluous code
* highlight optional connections in Pipeline.draw()

* reno
* fix hybrid pipeline e2e test

* warmup

* write to the right docstore
…6736)

* rename model parameter and internam model attribute in ExtractiveReader

* fix tests for ExtractiveReader

* fix e2e

* reno

* another fix

* review feedback

* Update releasenotes/notes/rename-model-param-reader-b8cbb0d638e3b8c2.yaml
* Move StopWordsCriteria to hf_utils.py

* Raise ValueError for invalid StopWordsCriteria tokenizer

* StopWordsCriteria, make sure padding token exists

* Use proper torch types

* Update unit tests
* Add weight and ranking_mode as params to run for easier experimentation

* renaming of metadata to meta

* User logger.warning instead of warnings

* Add another unit test

* Add support for sort_order and fix formatting of error messages

* Make MetaFieldRanker more robust. Doesn't crash pipeline if some Documents are missing keys.

* Don't print same warning message twice

* Add another test

* Making MetaFieldRanker more robust

* Move up if return statement to earlier in the function

* Setting up infer_type

* Remove infer_type for now

* Release notes

* Add init file

* Update releasenotes/notes/metafieldranker_sort-order_refactor-2000d89dc40dc15a.yaml

Co-authored-by: Stefano Fiorucci <[email protected]>

---------

Co-authored-by: Stefano Fiorucci <[email protected]>
Prior to this change, this broke `pytest` workflows in VSCode due to identical test names in this file and the integration/unit test file.
…r` (#6744)

* rename model_name_or_path to simply model

* fix tests

* reno
* feat: Framework-agnostic device management

* Add release note

* Linting

* Fix test

* Add `first_device` property, expand release notes, validate `ComponentDevice` state
* Fix lazy import in HuggingFaceLocalGenerator

* Fix pylint

* Import fix after merge
* feat-added-split-by-page-to-DocumentSplitter

* added test case and the suggested changes

* Update document_splitter.py

* Update haystack/components/preprocessors/document_splitter.py

* Update test_document_splitter.py

---------

Co-authored-by: Sebastian Husch Lee <[email protected]>
* refactor: Move HF-specific model serde code to a new submodule.

* Remove unused import
* Speedup tests for PyPDFToDocument

* Added unit test and removed skipping of empty pages

* add release note

* Add back some integration marks
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](actions/cache@v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@CLAassistant
Copy link

CLAassistant commented Mar 14, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
7 out of 9 committers have signed the CLA.

✅ silvanocerza
✅ dfokina
✅ shadeMe
✅ davidsbatista
✅ dcbark01
✅ Lord-Haji
✅ awinml
❌ bilgeyucel
❌ wochinge
You have signed the CLA already but the status is still pending? Let us recheck it.

@AnushreeBannadabhavi
Copy link
Contributor Author

I accidentally did a force push. I'll just close this PR and open a new one instead of reverting the push

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment