Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: check document store and retriever dimensions before calculating embeddings for all documents #7357

Merged

Conversation

AnushreeBannadabhavi
Copy link
Contributor

Related Issues

Proposed Changes:

  • Currently, in haystack v1.x, when document_store.update_embeddings() gets called, the embeddings are calculated for all the documents and then saved to the document store. If the embedding dimension set in the document store differs from that of the retriever model, a runtime error is raised.
  • However, we don't need to calculate embeddings for all the documents in the document store to raise this error. This is especially an issue when using paid APIs (openai) or when the embedding calculation is time consuming.
  • A more efficient alternative is to calculate the embedding of a single document and verify that the embedding dimensions match between the document store and the retriever.

How did you test it?

  • Added two unit tests in test_faiss.py
  • Modified the basic_faq_pipeline.py to test the change.
    - Used FAISSDocumentStore instead of ElasticsearchDocumentStore. I instantiated the docustore without the embedding_dim parameter. Therefore the default value 768 is set in the docustore making the embedding dimension of the retriever and docustore different.

Checklist

@AnushreeBannadabhavi AnushreeBannadabhavi requested review from a team as code owners March 14, 2024 05:47
@AnushreeBannadabhavi AnushreeBannadabhavi requested review from dfokina and davidsbatista and removed request for a team March 14, 2024 05:47
@AnushreeBannadabhavi
Copy link
Contributor Author

@anakin87 Thank you for reviewing my PR 7323! I had to close it because I accidentally changed the upstream branch from v1.x to main. This PR has the same changes as the previous one.

@anakin87 anakin87 requested review from anakin87 and removed request for davidsbatista March 14, 2024 07:26
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thank you for this contribution!

@anakin87 anakin87 merged commit 553badc into deepset-ai:v1.x Mar 14, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants