Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check document_store and embedding_model dimensions before calculating embeddings #5188

Closed
AlexGWOmron opened this issue Jun 22, 2023 · 7 comments

Comments

@AlexGWOmron
Copy link

AlexGWOmron commented Jun 22, 2023

Is your feature request related to a problem? Please describe.
When running document_store.update_embeddings(retriever=embedding_retriever), embeddings will first be calculated then saved to the doc store.

However, if the embedding dimensions differ, you get an error and you have now calculated the embeddings wastefully. E.g.

RuntimeError: Embedding dimensions of the model (1024) don't match the embedding dimensions of the document store (768). Initiate FAISSDocumentStore again with arg embedding_dim=1024.

This is more of a problem when paying (e.g. open_ai).

Describe the solution you'd like
A check against both document_store and embedding dimensions before running the embedding calculations.

@julian-risch julian-risch added Contributions wanted! Looking for external contributions topic:document_store labels Jul 5, 2023
@julian-risch
Copy link
Member

julian-risch commented Jul 5, 2023

@AlexGWOmron Thank you for this suggestion. I agree that it makes a lot of sense to check document_store and embedding dimensions before running the embedding calculations. Would you maybe like to contribute this feature and open a PR? We can give early feedback if you make it a draft PR. Guidelines are here 🙂

@julian-risch julian-risch added topic:retriever P2 Medium priority, add to the next sprint if no P1 available labels Jul 5, 2023
@awinml
Copy link
Contributor

awinml commented Jul 20, 2023

@julian-risch I would like to work on this issue. What would be the preferred implementation that I should follow?

@julian-risch
Copy link
Member

@awinml That's great to hear! You can find our general contributor guidelines here: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md
For this particular issue, I'd say in the beginning of the update_embeddings method, their should be a check for whether the embedding dimensions set in the document store and the embedding dimensions of the model used by the retriever for the embeddings are the same. This check could be implemented in a separate function that just get's a document store and a retriever as input maybe? In case we can't infer the embedding dimensions from the model, the alternative would be to calculate the embeddings of a first document and then check whether it's the same as the setting of the document store. That's still better than first calculating the. embeddings for dozens of documents and then noticing that they can't be stored.
If you open a draft PR, we can give early feedback! No need to wait until you have a solution ready. Before the PR can be merged, we will need to have some unit tests but we can guide you there. Looking forward to your PR! 🙂

@awinml
Copy link
Contributor

awinml commented Jul 21, 2023

@julian-risch Thank you for the detailed explanation. I understand the overall implementation suggestion and will open a draft PR shortly. I'll make sure to follow the contributor guidelines and include unit tests as well.

@masci masci removed the P2 Medium priority, add to the next sprint if no P1 available label Dec 19, 2023
@anakin87 anakin87 added the 1.x label Feb 16, 2024
@AnushreeBannadabhavi
Copy link
Contributor

Hi @julian-risch! I'd like to work on this if it's still open.

@awinml
Copy link
Contributor

awinml commented Feb 26, 2024

Hi @AnushreeBannadabhavi, I am working on other issues at the moment, feel free to take it up.

@anakin87
Copy link
Member

Done in #7357 for FAISS.

Haystack 1.x is entering a Long Term Support phase: we will care to keep it working and solve bugs,
but the development effort will focus on the recently released 2.0.

Therefore, I would not change any other document repositories and close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

6 participants