Check document_store and embedding_model dimensions before calculating embeddings #5188

AlexGWOmron · 2023-06-22T01:24:53Z

Is your feature request related to a problem? Please describe.
When running document_store.update_embeddings(retriever=embedding_retriever), embeddings will first be calculated then saved to the doc store.

However, if the embedding dimensions differ, you only get an error after the embeddings have been calculated, so that computation is wasted. E.g.:

RuntimeError: Embedding dimensions of the model (1024) don't match the embedding dimensions of the document store (768). Initiate FAISSDocumentStore again with arg embedding_dim=1024.

This is more of a problem when you pay for the embedding calculations (e.g. OpenAI).

Describe the solution you'd like
A check against both document_store and embedding dimensions before running the embedding calculations.

julian-risch · 2023-07-05T08:11:01Z

@AlexGWOmron Thank you for this suggestion. I agree that it makes a lot of sense to check document_store and embedding dimensions before running the embedding calculations. Would you maybe like to contribute this feature and open a PR? We can give early feedback if you make it a draft PR. Guidelines are here 🙂

awinml · 2023-07-20T20:15:22Z

@julian-risch I would like to work on this issue. What would be the preferred implementation that I should follow?

julian-risch · 2023-07-20T20:23:17Z

@awinml That's great to hear! You can find our general contributor guidelines here: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md
For this particular issue, I'd say at the beginning of the update_embeddings method, there should be a check for whether the embedding dimensions set in the document store and the embedding dimensions of the model used by the retriever for the embeddings are the same. This check could be implemented in a separate function that just gets a document store and a retriever as input, maybe? In case we can't infer the embedding dimensions from the model, the alternative would be to calculate the embeddings of a first document and then check whether its dimension is the same as the setting of the document store. That's still better than first calculating the embeddings for dozens of documents and then noticing that they can't be stored.
If you open a draft PR, we can give early feedback! No need to wait until you have a solution ready. Before the PR can be merged, we will need to have some unit tests but we can guide you there. Looking forward to your PR! 🙂
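
The check described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation that was eventually merged: `check_embedding_dims` and `infer_embedding_dim` are hypothetical helper names, the retriever-side `embedding_dim` attribute is assumed rather than known to exist, and `FakeStore` / `FakeRetriever` are stand-ins so the snippet runs without Haystack installed. It only relies on the document store exposing `embedding_dim` and the retriever exposing `embed_documents`.

```python
from typing import Any, List, Optional, Sequence


def infer_embedding_dim(retriever: Any, sample_documents: Sequence[Any]) -> Optional[int]:
    """Return the retriever's embedding size, or None if it cannot be determined."""
    # Assumption: some retrievers may declare their dimension up front.
    declared = getattr(retriever, "embedding_dim", None)
    if declared is not None:
        return int(declared)
    if not sample_documents:
        return None
    # Fallback: embed only the first document and measure the vector length.
    first_embedding = retriever.embed_documents(sample_documents[:1])[0]
    return len(first_embedding)


def check_embedding_dims(document_store: Any, retriever: Any, sample_documents: Sequence[Any]) -> None:
    """Raise early if the model and document store dimensions disagree."""
    model_dim = infer_embedding_dim(retriever, sample_documents)
    if model_dim is None:
        return  # Nothing to compare against, so skip the check.
    store_dim = document_store.embedding_dim
    if model_dim != store_dim:
        raise RuntimeError(
            f"Embedding dimensions of the model ({model_dim}) don't match the "
            f"embedding dimensions of the document store ({store_dim})."
        )


# --- Stand-ins for demonstration only (not Haystack classes) ---
class FakeStore:
    def __init__(self, embedding_dim: int) -> None:
        self.embedding_dim = embedding_dim


class FakeRetriever:
    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.embedded = 0  # counts how many documents were actually embedded

    def embed_documents(self, documents: Sequence[Any]) -> List[List[float]]:
        self.embedded += len(documents)
        return [[0.0] * self.dim for _ in documents]


if __name__ == "__main__":
    docs = ["first doc", "second doc", "third doc"]
    retriever = FakeRetriever(dim=1024)
    try:
        check_embedding_dims(FakeStore(embedding_dim=768), retriever, docs)
    except RuntimeError as err:
        print(err)
    print("documents embedded before failing:", retriever.embedded)
```

Calling such a helper at the start of update_embeddings would mean that, in the mismatch case, at most one document is embedded before the error is raised, instead of the whole corpus.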

awinml · 2023-07-21T21:26:20Z

@julian-risch Thank you for the detailed explanation. I understand the overall implementation suggestion and will open a draft PR shortly. I'll make sure to follow the contributor guidelines and include unit tests as well.

AnushreeBannadabhavi · 2024-02-26T16:46:00Z

Hi @julian-risch! I'd like to work on this if it's still open.

awinml · 2024-02-26T17:37:41Z

Hi @AnushreeBannadabhavi, I am working on other issues at the moment, feel free to take it up.

anakin87 · 2024-03-14T07:32:00Z

Done in #7357 for FAISS.

Haystack 1.x is entering a Long Term Support phase: we will keep it working and fix bugs,
but the development effort will focus on the recently released 2.0.

Therefore, I would not change any other document stores and would close this issue.

julian-risch added Contributions wanted! Looking for external contributions topic:document_store labels Jul 5, 2023

julian-risch added topic:retriever P2 Medium priority, add to the next sprint if no P1 available labels Jul 5, 2023

masci removed the P2 Medium priority, add to the next sprint if no P1 available label Dec 19, 2023

anakin87 added this to Haystack - Contributions wanted Feb 10, 2024

anakin87 added the 1.x label Feb 16, 2024

This was referenced Mar 4, 2024

feat: check document store and retriever dimensions before calculating embeddings for all documents #7292

Closed

feat: check document store and retriever dimensions before calculating embeddings for all documents #7323

Closed

AnushreeBannadabhavi mentioned this issue Mar 14, 2024

feat: check document store and retriever dimensions before calculating embeddings for all documents #7357

Merged

anakin87 closed this as completed Mar 14, 2024

github-project-automation bot moved this to Done in Haystack - Contributions wanted Mar 14, 2024
