proposal: Embedders
design
#5390
My main observation, and the point where I need to be confident about the right path: I'm divided between having two embedder classes per model type (i.e. _Model_TextEmbedder and _Model_DocumentEmbedder) or one Embedder that handles both a str and a list of Documents. Can we discuss the pros and cons of each option?
@vblagoje: @anakin87 and I had quite a few exchanges on this; you can find them on the Notion page https://www.notion.so/deepsetai/Embedders-design-aae6bf8628ee4bf59a6779703ce3fb06 TL;DR: the two have quite different interfaces, so mixing them would be awkward. To be specific: a
On top of this, both components are going to be quite small, because they just use the Embedder class, which is not a component, and the Embedder class takes care of reusing the models. In this scenario, components are just "interfaces" between the Pipeline and the Embedder, so it makes sense to have one for each scenario.
I hope that explains our reasoning! Is there some pitfall we haven't considered, in your opinion? The only one I could think of was class proliferation, but after seeing the list of expected components, I'm reassured we're not going to have an excessive number of them. That's subjective though.
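To make the interface difference concrete, here is a minimal sketch of the split described above. Every name in it (`ToyEmbeddingBackend`, `ToyTextEmbedder`, `ToyDocumentEmbedder`, the simplified `Document`) is hypothetical and chosen for illustration; it is not the API from the proposal, and the "model" is a placeholder function rather than real inference.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class ToyEmbeddingBackend:
    """Hypothetical stand-in for the shared, non-component model wrapper."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Placeholder "model": a real backend would run inference here.
        return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


@dataclass
class Document:
    # Simplified Document, assumed for this sketch only.
    content: str
    embedding: Optional[List[float]] = None


class ToyTextEmbedder:
    """Query side: one string in, one vector out."""

    def __init__(self, backend: ToyEmbeddingBackend):
        self.backend = backend

    def run(self, text: str) -> Dict[str, List[float]]:
        return {"embedding": self.backend.embed([text])[0]}


class ToyDocumentEmbedder:
    """Indexing side: Documents in, the same Documents with vectors attached out."""

    def __init__(self, backend: ToyEmbeddingBackend):
        self.backend = backend

    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        vectors = self.backend.embed([d.content for d in documents])
        for doc, vec in zip(documents, vectors):
            doc.embedding = vec
        return {"documents": documents}


# Both thin components delegate to the same backend instance.
backend = ToyEmbeddingBackend("toy-model")
query_vector = ToyTextEmbedder(backend).run("hello")["embedding"]
docs = ToyDocumentEmbedder(backend).run([Document("hello"), Document("world!")])["documents"]
```

The two `run` methods differ in both input and output types, which is the awkwardness a single combined component would have to hide behind type checks.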
First of all: I love this proposal. Here are a few things that will make it a lot more intuitive to people, mostly in agreement with what you've already pointed out:
- No longer having to use the retriever component to embed documents into document stores
- It becomes easier to understand the relevance of the 'embedding' concept and how it differs between the indexing and querying pipelines.
But here's one comment to think about from my side:
I don't think the point of having an Embedder class vs. an Embedder component that wraps that class comes across in this proposal. I already have this question: what's the purpose of this distinction? I'm sure you have one, but the two concepts are a bit too mixed together in the current proposal IMHO. I would predict that if we loosely refer to both as "embedder", as this proposal does, we will have a hard time explaining to people which is which and how to create your own Embedder class vs. component.
Hey @TuanaCelik... Thanks for your comments!
I will update the proposal to better clarify this point...
I do partly agree with Tuana, it might be confusing having a class |
I don't fully understand the need for a base class that's not a component, so apologies if my comments were already discussed. I find the concept of a basic, non-component version of the embedder hard to grasp; I hope we can find another way.
After a discussion with @ZanSara, @masci, and @julian-risch, I made revisions to the proposal to improve it.
@anakin87 I really prefer the distinction of the EmbedderService vs the Embedder in the updated proposal.
This proposal is very clear and seems to take the best path forward; great work all!
The proposal looks very good to me already! 👍 Going forward and for further clarification, it would be great if you could address two small things. What if a user has indexed documents in the store and now wants to update the embeddings by calculating them with a different model? Ideally we don't need to run the full indexing pipeline with the preprocessing again.
For example, with the current Haystack implementation the user would run document_store.update_embeddings. What's your plan for v2?
For the EmbeddingService singleton, I'd suggest getting feedback from the platform team too, to better understand how reusing the service across different pipelines affects them and how much flexibility they need. To speed up indexing, they might sometimes want to run the same model multiple times in parallel for just one pipeline.
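One way to picture the trade-off between reuse and parallel copies is a small factory that caches backends by model and device but lets a caller opt out. This is only a sketch under assumptions: `ToyBackend`, `ToyBackendFactory`, and the `share` flag are hypothetical names, not part of the proposal.

```python
from typing import Dict, Tuple


class ToyBackend:
    """Hypothetical model wrapper; a real one would load weights here."""

    def __init__(self, model_name: str, device: str):
        self.model_name = model_name
        self.device = device


class ToyBackendFactory:
    # Cache of shared instances, keyed by (model_name, device).
    _instances: Dict[Tuple[str, str], ToyBackend] = {}

    @classmethod
    def get(cls, model_name: str, device: str = "cpu", share: bool = True) -> ToyBackend:
        if not share:
            # Caller explicitly wants a private copy, e.g. to run the
            # same model several times in parallel within one pipeline.
            return ToyBackend(model_name, device)
        key = (model_name, device)
        if key not in cls._instances:
            cls._instances[key] = ToyBackend(model_name, device)
        return cls._instances[key]


shared_a = ToyBackendFactory.get("toy-model")
shared_b = ToyBackendFactory.get("toy-model")
private = ToyBackendFactory.get("toy-model", share=False)
```

Here `shared_a` and `shared_b` are the same object, while `private` is a separate instance of the same model. Whether such an opt-out is enough for the platform team is exactly the question raised above.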
Thank you for the detailed proposal, @anakin87! After reading the proposal and the conversation above, it makes sense to have |
Based on the current
# get all the documents
docs = memory_document_store.filter_documents()
# compute the embedding with the new model
new_embedder = HFDocumentEmbedder(model_name="new-model")
docs_with_embeddings = new_embedder.run(documents=docs)
# overwrite the documents
memory_document_store.write_documents(documents=docs_with_embeddings, policy=DuplicatePolicy.OVERWRITE)
@julian-risch does this approach seem reasonable, or do you have any suggestions for improving it? Should I add this point to the proposal?
I asked the platform team for feedback!
@anakin87 let's add the migration snippet from your previous comment to the proposal. You already mention |
@bilgeyucel thanks for sharing your observation. I understand that the naming in the proposal may not be straightforward for users. We chose the current naming with the understanding that in multimodal retrieval, the query may not be textual. It may make sense to have several (query) embedders: _model_TextEmbedder, _model_TableEmbedder, _model_ImageEmbedder... If you have any better ideas for the naming, taking all factors into consideration, please suggest them!
Looks good to me! 👍 Thanks for including the update_embeddings functionality in the proposal. Depending on how frequently it is used, we might think about a simpler way again later. For example, in the old implementation we didn't need to have the embeddings of all the documents in memory.
Thank you also for reaching out to platform for feedback on the singleton. From my side, nothing is blocking the proposal and I hope some work can be scheduled for the next sprint already! 🚀
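As an illustration of the memory concern mentioned above, a re-embedding step could work batch by batch so that only one batch of new vectors exists at a time before being written back. This is a sketch with assumed names: the dict-based `store`, `update_embeddings`, and `toy_embed` are hypothetical stand-ins, and a real document store would page through documents instead of living in a dict.

```python
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def update_embeddings(
    store: Dict[str, dict],
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> int:
    """Re-embed every document in the store, one batch at a time.

    Returns the number of batches processed.
    """
    batches = 0
    for id_batch in batched(list(store), batch_size):
        vectors = embed([store[doc_id]["content"] for doc_id in id_batch])
        # Write each batch back immediately instead of collecting all vectors.
        for doc_id, vec in zip(id_batch, vectors):
            store[doc_id]["embedding"] = vec
        batches += 1
    return batches


def toy_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder for the new model: one-dimensional "embedding" = text length.
    return [[float(len(t))] for t in texts]


store = {"a": {"content": "one"}, "b": {"content": "three"}, "c": {"content": "seven!"}}
n_batches = update_embeddings(store, toy_embed, batch_size=2)
```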
LGTM, great team work, awesome lead @anakin87
💯
🚀 Looks good to me :)
I have a few more things to add. I'm not sure if some of these should end up in other components but I want to leave them here in case they are relevant.
Hey @mathislucka, thank you for your comments! I'll let @anakin87 reply to you in detail, but I am especially curious about point 2. It is currently not doable with Haystack v1, right? It would be amazing to have multi-embedding support in v2, but I wonder how many document store engines will actually support that, and how. Do you know of some that surely do? Are there specific use cases we can use to understand the requirements better? I am going to open a Notion page to discuss this feature separately, as I believe it may be worth its own small proposal. https://www.notion.so/deepsetai/Multi-Embedding-Support-1e8b3ed738744218bdb2b2c53edc43a3
Hey @mathislucka... Thanks for your observations. About point 2 (multiple embeddings per Document): I am not under the impression that this feature is becoming increasingly popular or that many vector DBs support it, but let's discuss the potential use cases on the Notion page.
The idea is also to provide "high level" objects in Haystack that wrap these "low level" components with guarded defaults (this is some sort of design strategy we want to apply to other abstractions too, like pipelines). So imagine a |
I understand the idea and I think it's a good direction but I'm not sure how it would help that specific validation issue? If document and query embedder are split up and users can choose a model for each one of them, then I don't really understand how to provide a default component that would avoid the trap of accidentally selecting 2 different models. Unless they can't select any model at all ( |
While I understand there is a paradigm shift and we're asking a bit more from the users, on the other hand I have the impression this is not an issue that good documentation and meaningful error messages can't fix. After all, users can still make this mistake in v1 by simply not using the retriever in indexing: often we have to remind people to add it, because it's so unintuitive that it looks like a mistake in the docs. This new design asks the users to know they need the same model, that's true, but at least it's going to be easier for them to understand how to use the components. There are ways to fix this specific problem though: they just don't concern the Embedders. We plan to overhaul our design of the Document classes due to other limitations, so let's take that opportunity to remove the problem entirely: we should make the Documents remember what they were embedded by. This will then allow the Retriever to check whether query and documents were embedded using the same model. Does that sound viable?
@ZanSara yes, that sounds very good. And I don't have anything against putting the burden on the user to select 2 models that match; I just think some kind of validation/warning would make it easier for them to debug a pipeline. Embeddings that have matching dimensions but come from different models will absolutely "work", but the results are going to be really bad and then the user might not figure out why. Your approach sounds very good though, and it will enable this type of warning/validation 👍
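The idea of Documents remembering their embedding model could look roughly like the sketch below. It is purely illustrative: the `embedding_model` field, `EmbeddedDocument`, and both helper functions are hypothetical names, since the Document redesign mentioned above had not been specified in this thread.

```python
import warnings
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmbeddedDocument:
    content: str
    embedding: List[float]
    # Hypothetical field: name of the model that produced `embedding`.
    embedding_model: Optional[str] = None


def find_model_mismatches(
    query_model: str, documents: List[EmbeddedDocument]
) -> List[EmbeddedDocument]:
    """Return documents known to be embedded by a model other than query_model."""
    return [d for d in documents if d.embedding_model not in (None, query_model)]


def check_before_retrieval(query_model: str, documents: List[EmbeddedDocument]) -> bool:
    """Warn when models differ; return True only if no mismatch was found."""
    mismatched = find_model_mismatches(query_model, documents)
    if mismatched:
        models = sorted({d.embedding_model for d in mismatched})
        warnings.warn(
            f"Query was embedded with {query_model!r}, but {len(mismatched)} "
            f"document(s) were embedded with {models}; results may be poor."
        )
    return not mismatched


docs = [
    EmbeddedDocument("a", [0.1, 0.2], embedding_model="model-x"),
    EmbeddedDocument("b", [0.3, 0.4], embedding_model="model-y"),
]
same_model_ok = check_before_retrieval("model-x", docs[:1])
mixed_models_ok = check_before_retrieval("model-x", docs)
```

A warning rather than an exception matches the point made above: mismatched embeddings with equal dimensions still "work", so the check mainly helps with debugging.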
* first draft
* rename
* refinements
* added clarifications
* improvements
* improvements
* improvements
* further improvements
* fix typo
* Apply suggestions from code review
Co-authored-by: Massimiliano Pippi <[email protected]>
* adapt to new Canals I/O
* fix links to previous proposals
* fix
* add migration example: update_embeddings
* rename EmbeddingService to EmbeddingBackend
---------
Co-authored-by: Massimiliano Pippi <[email protected]>
Related Issues
This proposal aims to define the Embedder design in Haystack v2.
Let's discuss...