proposal: `Embedders` design #5390

anakin87 · 2023-07-19T16:31:17Z

Related Issues

related: Implement Embedders components (2.0) #5312

This proposal aims to define the Embedder design in Haystack v2.
Let's discuss...

vblagoje

The main observation and needing to be confident what is the right path I'm divided between having two embedding classes for embedder per model type: i.e _Model_TextEmbedder and _Model_DocumentEmbedder. Or one Embedder that handles both str and Document list. Can we discuss pros/cons for this issue?

ZanSara · 2023-07-21T14:14:01Z

@vblagoje: me and @anakin87 had quite a few exchanges on this, you can find them back on the Notion page https://www.notion.so/deepsetai/Embedders-design-aae6bf8628ee4bf59a6779703ce3fb06

TL;DR: the two have quite a different interface, so mixing them would be awkward.

To be specific: a DocumentEmbedder has a lot more parameters about what exactly to embed of the incoming document, like metadata, how much of the metadata, how to embed meta and content together (prefixing the meta? Or suffixing it? How to serialize the meta before attaching it to the content? etc etc). While the basic Embedder that accepts only text needs none of those parameters. So merging the two interfaces is convenient only in indexing, while it's very noisy for every other usecase that requires only a simple embedder of strings, like querying.

On top of this, both components are going to be quite small, because they just use the Embedder class, which is not a component, and the Embedder class will take care of reusing the models. In this scenario, components are just "interfaces" between the Pipeline and the Embedder, so it makes sense that we make one for each scenario.

I hope that explains our reasoning! Is there some pitfall we haven't considered in your opinion? The only one I imagined was class proliferation, but after seeing the list of expected components, I'm reassured we're not going to have an excessive number of them. That's subjective though.

TuanaCelik

Number 1: I love this proposal. Here are a few things that will make it a lot more intuitive to people, which are mainly an agreement to what you've already pointed out:

No longer having to use the retriever component to embed documents to document stores
Makes it easier to understand and see the relevance of the 'embedding' concept and the difference of it in the indexing vs querying pipelines.

But here's one comment to think about from my side:
I don't think the point of having an Embedder class vs an Embedder component that wraps that class comes across in this proposal. I already have this question: What's the purpose of this distinction. I'm sure you have one, but the 2 concepts are a bit too mixed together in the current proposal IMHO. I would predict that if we loosely refer to both as embedder as in this proposal we will have a hard time explaining to people what is what and how to create your own embedded class vs component.

anakin87 · 2023-07-24T13:19:08Z

Hey @TuanaCelik... Thanks for your comments!

the HFBasicEmbedder class is for internal use only. (We could also think of naming it _HFBasicEmbedder). It takes care of creating and reusing instances to not waste memory.
the user should only use HFTextEmbedder for queries and HFDocumentEmbedder for Documents.

I will update the proposal to better clarify this point...

silvanocerza · 2023-07-24T13:47:28Z

I do partly agree with Tuana, it might be confusing having a class HFBasicEmbedder that is not a component.
Also we must make it clear that is not part of the public API and only an implementation detail.

masci

I don't fully understand the need of a base class that's not a component, so apologies if my comments were already discussed. I find the concept of a basic, non-component version of the embedder hard to grasp, I hope we can find another way.

proposals/text/5390-embedders.md

anakin87 · 2023-07-27T08:48:28Z

After a discussion with @ZanSara, @masci, and @julian-risch, I made revisions to the proposal to improve it.
Feel free to review it...

TuanaCelik · 2023-07-27T10:28:26Z

@anakin87 I really prefer the distinction of the EmbedderService vs the Embedder in the updated proposal.
Just one heads up. The Output in each component is missing : :)

proposals/text/5390-embedders.md

vblagoje · 2023-07-30T19:24:55Z

This proposal is very clear and seems to take the best path forward; great work all!

julian-risch

The proposal looks very good to me already! 👍 Going forward and for further clarification, it would be great if you address two small things: What if a user has indexed documents in the store and now want's to update the embeddings by calculating them with a different model instead. Ideally we don't need to run the full indexing pipeline with the preprocessing again.
For example, with the current Haystack implementation the user would run document_store.update_embeddings. What's your plan for v2?
For the EmbeddingService singleton, I'd suggest to get feedback from the platform team too to better understand how reusing the service across different pipelines affects them and how much flexibility they need. For speeding up indexing they sometimes might want to run the same model multiple times in parallel for just one pipeline.

bilgeyucel · 2023-07-31T09:47:28Z

Thank you for the detailed proposal, @anakin87! After reading the proposal and the conversation above, it makes sense to have TextEmbedder and DocumentEmbedder separately. I have one question here, not a big one, though. Do you foresee a different use case for _model_TextEmbedder apart from a query pipeline? If not, would it make sense to rename it as _model_QueryEmbedder?
My initial reason for this is that DocumentEmbedder name makes sense as I use it for Documents but TextEmbedder sounds like it can be used for embedding anything.

anakin87 · 2023-07-31T10:26:54Z

What if a user has indexed documents in the store and now want's to update the embeddings by calculating them with a different model instead. Ideally we don't need to run the full indexing pipeline with the preprocessing again. For example, with the current Haystack implementation the user would run document_store.update_embeddings. What's your plan for v2?

Based on the current MemoryDocumentStore implementation, I can imagine something like this:

# get all the documents
docs = memory_document_store.filter_documents()

# compute the embedding with the new model
new_embedder = HFDocumentEmbedder(model_name="new-model")
docs_with_embeddings = new_embedder.run(documents=docs)

# overwrite the documents
memory_document_store.write_documents(documents=docs_with_embeddings, policy=DuplicatePolicy.OVERWRITE)

@julian-risch does this approach seem reasonable or do you have any suggestions for improving it? Should I add this point to the proposal?

For the EmbeddingService singleton, I'd suggest to get feedback from the platform team too to better understand how reusing the service across different pipelines affects them and how much flexibility they need. For speeding up indexing they sometimes might want to run the same model multiple times in parallel for just one pipeline.

I asked the platform team for feedback!

masci · 2023-07-31T10:48:45Z

@anakin87 let's add the migration snippet from your previous comment to the proposal. You already mention migration in the "Drawbacks" section, that's a good place to add it.

anakin87 · 2023-07-31T10:53:13Z

Do you foresee a different use case for _model_TextEmbedder apart from a query pipeline? If not, would it make sense to rename it as _model_QueryEmbedder?
My initial reason for this is that DocumentEmbedder name makes sense as I use it for Documents but TextEmbedder sounds like it can be used for embedding anything.

@bilgeyucel thanks for sharing your observation. I understand that the naming in the proposal may not be straightforward for users.

We considered the current naming with the understanding that in multimodal retrieval, the query may not be textual. It may make sense to have several (query) embedders: _model_TextEmbedder, _model_TableEmbedder, _model_ImageEmbedder...

If you have any better ideas for the naming, taking all factors into consideration, please suggest them!

julian-risch

Looks good to me! 👍 Thanks for including the update_embeddings functionality in the proposal. Depending on how frequently it is used, we might think about a simpler way again later. For example, in the old implementation we didn't need to have the embeddings of all the documents in memory.
Thank you also for reaching out to platform for feedback on the singleton. From my side, nothing is blocking the proposal and I hope some work can be scheduled for the next sprint already! 🚀

vblagoje

LGTM, great team work, awesome lead @anakin87

masci

💯

ArzelaAscoIi

🚀 Looks good to me :)

mathislucka · 2023-08-02T11:13:12Z

I have a few more things to add. I'm not sure if some of these should end up in other components but I want to leave them here in case they are relevant.

How do we ensure that query and document embedding match?
In most cases, the query and the document should be embedded with the same model. This setup might make it easier to accidentally embed the query with a model that is different from the model that is used for the document. It would be great if this could be validated somehow. It's a subtle mistake to make but I could see Haystack users struggling to find the cause of bad pipeline performance, until they realize, that they changed the embedding model on the query pipeline but not in indexing or vice versa.
What about storing multiple embeddings for a document?
Multi-embedding retrieval is gaining more traction and there are many cases where it might make sense to store multiple embeddings per document (multimodal, embed a summary and the full document, embed other metadata). If the embedder can only store the embedding in a documents embedding field, multiple embeddings won't be possible.
Templates for embedders
Sometimes, it is important to embed metadata with the content of a document or to pre- or post-fix the document or query with a string. As an example, https://huggingface.co/intfloat/e5-large-v2 requires to pre-pend question/passage before the query and document: question: {{query}} and passage: {{document.content}}. In other cases, we might want to embed meta data in a meaningful way. For example, This article talks about {{doc.meta["company"]}}, it was published on {{doc.meta["publication_date"]}}. Here is the article's content: {{doc.content}}. It would be great if this kind of flexibility could be incorporated into the proposal.

ZanSara · 2023-08-04T07:51:24Z

Hey @mathislucka, thank you for your comments! I'll let @anakin87 reply to you in details, while I am especially curious about point 2. It is currently not doable with Haystack v1, right?

It would be amazing to have multi-embedding support in v2, but I wonder how many document store engines will actually support that, and how. Do you know of some that surely do? Some specific use cases we can use to understand the requirements better? I am going to open a Notion page to discuss this feature separately, as I believe it may be worth its own small proposal.

https://www.notion.so/deepsetai/Multi-Embedding-Support-1e8b3ed738744218bdb2b2c53edc43a3

anakin87 · 2023-08-04T09:05:01Z

Hey @mathislucka... Thanks for your observations.

About point 2 (multiple embeddings per Document): I am not under the impression that this feature is becoming increasingly popular or that many vector DBs support it, but let's discuss the potential use cases on the Notion page.

How do we ensure that query and document embedding match?
I understand your point.
The idea behind the proposal (and behind some aspects of V2) is to make things less magical and more explicit. So, we hope to help the user to become more aware of what he or she is doing.
In this sense, I wouldn't validate this aspect...
Templates for embedders
All of your examples make sense and are useful to guide development.
Most of them can be incorporated in the preparation work done by the DocumentEmbedder to get a customized string to embed, starting from the Document.
I don't know the implementation details yet, but we will work in that direction...

masci · 2023-08-04T09:17:32Z

The idea behind the proposal (and behind some aspects of V2) is to make things less magical and more explicit. So, we hope to help the user to become more aware of what he or she is doing.

The idea is also to provide "high level" objects in Haystack that wrap these "low level" components with guarded defaults (this is some sort of design strategy we want to apply to other abstractions too, like pipelines). So imagine a DPRDefaultEmbedder component available in the Haystack API that hides all of the embedder details, included the model (a bit extreme but say we pick a generic one for our users). Users will be exposed to the lower level details only when they need to tweak or customize this default component, but at that point we can assume a certain degree of awareness and being explicit in our API will be an advantage.

mathislucka · 2023-08-08T10:33:44Z

The idea is also to provide "high level" objects in Haystack that wrap these "low level" components with guarded defaults (this is some sort of design strategy we want to apply to other abstractions too, like pipelines). So imagine a DPRDefaultEmbedder component available in the Haystack API that hides all of the embedder details, included the model (a bit extreme but say we pick a generic one for our users). Users will be exposed to the lower level details only when they need to tweak or customize this default component, but at that point we can assume a certain degree of awareness and being explicit in our API will be an advantage.

I understand the idea and I think it's a good direction but I'm not sure how it would help that specific validation issue?

If document and query embedder are split up and users can choose a model for each one of them, then I don't really understand how to provide a default component that would avoid the trap of accidentally selecting 2 different models. Unless they can't select any model at all (DefaultDocumentEmbedder and DefaultQueryEmbedder) but I think that would be too restrictive even as a first entry point.

ZanSara · 2023-08-09T08:05:50Z

I understand the idea and I think it's a good direction but I'm not sure how it would help that specific validation issue?
If document and query embedder are split up and users can choose a model for each one of them, then I don't really understand how to provide a default component that would avoid the trap of accidentally selecting 2 different models.

While I understand there is a shift of paradigm and we're asking a bit more from the users, on the other end I have the impression this is not an issue that good documentation and meaningful error messages can't fix. After all, users can still do this mistake in v1 by simply not using the retriever in indexing: often we have to remind people to add it, because it's so unintuitive it looks like a mistake in the docs. This new design asks the users to know they need the same model, that's true, but at least it's going to be easier for them to understand how to use the components.

There are ways to fix this specific problem though: they just don't concern the Embedders. We plan to overhaul our design of the Document classes due to other limitations, so let's take that opportunity to remove the problem entirely: we should make the Documents remember what they were embedded by. This will then allow Retriever to check whether query and documents were embedded using the same model.

Does that sound viable?

mathislucka · 2023-08-09T14:05:08Z

@ZanSara yes that sounds very good. And I don't have anything against putting the burden on the user to select 2 models that match, I just think some kind of validation/warning would make it easier for them to debug a pipeline. Embeddings that have matching dimensions but come from different models will absolutely "work" but the results are going to be really bad and then the user might not figure out why. Your approach sounds very good though and it will enable this type of warning/validation 👍

* first draft * rename * refinements * added clarifications * improvements * improvements * improvements * further improvements * fix typo * Apply suggestions from code review Co-authored-by: Massimiliano Pippi <[email protected]> * adapt to new Canals I/O * fix links to previous proposals * fix * add migration example: update_embeddings * rename EmbeddingService to EmbeddingBackend --------- Co-authored-by: Massimiliano Pippi <[email protected]>

first draft

a0b87ab

github-actions bot added the proposal label Jul 19, 2023

anakin87 added 2 commits July 19, 2023 18:32

rename

abe50d6

refinements

e3fc29f

anakin87 marked this pull request as ready for review July 20, 2023 09:05

anakin87 requested review from a team as code owners July 20, 2023 09:05

anakin87 requested review from vblagoje and removed request for a team July 20, 2023 09:05

sjrl self-requested a review July 21, 2023 08:37

vblagoje reviewed Jul 21, 2023

View reviewed changes

anakin87 requested review from masci, silvanocerza, julian-risch and TuanaCelik July 24, 2023 07:09

anakin87 self-assigned this Jul 24, 2023

TuanaCelik reviewed Jul 24, 2023

View reviewed changes

added clarifications

bdc8ae1

masci reviewed Jul 24, 2023

View reviewed changes

proposals/text/5390-embedders.md Outdated Show resolved Hide resolved

proposals/text/5390-embedders.md Outdated Show resolved Hide resolved

anakin87 added 4 commits July 26, 2023 14:24

improvements

c954da4

improvements

4eda54f

improvements

912c93a

further improvements

6a6e267

fix typo

5520ad9

masci suggested changes Jul 27, 2023

View reviewed changes

anakin87 requested review from masci and vblagoje July 27, 2023 13:49

julian-risch requested changes Jul 31, 2023

View reviewed changes

fix links to previous proposals

b2142d5

fix

d4fae56

add migration example: update_embeddings

c488a12

julian-risch approved these changes Jul 31, 2023

View reviewed changes

vblagoje approved these changes Aug 1, 2023

View reviewed changes

rename EmbeddingService to EmbeddingBackend

494d43b

masci approved these changes Aug 1, 2023

View reviewed changes

ArzelaAscoIi approved these changes Aug 1, 2023

View reviewed changes

agnieszka-m approved these changes Aug 8, 2023

View reviewed changes

anakin87 merged commit 52133d3 into main Aug 9, 2023

anakin87 deleted the proposal-embedders branch August 9, 2023 15:09

ZanSara mentioned this pull request Aug 10, 2023

Migrate Components to Pipeline v2 #5265

Closed

Timoeller added the showcase label Aug 14, 2023

anakin87 mentioned this pull request Sep 13, 2023

Investigate the utility of Embedding Templates #5793

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: `Embedders` design #5390

proposal: `Embedders` design #5390

anakin87 commented Jul 19, 2023

vblagoje left a comment •

edited

Loading

ZanSara commented Jul 21, 2023

TuanaCelik left a comment

anakin87 commented Jul 24, 2023

silvanocerza commented Jul 24, 2023

masci left a comment

anakin87 commented Jul 27, 2023

TuanaCelik commented Jul 27, 2023

vblagoje commented Jul 30, 2023

julian-risch left a comment

bilgeyucel commented Jul 31, 2023

anakin87 commented Jul 31, 2023

masci commented Jul 31, 2023 •

edited

Loading

anakin87 commented Jul 31, 2023

julian-risch left a comment

vblagoje left a comment

masci left a comment

ArzelaAscoIi left a comment

mathislucka commented Aug 2, 2023

ZanSara commented Aug 4, 2023 •

edited

Loading

anakin87 commented Aug 4, 2023

masci commented Aug 4, 2023

mathislucka commented Aug 8, 2023

ZanSara commented Aug 9, 2023

mathislucka commented Aug 9, 2023

proposal: Embedders design #5390

proposal: Embedders design #5390

Conversation

anakin87 commented Jul 19, 2023

Related Issues

vblagoje left a comment • edited Loading

Choose a reason for hiding this comment

ZanSara commented Jul 21, 2023

TuanaCelik left a comment

Choose a reason for hiding this comment

anakin87 commented Jul 24, 2023

silvanocerza commented Jul 24, 2023

masci left a comment

Choose a reason for hiding this comment

anakin87 commented Jul 27, 2023

TuanaCelik commented Jul 27, 2023

vblagoje commented Jul 30, 2023

julian-risch left a comment

Choose a reason for hiding this comment

bilgeyucel commented Jul 31, 2023

anakin87 commented Jul 31, 2023

masci commented Jul 31, 2023 • edited Loading

anakin87 commented Jul 31, 2023

julian-risch left a comment

Choose a reason for hiding this comment

vblagoje left a comment

Choose a reason for hiding this comment

masci left a comment

Choose a reason for hiding this comment

ArzelaAscoIi left a comment

Choose a reason for hiding this comment

mathislucka commented Aug 2, 2023

ZanSara commented Aug 4, 2023 • edited Loading

anakin87 commented Aug 4, 2023

masci commented Aug 4, 2023

mathislucka commented Aug 8, 2023

ZanSara commented Aug 9, 2023

mathislucka commented Aug 9, 2023

proposal: `Embedders` design #5390

proposal: `Embedders` design #5390

vblagoje left a comment •

edited

Loading

masci commented Jul 31, 2023 •

edited

Loading

ZanSara commented Aug 4, 2023 •

edited

Loading