Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Embedders design #5390

Merged
merged 15 commits into from
Aug 9, 2023
Merged

proposal: Embedders design #5390

merged 15 commits into from
Aug 9, 2023

Conversation

anakin87
Copy link
Member

Related Issues

This proposal aims to define the Embedder design in Haystack v2.
Let's discuss...

@anakin87 anakin87 marked this pull request as ready for review July 20, 2023 09:05
@anakin87 anakin87 requested review from a team as code owners July 20, 2023 09:05
@anakin87 anakin87 requested review from vblagoje and removed request for a team July 20, 2023 09:05
@sjrl sjrl self-requested a review July 21, 2023 08:37
Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main observation and needing to be confident what is the right path I'm divided between having two embedding classes for embedder per model type: i.e _Model_TextEmbedder and _Model_DocumentEmbedder. Or one Embedder that handles both str and Document list. Can we discuss pros/cons for this issue?

@ZanSara
Copy link
Contributor

ZanSara commented Jul 21, 2023

@vblagoje: me and @anakin87 had quite a few exchanges on this, you can find them back on the Notion page https://www.notion.so/deepsetai/Embedders-design-aae6bf8628ee4bf59a6779703ce3fb06

TL;DR: the two have quite a different interface, so mixing them would be awkward.

To be specific: a DocumentEmbedder has a lot more parameters about what exactly to embed of the incoming document, like metadata, how much of the metadata, how to embed meta and content together (prefixing the meta? Or suffixing it? How to serialize the meta before attaching it to the content? etc etc). While the basic Embedder that accepts only text needs none of those parameters. So merging the two interfaces is convenient only in indexing, while it's very noisy for every other usecase that requires only a simple embedder of strings, like querying.

On top of this, both components are going to be quite small, because they just use the Embedder class, which is not a component, and the Embedder class will take care of reusing the models. In this scenario, components are just "interfaces" between the Pipeline and the Embedder, so it makes sense that we make one for each scenario.

I hope that explains our reasoning! Is there some pitfall we haven't considered in your opinion? The only one I imagined was class proliferation, but after seeing the list of expected components, I'm reassured we're not going to have an excessive number of them. That's subjective though.

Copy link
Contributor

@TuanaCelik TuanaCelik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number 1: I love this proposal. Here are a few things that will make it a lot more intuitive to people, which are mainly an agreement to what you've already pointed out:

  • No longer having to use the retriever component to embed documents to document stores
  • Makes it easier to understand and see the relevance of the 'embedding' concept and the difference of it in the indexing vs querying pipelines.

But here's one comment to think about from my side:
I don't think the point of having an Embedder class vs an Embedder component that wraps that class comes across in this proposal. I already have this question: What's the purpose of this distinction. I'm sure you have one, but the 2 concepts are a bit too mixed together in the current proposal IMHO. I would predict that if we loosely refer to both as embedder as in this proposal we will have a hard time explaining to people what is what and how to create your own embedded class vs component.

@anakin87
Copy link
Member Author

Hey @TuanaCelik... Thanks for your comments!

  • the HFBasicEmbedder class is for internal use only. (We could also think of naming it _HFBasicEmbedder). It takes care of creating and reusing instances to not waste memory.
  • the user should only use HFTextEmbedder for queries and HFDocumentEmbedder for Documents.

I will update the proposal to better clarify this point...

@silvanocerza
Copy link
Contributor

I do partly agree with Tuana, it might be confusing having a class HFBasicEmbedder that is not a component.
Also we must make it clear that is not part of the public API and only an implementation detail.

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the need of a base class that's not a component, so apologies if my comments were already discussed. I find the concept of a basic, non-component version of the embedder hard to grasp, I hope we can find another way.

proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
@anakin87
Copy link
Member Author

After a discussion with @ZanSara, @masci, and @julian-risch, I made revisions to the proposal to improve it.
Feel free to review it...

@TuanaCelik
Copy link
Contributor

@anakin87 I really prefer the distinction of the EmbedderService vs the Embedder in the updated proposal.
Just one heads up. The Output in each component is missing : :)

proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
proposals/text/5390-embedders.md Outdated Show resolved Hide resolved
@anakin87 anakin87 requested review from masci and vblagoje July 27, 2023 13:49
@vblagoje
Copy link
Member

This proposal is very clear and seems to take the best path forward; great work all!

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal looks very good to me already! 👍 Going forward and for further clarification, it would be great if you address two small things: What if a user has indexed documents in the store and now want's to update the embeddings by calculating them with a different model instead. Ideally we don't need to run the full indexing pipeline with the preprocessing again.
For example, with the current Haystack implementation the user would run document_store.update_embeddings. What's your plan for v2?
For the EmbeddingService singleton, I'd suggest to get feedback from the platform team too to better understand how reusing the service across different pipelines affects them and how much flexibility they need. For speeding up indexing they sometimes might want to run the same model multiple times in parallel for just one pipeline.

@bilgeyucel
Copy link
Contributor

Thank you for the detailed proposal, @anakin87! After reading the proposal and the conversation above, it makes sense to have TextEmbedder and DocumentEmbedder separately. I have one question here, not a big one, though. Do you foresee a different use case for _model_TextEmbedder apart from a query pipeline? If not, would it make sense to rename it as _model_QueryEmbedder?
My initial reason for this is that DocumentEmbedder name makes sense as I use it for Documents but TextEmbedder sounds like it can be used for embedding anything.

@anakin87
Copy link
Member Author

What if a user has indexed documents in the store and now want's to update the embeddings by calculating them with a different model instead. Ideally we don't need to run the full indexing pipeline with the preprocessing again. For example, with the current Haystack implementation the user would run document_store.update_embeddings. What's your plan for v2?

Based on the current MemoryDocumentStore implementation, I can imagine something like this:

# get all the documents
docs = memory_document_store.filter_documents()

# compute the embedding with the new model
new_embedder = HFDocumentEmbedder(model_name="new-model")
docs_with_embeddings = new_embedder.run(documents=docs)

# overwrite the documents
memory_document_store.write_documents(documents=docs_with_embeddings, policy=DuplicatePolicy.OVERWRITE)

@julian-risch does this approach seem reasonable or do you have any suggestions for improving it? Should I add this point to the proposal?

For the EmbeddingService singleton, I'd suggest to get feedback from the platform team too to better understand how reusing the service across different pipelines affects them and how much flexibility they need. For speeding up indexing they sometimes might want to run the same model multiple times in parallel for just one pipeline.

I asked the platform team for feedback!

@masci
Copy link
Contributor

masci commented Jul 31, 2023

@anakin87 let's add the migration snippet from your previous comment to the proposal. You already mention migration in the "Drawbacks" section, that's a good place to add it.

@anakin87
Copy link
Member Author

Do you foresee a different use case for _model_TextEmbedder apart from a query pipeline? If not, would it make sense to rename it as _model_QueryEmbedder?
My initial reason for this is that DocumentEmbedder name makes sense as I use it for Documents but TextEmbedder sounds like it can be used for embedding anything.

@bilgeyucel thanks for sharing your observation. I understand that the naming in the proposal may not be straightforward for users.

We considered the current naming with the understanding that in multimodal retrieval, the query may not be textual. It may make sense to have several (query) embedders: _model_TextEmbedder, _model_TableEmbedder, _model_ImageEmbedder...

If you have any better ideas for the naming, taking all factors into consideration, please suggest them!

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! 👍 Thanks for including the update_embeddings functionality in the proposal. Depending on how frequently it is used, we might think about a simpler way again later. For example, in the old implementation we didn't need to have the embeddings of all the documents in memory.
Thank you also for reaching out to platform for feedback on the singleton. From my side, nothing is blocking the proposal and I hope some work can be scheduled for the next sprint already! 🚀

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great team work, awesome lead @anakin87

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Member

@ArzelaAscoIi ArzelaAscoIi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Looks good to me :)

@mathislucka
Copy link
Member

I have a few more things to add. I'm not sure if some of these should end up in other components but I want to leave them here in case they are relevant.

  1. How do we ensure that query and document embedding match?
    In most cases, the query and the document should be embedded with the same model. This setup might make it easier to accidentally embed the query with a model that is different from the model that is used for the document. It would be great if this could be validated somehow. It's a subtle mistake to make but I could see Haystack users struggling to find the cause of bad pipeline performance, until they realize, that they changed the embedding model on the query pipeline but not in indexing or vice versa.

  2. What about storing multiple embeddings for a document?
    Multi-embedding retrieval is gaining more traction and there are many cases where it might make sense to store multiple embeddings per document (multimodal, embed a summary and the full document, embed other metadata). If the embedder can only store the embedding in a documents embedding field, multiple embeddings won't be possible.

  3. Templates for embedders
    Sometimes, it is important to embed metadata with the content of a document or to pre- or post-fix the document or query with a string. As an example, https://huggingface.co/intfloat/e5-large-v2 requires to pre-pend question/passage before the query and document: question: {{query}} and passage: {{document.content}}. In other cases, we might want to embed meta data in a meaningful way. For example, This article talks about {{doc.meta["company"]}}, it was published on {{doc.meta["publication_date"]}}. Here is the article's content: {{doc.content}}. It would be great if this kind of flexibility could be incorporated into the proposal.

@ZanSara
Copy link
Contributor

ZanSara commented Aug 4, 2023

Hey @mathislucka, thank you for your comments! I'll let @anakin87 reply to you in details, while I am especially curious about point 2. It is currently not doable with Haystack v1, right?

It would be amazing to have multi-embedding support in v2, but I wonder how many document store engines will actually support that, and how. Do you know of some that surely do? Some specific use cases we can use to understand the requirements better? I am going to open a Notion page to discuss this feature separately, as I believe it may be worth its own small proposal.

https://www.notion.so/deepsetai/Multi-Embedding-Support-1e8b3ed738744218bdb2b2c53edc43a3

@anakin87
Copy link
Member Author

anakin87 commented Aug 4, 2023

Hey @mathislucka... Thanks for your observations.

About point 2 (multiple embeddings per Document): I am not under the impression that this feature is becoming increasingly popular or that many vector DBs support it, but let's discuss the potential use cases on the Notion page.

  1. How do we ensure that query and document embedding match?
    I understand your point.
    The idea behind the proposal (and behind some aspects of V2) is to make things less magical and more explicit. So, we hope to help the user to become more aware of what he or she is doing.
    In this sense, I wouldn't validate this aspect...

  2. Templates for embedders
    All of your examples make sense and are useful to guide development.
    Most of them can be incorporated in the preparation work done by the DocumentEmbedder to get a customized string to embed, starting from the Document.
    I don't know the implementation details yet, but we will work in that direction...

@masci
Copy link
Contributor

masci commented Aug 4, 2023

  1. The idea behind the proposal (and behind some aspects of V2) is to make things less magical and more explicit. So, we hope to help the user to become more aware of what he or she is doing.

The idea is also to provide "high level" objects in Haystack that wrap these "low level" components with guarded defaults (this is some sort of design strategy we want to apply to other abstractions too, like pipelines). So imagine a DPRDefaultEmbedder component available in the Haystack API that hides all of the embedder details, included the model (a bit extreme but say we pick a generic one for our users). Users will be exposed to the lower level details only when they need to tweak or customize this default component, but at that point we can assume a certain degree of awareness and being explicit in our API will be an advantage.

@mathislucka
Copy link
Member

The idea is also to provide "high level" objects in Haystack that wrap these "low level" components with guarded defaults (this is some sort of design strategy we want to apply to other abstractions too, like pipelines). So imagine a DPRDefaultEmbedder component available in the Haystack API that hides all of the embedder details, included the model (a bit extreme but say we pick a generic one for our users). Users will be exposed to the lower level details only when they need to tweak or customize this default component, but at that point we can assume a certain degree of awareness and being explicit in our API will be an advantage.

I understand the idea and I think it's a good direction but I'm not sure how it would help that specific validation issue?

If document and query embedder are split up and users can choose a model for each one of them, then I don't really understand how to provide a default component that would avoid the trap of accidentally selecting 2 different models. Unless they can't select any model at all (DefaultDocumentEmbedder and DefaultQueryEmbedder) but I think that would be too restrictive even as a first entry point.

@ZanSara
Copy link
Contributor

ZanSara commented Aug 9, 2023

I understand the idea and I think it's a good direction but I'm not sure how it would help that specific validation issue?
If document and query embedder are split up and users can choose a model for each one of them, then I don't really understand how to provide a default component that would avoid the trap of accidentally selecting 2 different models.

While I understand there is a shift of paradigm and we're asking a bit more from the users, on the other end I have the impression this is not an issue that good documentation and meaningful error messages can't fix. After all, users can still do this mistake in v1 by simply not using the retriever in indexing: often we have to remind people to add it, because it's so unintuitive it looks like a mistake in the docs. This new design asks the users to know they need the same model, that's true, but at least it's going to be easier for them to understand how to use the components.

There are ways to fix this specific problem though: they just don't concern the Embedders. We plan to overhaul our design of the Document classes due to other limitations, so let's take that opportunity to remove the problem entirely: we should make the Documents remember what they were embedded by. This will then allow Retriever to check whether query and documents were embedded using the same model.

Does that sound viable?

@mathislucka
Copy link
Member

@ZanSara yes that sounds very good. And I don't have anything against putting the burden on the user to select 2 models that match, I just think some kind of validation/warning would make it easier for them to debug a pipeline. Embeddings that have matching dimensions but come from different models will absolutely "work" but the results are going to be really bad and then the user might not figure out why. Your approach sounds very good though and it will enable this type of warning/validation 👍

@anakin87 anakin87 merged commit 52133d3 into main Aug 9, 2023
@anakin87 anakin87 deleted the proposal-embedders branch August 9, 2023 15:09
DosticJelena pushed a commit to smartcat-labs/haystack that referenced this pull request Aug 23, 2023
* first draft

* rename

* refinements

* added clarifications

* improvements

* improvements

* improvements

* further improvements

* fix typo

* Apply suggestions from code review

Co-authored-by: Massimiliano Pippi <[email protected]>

* adapt to new Canals I/O

* fix links to previous proposals

* fix

* add migration example: update_embeddings

* rename EmbeddingService to EmbeddingBackend

---------

Co-authored-by: Massimiliano Pippi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ignore-for-release-notes PRs with this flag won't be included in the release notes. proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.