proposal: Embedders design #5390

Merged: 15 commits, Aug 9, 2023
226 changes: 226 additions & 0 deletions proposals/text/5390-embedders.md
@@ -0,0 +1,226 @@
- Title: Embedders
- Decision driver: @anakin87
- Start Date: 2023-07-19
- Proposal PR: (fill in after opening the PR)

# Summary

As decided in the previous proposals ([Embedding Retriever](3558-embedding_retriever.md) and [DocumentStores and Retrievers](4370-document_stores_and_retrievers.md)), in Haystack V2 we want to introduce a new component: the Embedder.

**Separation of concerns**
- DocumentStores: store the Documents, their metadata and representations (vectors); they offer a CRUD API.
- Retrievers: retrieve Documents from the DocumentStores; each one is specific to the Store it uses and aware of it (e.g., MemoryRetriever for the MemoryDocumentStore). They will be commonly used in query pipelines (not in indexing pipelines).
- **Embedders**: encode a list of data points (strings, images, etc.) into a list of vectors (i.e., the embeddings) using a model. They are used both in indexing pipelines (to encode the Documents) and query pipelines (to encode the query).

*In the current implementation, the Embedder is part of Retriever, which is unintuitive and comes with several disadvantages (explained in the previous proposals).*

**This proposal aims to define the Embedder design.**

# Basic example

```python
from haystack import Pipeline
from haystack.components import (
    TxtConverter,
    PreProcessor,
    DocumentWriter,
    OpenAITextEmbedder,
    OpenAIDocumentEmbedder,
    MemoryRetriever,
    Reader,
)
from haystack.document_stores import MemoryDocumentStore
docstore = MemoryDocumentStore()

indexing_pipe = Pipeline()
indexing_pipe.add_store("document_store", docstore)
indexing_pipe.add_node("txt_converter", TxtConverter())
indexing_pipe.add_node("preprocessor", PreProcessor())
indexing_pipe.add_node("embedder", OpenAIDocumentEmbedder(model_name="text-embedding-ada-002"))
indexing_pipe.add_node("writer", DocumentWriter(store="document_store"))
indexing_pipe.connect("txt_converter", "preprocessor")
indexing_pipe.connect("preprocessor", "embedder")
indexing_pipe.connect("embedder", "writer")

indexing_pipe.run(...)

query_pipe = Pipeline()
query_pipe.add_store("document_store", docstore)
query_pipe.add_node("embedder", OpenAITextEmbedder(model_name="text-embedding-ada-002"))
query_pipe.add_node("retriever", MemoryRetriever(store="document_store", retrieval_method="embedding"))
query_pipe.add_node("reader", Reader(model_name="deepset/model-name"))
query_pipe.connect("embedder", "retriever")
query_pipe.connect("retriever", "reader")

results = query_pipe.run(...)
```

- The `OpenAITextEmbedder` uses OpenAI models to convert a list of strings into a list of vectors. It is used in the query pipeline to embed the query.
- The `OpenAIDocumentEmbedder` uses OpenAI models to enrich a list of Documents with the corresponding vectors (stored in the `embedding` field). It is used in the indexing pipeline to embed the Documents.
- The Retriever is no longer needed in the indexing pipeline.

# Motivation

The motivations behind this change were already provided in the previous proposals ([Embedding Retriever](3558-embedding_retriever.md) and [DocumentStores and Retrievers](4370-document_stores_and_retrievers.md)). Here is a summary:
- Retrievers shouldn't be responsible for embedding Documents.
- Currently, Retrievers have many parameters just to support and configure different underlying Encoders (≈ Embedders).
- Adding support for new embedding providers or strategies is difficult. It requires changing the Retriever code.

# Detailed design

## Handle queries and Documents
This is the most critical aspect of the design.

- When embedding queries, we expect the Embedder to transform a list of strings into a list of vectors.
- When embedding Documents, we expect the Embedder to enrich a list of Documents with the corresponding vectors (stored in the `embedding` field).
- For Documents, we may want to embed some meta fields in addition to the content. Therefore, the Embedder should also handle this preparation work, which involves joining all the relevant content into a string to be embedded.
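
The preparation step for Documents could look roughly like the sketch below. It is only an illustration: the `Document` dataclass is a simplified stand-in for Haystack's own class, and the names `prepare_texts`, `metadata_fields_to_embed` and `separator` are hypothetical, not part of this proposal.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    # Simplified stand-in for Haystack's Document, for illustration only.
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def prepare_texts(
    documents: List[Document],
    metadata_fields_to_embed: List[str],
    separator: str = "\n",
) -> List[str]:
    """Join the selected meta fields and the content into one string per Document."""
    texts = []
    for doc in documents:
        # Keep only the requested fields that are present and not None.
        meta_values = [
            str(doc.metadata[key])
            for key in metadata_fields_to_embed
            if doc.metadata.get(key) is not None
        ]
        texts.append(separator.join(meta_values + [doc.content]))
    return texts


docs = [
    Document(content="Embedders encode text into vectors.", metadata={"title": "Embedders"}),
    Document(content="Retrievers fetch Documents."),
]
texts = prepare_texts(docs, metadata_fields_to_embed=["title"])
print(texts)
```

The resulting strings are what the Embedder would actually encode; the original Documents stay untouched until the vectors are attached.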

There has been much discussion about this point.
@ZanSara formulated the following implementation idea, which I like.

We make three classes:

- A `BasicEmbedder`, which is NOT a component; it handles raw data and has a factory method to reuse instances
```python
from typing import Any, Dict, List

import numpy as np


class HFBasicEmbedder:
    """
    NOT A COMPONENT!
    """

    # Cache of created instances, keyed by a hash of class name and init params
    instances: Dict[int, "HFBasicEmbedder"] = {}

    def __new__(cls, *args, **kwargs):
        """
        Factory method.
        If an instance with identical params was already created,
        return that instance instead of initializing a new one.
        """
        # One possible key; it assumes all init params are hashable
        key = hash((cls.__name__, args, tuple(sorted(kwargs.items()))))
        if key in cls.instances:
            return cls.instances[key]

        embedder = super().__new__(cls)
        cls.instances[key] = embedder
        return embedder

    def __init__(self, model_name: str, **init_params: Any):
        """
        init takes the minimum parameters needed at init time, not
        the params needed at inference, so they're easier to reuse.
        """
        if hasattr(self, "model"):
            return  # reused instance: __init__ runs again, so don't reload
        self.model = ...

    def embed(self, data: List[str], **inference_params: Any) -> List[np.ndarray]:
        # compute embeddings
        embeddings = ...
        return embeddings


class OpenAIEmbedder:
    ...  # same as above
```

Given that Embedders are created through a factory method, when you request an instance and an identical one already exists, the method returns that instance instead of a new one.

This makes model reusability automatic in all cases, which can save lots of memory without asking the user to think about it.
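
The reuse behavior can be checked with a small runnable toy version. This is a sketch under assumptions, not the proposed implementation: `ToyBasicEmbedder`, `_instances`, `_loaded` and `load_count` are hypothetical names, no real model is loaded, and the init params are assumed to be hashable.

```python
from typing import Any, Dict, Tuple


class ToyBasicEmbedder:
    """Hypothetical stand-in for a BasicEmbedder; it loads no real model."""

    _instances: Dict[Tuple, "ToyBasicEmbedder"] = {}

    def __new__(cls, model_name: str, **init_params: Any):
        # Cache key built from the class name and the init params.
        key = (cls.__name__, model_name, tuple(sorted(init_params.items())))
        if key not in cls._instances:
            instance = super().__new__(cls)
            instance._loaded = False
            cls._instances[key] = instance
        return cls._instances[key]

    def __init__(self, model_name: str, **init_params: Any):
        # Python calls __init__ on every construction, even when __new__
        # returned a cached object, so guard the expensive loading step.
        if self._loaded:
            return
        self.model_name = model_name
        self.load_count = getattr(self, "load_count", 0) + 1
        self._loaded = True


first = ToyBasicEmbedder("model-a", device="cpu")
second = ToyBasicEmbedder("model-a", device="cpu")
third = ToyBasicEmbedder("model-b", device="cpu")

print(first is second)   # same parameters: the cached instance is reused
print(first is third)    # different model name: a separate instance
print(first.load_count)  # how many times the loading step ran for model-a
```

The guard in `__init__` matters: without it, every request for an existing instance would run the loading step again, which would defeat the memory and time savings.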

- A `(Text/Table/Image/Audio)Embedder` component that does nothing but wrap an Embedder
```python
@component
class HFTextEmbedder:

    class Input:
        data: List[str]

    class Output:
        embeddings: List[np.ndarray]

    def __init__(self, model_name: str, **model_params):
        self.model_name = model_name
        # init params forwarded to the BasicEmbedder at warm-up time
        self.model_params = model_params

    def warm_up(self):
        # The factory returns an existing identical instance if there is one
        self.embedder = HFBasicEmbedder(self.model_name, **self.model_params)

    def run(self, data):
        return self.output(self.embedder.embed(data.data))
```

- A `DocumentEmbedder` component that handles the Documents and offloads the computation to an Embedder

```python
@component
class HFDocumentEmbedder(HFTextEmbedder):
    """
    Note: in this toy example inheritance from HFTextEmbedder makes sense because
    init and warm_up are identical. If they differ significantly, let's remove
    the inheritance to simplify the architecture: it's not mandatory for the rest
    of the system to work.
    """

    class Input:
        documents: List[Document]

    class Output:
        documents: List[Document]

    def run(self, data):
        text_strings = [document.content for document in data.documents]
        embeddings = self.embedder.embed(text_strings)
        # Build new Documents that carry the vector in the `embedding` field
        documents_with_embeddings = [
            Document.from_dict({**doc.to_dict(), "embedding": emb})
            for doc, emb in zip(data.documents, embeddings)
        ]
        return self.output(documents=documents_with_embeddings)
```

## Different providers/strategies

- We can define different classes depending on the providers: `OpenAIEmbedder`, `CohereEmbedder`, `HuggingFaceEmbedder`, `SentenceTransformersEmbedder`, etc.
- Additionally, we could define different classes depending on the embedding strategy if necessary.
While this is not a prominent use case, there are scenarios where [new strategies](https://github.com/deepset-ai/haystack/issues/5242) are introduced, requiring different libraries (`InstructorEmbedder`) or involving a different string preparation (`E5Embedder`). Supporting these scenarios with minimal effort would be nice.
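
As an illustration of "a different string preparation", here is a minimal sketch assuming the prefix convention described in the E5 model cards (`"query: "` for queries, `"passage: "` for passages). The helper names are hypothetical and not part of this proposal.

```python
from typing import List

# Prefixes assumed from the E5 model cards; verify them for the model you use.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def prepare_e5_queries(queries: List[str]) -> List[str]:
    """Prepend the query prefix before the strings are embedded."""
    return [QUERY_PREFIX + query for query in queries]


def prepare_e5_passages(passages: List[str]) -> List[str]:
    """Prepend the passage prefix before the strings are embedded."""
    return [PASSAGE_PREFIX + passage for passage in passages]


prepared_queries = prepare_e5_queries(["What is an Embedder?"])
prepared_passages = prepare_e5_passages(["Embedders encode text into vectors."])
print(prepared_queries, prepared_passages)
```

With the proposed split, a text Embedder would apply the query preparation and a Document Embedder the passage preparation, so the user would not need to handle the prefixes manually.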

## Different models in the same embedding/retrieval task

As you can observe from the [current implementation](https://github.com/deepset-ai/haystack/blob/main/haystack/nodes/retriever/dense.py), some embedding/retrieval tasks require the usage of different models.

This is not the most popular approach today, compared to what we call Embedding Retrieval (based on a single model). But it still has some relevant applications.

Some examples:
- In Dense Passage Retrieval, you need a model to encode queries and another model to encode Documents.
- In the TableTextRetriever, we use 3 different models: one for queries, one for textual passages and one for tables.
- In Multimodal Retrieval, we can specify different models to encode queries and Documents.

Since the Embedder will not be included in the Retriever, it makes sense to have different Embedders, each one using a single model.

```python
dpr_query_embedder = SentenceTransformersTextEmbedder(model_name="facebook/dpr-question_encoder-single-nq-base")
dpr_doc_embedder = SentenceTransformersDocumentEmbedder(model_name="facebook/dpr-ctx_encoder-single-nq-base")
```

# Drawbacks

The drawbacks of separating the concepts of Retrievers and Embedders were discussed in [this proposal](https://github.com/deepset-ai/haystack/blob/main/proposals/text/4370-documentstores-and-retrievers.md) and mainly consist of **migration effort**.

Several ideas were discussed (mostly concerning how to handle queries and Documents in the Embedders) and this seems to be the best approach. However, there are some potential cons to consider:
- Proliferation of classes (though they will be small and easy to maintain).
- Users need to properly understand which models are appropriate for a task or for embedding queries vs. Documents (see [Different models in the same embedding/retrieval task](#different-models-in-the-same-embeddingretrieval-task)). On the other hand, this can help raise their awareness.

# Alternatives

Several alternatives to this design were considered. The main challenge was handling the differences between queries and Documents.
Some ideas:
- Have a single Embedder component for text (HFEmbedder instead of HFBasicEmbedder, HFTextEmbedder and HFDocumentEmbedder) and adapt Documents before and after that, using other Components. --> Many components.
- Make Embedders only work on Documents and represent the query as a Document. --> Unintuitive and requires changes in the Retriever.
- Create another primitive like Data (content + embedding) and use it for both queries and Documents. --> More conversion components like DataToDocument.
- Have the DocumentEmbedder take a Basic Embedder as an input parameter. --> Fewer classes but serialization issues.

# Adoption strategy

This change will constitute a part of Haystack v2.

# How we teach this

Documentation and tutorials will be of fundamental importance.

# Unresolved questions

- Migration and refactoring of existing Encoders hidden in Retrievers.
I prepared a table. Should it be shared here?
- The migration and refactoring of TableTextRetriever require input and ownership from people involved in TableQA.
- How to approach MultiModal Embedding? How many classes? Take into consideration that a query could also be an Image or a Table.