Implement Embedders components (2.0) #5312

Closed
4 tasks done
Tracked by #5311 ...
ZanSara opened this issue Jul 10, 2023 · 0 comments
Labels
2.x Related to Haystack v2.0

ZanSara commented Jul 10, 2023

This component's aim is to create embeddings starting from raw data. In practice, it will take Documents with no embeddings and return an equally long list of Documents with embeddings.

We may have different embedders depending on the embedding strategy or provider, for example HuggingFaceEmbedder, OpenAIEmbedder, ... In that case, more than one PR will be linked to this issue.

Minimal API draft:

@component
class HuggingFaceEmbedder:

    class Input:
        documents: List[Document]
        ... other input params ...

    class Output:
        documents: List[Document]

    def __init__(self, model: str, ... other init params ...):
        self.model_name = model
    
    ....

or alternatively (see Open Questions):

@component
class HuggingFaceEmbedder:

    class Input:
        data: List[<type defined at init time: can be str for text, pd.DataFrame for tables, PIL.Image for images...>]
        ... other input params ...

    class Output:
        embeddings: List[np.ndarray]

    def __init__(self, model: str, ... other init params ...):
        self.model_name = model
    
    ....
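
To make the difference between the two drafts concrete, here is a minimal sketch of both shapes side by side. Everything in it is an assumption made for illustration: the `Document` dataclass is a stand-in for the real one, `toy_embed` replaces the actual model call, the class names and the plain `run` method are hypothetical (they are not the `@component` API), and lists of floats stand in for `np.ndarray`.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for the real Document dataclass."""
    content: str
    embedding: Optional[List[float]] = None


def toy_embed(text: str) -> List[float]:
    """Placeholder for a real model call: a deterministic 3-dimensional vector."""
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 7)]


class DocumentEmbedder:
    """Draft 1 shape: Documents in, an equally long list of Documents out."""

    def run(self, documents: List[Document]) -> List[Document]:
        # Return copies so the input Documents are left untouched.
        return [replace(doc, embedding=toy_embed(doc.content)) for doc in documents]


class RawDataEmbedder:
    """Draft 2 shape: raw data in, bare embeddings out."""

    def run(self, data: List[str]) -> List[List[float]]:
        return [toy_embed(item) for item in data]


docs = [Document("hello world"), Document("embedders")]
embedded_docs = DocumentEmbedder().run(docs)
embeddings = RawDataEmbedder().run([doc.content for doc in docs])
```

Both variants compute the same vectors; they differ only in whether the embedder or the caller is responsible for attaching them to Documents.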

Open questions

Embedders need to be used both in indexing pipelines, to add embeddings to Documents, and in query pipelines, in front of an EmbeddingRetriever.

graph TD;

subgraph "Query"
IN1([input]) -- "queries (List[str])" --> A1[embedder]
A1 -- "queries (List[np.ndarray])" --> B1[retriever]
B1 -- "documents (List[List[Document]])" --> OUT1([output])
end

subgraph "Indexing"
IN2([input]) -- "paths (List[Path])" --> A2[file converter]
A2 -- "documents (List[Document])" --> B2[preprocessor]
B2 -- "documents (List[Document])" --> C2[embedder]
C2 -- "documents (List[Document])" --> D2[write2store]
end

The first API draft above is strictly oriented towards indexing, as it takes a list of Documents as input. In that form, it would not be compatible with a query pipeline, which needs to take plain strings as input and pass bare embeddings to MemoryEmbeddingRetriever.
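
The mismatch can be sketched as follows. This is only an illustration under stated assumptions: `Document`, `toy_embed`, `embed_documents` and `embed_queries_via_documents` are hypothetical names, the embedding function is a placeholder for a real model, and lists of floats stand in for `np.ndarray`. It shows the glue code a query pipeline would need around a Documents-only embedder.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for the real Document dataclass."""
    content: str
    embedding: Optional[List[float]] = None


def toy_embed(text: str) -> List[float]:
    """Placeholder for a real model call."""
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 7)]


def embed_documents(documents: List[Document]) -> List[Document]:
    """Draft 1 shape: only accepts and returns Documents."""
    return [replace(doc, embedding=toy_embed(doc.content)) for doc in documents]


def embed_queries_via_documents(queries: List[str]) -> List[List[float]]:
    """Glue needed on the query side when the embedder only speaks Documents."""
    # 1. Wrap each query string in a throwaway Document.
    wrapped = [Document(content=query) for query in queries]
    # 2. Run the Documents-only embedder.
    embedded = embed_documents(wrapped)
    # 3. Unwrap the bare embeddings for the retriever.
    return [doc.embedding for doc in embedded]


query_embeddings = embed_queries_via_documents(["what is haystack?", "embedders"])
```

The wrapping and unwrapping steps carry no information of their own, which is why the options below try to remove them.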

There are several strategies we can go for:

  • Make Embedders work with raw data, not Documents (API draft 2)
    • PRO: they are able to ingest anything, depending on the model given, which makes them extremely flexible
    • CON: at indexing time we need to match the embeddings with Documents in a separate component
graph TD;
IN2([input]) -- "paths (List[Path])" --> A2[file converter]
A2 -- "documents (List[Document])" --> B2[preprocessor]
B2 -- "data (List[str])" --> C2[embedder]
B2 -- "documents (List[Document])" --> E2[match_embeddings_with_documents]
C2 -- "embeddings (List[np.ndarray])" --> E2
E2 -- "documents (List[Document])" --> D2[write2store]
  • Create another primitive, like Data, and make Document inherit from it. Then Embedders can deal with Data objects

    • PRO: this abstraction can be reused for all components that work in both indexing and query pipelines
    • CON: we may need more conversion components like DataToDocument
  • Make MemoryEmbeddingRetriever accept a Document as input

    • PRO: works like the above but with one fewer dataclass
    • CON: unintuitive
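
For the first strategy, the extra matching step at indexing time could look roughly like the sketch below. It assumes that embeddings arrive in the same order as the Documents they were computed from, which the diagram implies but does not state. The `Document` dataclass is a hypothetical stand-in, the function name is borrowed from the diagram node, and lists of floats stand in for `np.ndarray`.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for the real Document dataclass."""
    content: str
    embedding: Optional[List[float]] = None


def match_embeddings_with_documents(
    documents: Sequence[Document],
    embeddings: Sequence[Sequence[float]],
) -> List[Document]:
    """Pair each Document with the embedding at the same position."""
    # Positional matching is only safe if both lists have the same length.
    if len(documents) != len(embeddings):
        raise ValueError(
            f"Got {len(documents)} documents but {len(embeddings)} embeddings"
        )
    return [
        replace(doc, embedding=list(emb))
        for doc, emb in zip(documents, embeddings)
    ]


docs = [Document("first chunk"), Document("second chunk")]
embeddings = [[0.1, 0.2], [0.3, 0.4]]
matched = match_embeddings_with_documents(docs, embeddings)
```

The length check is the only safeguard available here: if an upstream component drops or reorders items, positional matching silently attaches the wrong vectors, which is part of the cost listed as a CON for this option.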

Tasks

  1. 2.x
    anakin87
  2. 2.x
    anakin87
  3. 2.x good first issue
    masci
  4. 2.x
    anakin87