feat(FastEmbed): Support for SPLADE Sparse Embedder #579
Merged · +910 −7
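For context, SPLADE-style sparse embedders score vocabulary tokens rather than producing a dense vector: most dimensions are zero, and each non-zero entry pairs a token id with a learned term weight. An illustrative sketch of that data shape (token ids and weights below are made up, not real model output):

```python
# Illustrative only: the shape of a SPLADE-style sparse embedding.
# Non-zero entries pair a vocabulary token id with a learned relevance weight.
sparse_embedding = {
    "indices": [1012, 2033, 7592],  # non-zero vocabulary positions (made up)
    "values": [0.18, 1.02, 0.67],   # corresponding term weights (made up)
}
```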
Commits (37):
- afc8e79 fix(opensearch): bulk error without create key (lambda-science)
- 9800f2b Merge branch 'deepset-ai:main' into main
- bf0221c Merge branch 'deepset-ai:main' into main
- aa95f13 feat(FastEmbed): Scaffold for SPLADE Sparse Embedding Support (lambda-science)
- 4b1d8f9 Revert "fix(opensearch): bulk error without create key" (lambda-science)
- 62d8478 feat(FastEmbed): __all__ fix (lambda-science)
- 0e0968a feat(FastEmbed): fix one test (lambda-science)
- 1feea08 feat(FastEmbed): fix one test (lambda-science)
- e1c5602 feat(FastEmbed): fix a second test (lambda-science)
- a9b3827 feat(FastEmbed): removed old TODO (fixed) (lambda-science)
- 69129c8 feat(FastEmbed): fixing all test + doc (lambda-science)
- 10ea129 fix output typing
- 8e20cee Fix output component
- d4f836a feat(FastEmbed): renaming SPLADE to Sparse because it makes more sense (lambda-science)
- 6cb0195 feat(FastEmbed): hatch run all lint (lambda-science)
- a6de1e9 feat(FastEmbed): modify PR for haystack 2.1.0 with proper sparse vectors (lambda-science)
- d37e788 try testing with Haystack main branch (anakin87)
- 9a92c8e Merge branch 'main' into fastembed-sparse (anakin87)
- 0050a6b update model name (anakin87)
- 5ea12b5 Update integrations/fastembed/src/haystack_integrations/components/em…
- 14a8c2d Update integrations/fastembed/src/haystack_integrations/components/em…
- 709ac12 Update integrations/fastembed/src/haystack_integrations/components/em…
- 11f8584 Update integrations/fastembed/src/haystack_integrations/components/em…
- 727b5ab Update integrations/fastembed/src/haystack_integrations/components/em…
- 40cb5b6 feat(FastEmbed): remove prefix/suffix (lambda-science)
- e7e1666 feat(FastEmbed): fix linting (lambda-science)
- 89f857d feat(FastEmbed): suggestion for progress bar (lambda-science)
- c956ee2 Merge branch 'main' into fastembed-sparse (lambda-science)
- 66bc952 feat(FastEmbed): return Haystack's SparseEmbedding instead of Dict (lambda-science)
- 97dd121 feat(FastEmbed): fix lint (lambda-science)
- bc3f555 feat(Fastembed): run output type from dict to haystack sparseembeddin… (lambda-science)
- 9261122 feat(FastEmbed): reduce default sparse batch size (lambda-science)
- a697433 Update integrations/fastembed/src/haystack_integrations/components/em…
- a16fc9d feat(FastEmbed): fix test (lambda-science)
- d064cc5 Merge branch 'main' into fastembed-sparse (anakin87)
- a97c4ed updates after 2.0.1 release (anakin87)
- 1a8c707 small fixes; naive example (anakin87)
New file: ...aystack_integrations/components/embedders/fastembed/fastembed_sparse_document_embedder.py (160 additions, 0 deletions)
````python
from typing import Any, Dict, List, Optional

from haystack import Document, component, default_to_dict

from .embedding_backend.fastembed_backend import _FastembedSparseEmbeddingBackendFactory


@component
class FastembedSparseDocumentEmbedder:
    """
    FastembedSparseDocumentEmbedder computes Document embeddings using Fastembed sparse models.

    Usage example:
    ```python
    # To use this component, install the "fastembed-haystack" package.
    # pip install fastembed-haystack

    from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder
    from haystack.dataclasses import Document

    doc_embedder = FastembedSparseDocumentEmbedder(
        model="prithvida/Splade_PP_en_v1",
        batch_size=32,
    )

    doc_embedder.warm_up()

    # Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
    document_list = [
        Document(
            content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
            meta={
                "pubid": "25,445,628",
                "long_answer": "yes",
            },
        ),
        Document(
            content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
            meta={
                "pubid": "25,445,712",
                "long_answer": "yes",
            },
        ),
    ]

    result = doc_embedder.run(document_list)
    print(f"Document Text: {result['documents'][0].content}")
    print(f"Document Embedding: {result['documents'][0].sparse_embedding}")
    print(f"Embedding Dimension: {len(result['documents'][0].sparse_embedding)}")
    ```
    """  # noqa: E501

    def __init__(
        self,
        model: str = "prithvida/Splade_PP_en_v1",
        cache_dir: Optional[str] = None,
        threads: Optional[int] = None,
        batch_size: int = 32,
        progress_bar: bool = True,
        parallel: Optional[int] = None,
        meta_fields_to_embed: Optional[List[str]] = None,
        embedding_separator: str = "\n",
    ):
        """
        Create a FastembedSparseDocumentEmbedder component.

        :param model: Local path or name of the model in Hugging Face's model hub,
            such as `prithvida/Splade_PP_en_v1`.
        :param cache_dir: The path to the cache directory.
            Can be set using the `FASTEMBED_CACHE_PATH` env variable.
            Defaults to `fastembed_cache` in the system's temp directory.
        :param threads: The number of threads a single onnxruntime session can use.
        :param batch_size: Number of strings to encode at once.
        :param progress_bar: If `True`, displays a progress bar during embedding.
        :param parallel:
            If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
            If 0, use all available cores.
            If None, don't use data-parallel processing; use the default onnxruntime threading instead.
        :param meta_fields_to_embed: List of meta fields that should be embedded along with the Document content.
        :param embedding_separator: Separator used to concatenate the meta fields to the Document content.
        """

        self.model_name = model
        self.cache_dir = cache_dir
        self.threads = threads
        self.batch_size = batch_size
        self.progress_bar = progress_bar
        self.parallel = parallel
        self.meta_fields_to_embed = meta_fields_to_embed or []
        self.embedding_separator = embedding_separator

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes the component to a dictionary.

        :returns:
            Dictionary with serialized data.
        """
        return default_to_dict(
            self,
            model=self.model_name,
            cache_dir=self.cache_dir,
            threads=self.threads,
            batch_size=self.batch_size,
            progress_bar=self.progress_bar,
            parallel=self.parallel,
            meta_fields_to_embed=self.meta_fields_to_embed,
            embedding_separator=self.embedding_separator,
        )

    def warm_up(self):
        """
        Initializes the component.
        """
        if not hasattr(self, "embedding_backend"):
            self.embedding_backend = _FastembedSparseEmbeddingBackendFactory.get_embedding_backend(
                model_name=self.model_name, cache_dir=self.cache_dir, threads=self.threads
            )

    def _prepare_texts_to_embed(self, documents: List[Document]) -> List[str]:
        texts_to_embed = []
        for doc in documents:
            meta_values_to_embed = [
                str(doc.meta[key]) for key in self.meta_fields_to_embed if key in doc.meta and doc.meta[key] is not None
            ]
            text_to_embed = self.embedding_separator.join([*meta_values_to_embed, doc.content or ""])

            texts_to_embed.append(text_to_embed)
        return texts_to_embed

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        """
        Embeds a list of Documents.

        :param documents: List of Documents to embed.
        :returns: A dictionary with the following keys:
            - `documents`: List of Documents with each Document's `sparse_embedding`
              field set to the computed embeddings.
        """
        if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)):
            msg = (
                "FastembedSparseDocumentEmbedder expects a list of Documents as input. "
                "In case you want to embed a list of strings, please use the FastembedTextEmbedder."
            )
            raise TypeError(msg)
        if not hasattr(self, "embedding_backend"):
            msg = "The embedding model has not been loaded. Please call warm_up() before running."
            raise RuntimeError(msg)

        texts_to_embed = self._prepare_texts_to_embed(documents=documents)
        embeddings = self.embedding_backend.embed(
            texts_to_embed,
            batch_size=self.batch_size,
            show_progress_bar=self.progress_bar,
            parallel=self.parallel,
        )

        for doc, emb in zip(documents, embeddings):
            doc.sparse_embedding = emb
        return {"documents": documents}
````
Review comments:
@lambda-science I temporarily modified the workflow to install Haystack from main.
This way we can experiment...
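For anyone reproducing that experiment locally, the equivalent of the workflow change is installing Haystack from its main branch before the integration; a hypothetical sketch (the actual CI step may differ):

```python
# Hypothetical local equivalent of the CI change; the real workflow step may differ.
# pip install "git+https://github.com/deepset-ai/haystack.git@main"
# pip install fastembed-haystack

from importlib.metadata import version

print(version("haystack-ai"))  # expect an unreleased dev version when installed from main
```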
PR is ready, I guess; it's the last thing to do :)