feat(FastEmbed): Support for SPLADE Sparse Embedder #579

Merged
merged 37 commits into from
Apr 10, 2024
Changes from 19 commits
afc8e79
fix(opensearch): bulk error without create key
lambda-science Mar 6, 2024
9800f2b
Merge branch 'deepset-ai:main' into main
Mar 6, 2024
bf0221c
Merge branch 'deepset-ai:main' into main
Mar 6, 2024
aa95f13
feat(FastEmbed): Scaffold for SPLADE Sparse Embedding Support
lambda-science Mar 13, 2024
4b1d8f9
Revert "fix(opensearch): bulk error without create key"
lambda-science Mar 13, 2024
62d8478
feat(FastEmbed): __all__ fix
lambda-science Mar 13, 2024
0e0968a
feat(FastEmbed): fix one test
lambda-science Mar 13, 2024
1feea08
feat(FastEmbed): fix one test
lambda-science Mar 13, 2024
e1c5602
feat(FastEmbed): fix a second test
lambda-science Mar 13, 2024
a9b3827
feat(FastEmbed): removed old TODO (fixed)
lambda-science Mar 13, 2024
69129c8
feat(FastEmbed): fixing all test + doc
lambda-science Mar 13, 2024
10ea129
fix output typing
Mar 13, 2024
8e20cee
Fix output component
Mar 13, 2024
d4f836a
feat(FastEmbed): renaming SPLADE to Sparse because it makes more sense
lambda-science Mar 14, 2024
6cb0195
feat(FastEmbed): hatch run all lint
lambda-science Mar 14, 2024
a6de1e9
feat(FastEmbed): modify PR for haystack 2.1.0 with proper sparse vectors
lambda-science Mar 20, 2024
d37e788
try testing with Haystack main branch
anakin87 Mar 21, 2024
9a92c8e
Merge branch 'main' into fastembed-sparse
anakin87 Mar 21, 2024
0050a6b
update model name
anakin87 Mar 21, 2024
5ea12b5
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 21, 2024
14a8c2d
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 21, 2024
709ac12
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 21, 2024
11f8584
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 21, 2024
727b5ab
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 21, 2024
40cb5b6
feat(FastEmbed): remove prefix/suffix
lambda-science Mar 21, 2024
e7e1666
feat(FastEmbed): fix linting
lambda-science Mar 21, 2024
89f857d
feat(FastEmbed): suggestion for progress bar
lambda-science Mar 21, 2024
c956ee2
Merge branch 'main' into fastembed-sparse
lambda-science Mar 22, 2024
66bc952
feat(FastEmbed): return Haystack's SparseEmbedding instead of Dict
lambda-science Mar 22, 2024
97dd121
feat(FastEmbed): fix lint
lambda-science Mar 22, 2024
bc3f555
feat(Fastembed): run output type from dict to haystack sparseembeddin…
lambda-science Mar 22, 2024
9261122
feat(FastEmbed): reduce default sparse batch size
lambda-science Mar 22, 2024
a697433
Update integrations/fastembed/src/haystack_integrations/components/em…
Mar 22, 2024
a16fc9d
feat(FastEmbed): fix test
lambda-science Mar 22, 2024
d064cc5
Merge branch 'main' into fastembed-sparse
anakin87 Apr 10, 2024
a97c4ed
updates after 2.0.1 release
anakin87 Apr 10, 2024
1a8c707
small fixes; naive example
anakin87 Apr 10, 2024
6 changes: 4 additions & 2 deletions .github/workflows/fastembed.yml
@@ -43,7 +43,9 @@ jobs:

       - name: Run tests
         id: tests
-        run: hatch run cov
+        run: |
+          hatch run pip install git+https://github.com/deepset-ai/haystack.git #TODO: rm before merging
+          hatch run cov
Member

@lambda-science I temporarily modified the workflow to install Haystack from main.
This way we can experiment...

Contributor Author

PR is ready I guess, it's the last thing to do :)


       - name: Nightly - run unit tests with Haystack main branch
         if: github.event_name == 'schedule'
@@ -60,4 +62,4 @@ jobs:
             core-integrations failure:
             ${{ (steps.tests.conclusion == 'nightly-haystack-main') && 'nightly-haystack-main' || 'tests' }}
             - ${{ github.workflow }}
-          api-key: ${{ secrets.CORE_DATADOG_API_KEY }}
+          api-key: ${{ secrets.CORE_DATADOG_API_KEY }}
25 changes: 25 additions & 0 deletions integrations/fastembed/README.md
@@ -43,6 +43,31 @@ doc = Document(content="fastembed is supported by and maintained by Qdrant.", me
result = embedder.run(documents=[doc])
```

To compute sparse embeddings, import `FastembedSparseTextEmbedder` or `FastembedSparseDocumentEmbedder`:

```python
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

text = "fastembed is supported by and maintained by Qdrant."
text_embedder = FastembedSparseTextEmbedder(
    model="prithvida/Splade_PP_en_v1"
)
text_embedder.warm_up()
embedding = text_embedder.run(text)["embedding"]
```

```python
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder
from haystack.dataclasses import Document

embedder = FastembedSparseDocumentEmbedder(
    model="prithvida/Splade_PP_en_v1",
)
embedder.warm_up()
doc = Document(content="fastembed is supported by and maintained by Qdrant.", meta={"long_answer": "no",})
result = embedder.run(documents=[doc])
```

## License

`fastembed-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
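
Note for readers unfamiliar with sparse embeddings: according to the backend code in this PR, each sparse embedding is carried as an `indices` list of ints plus a `values` list of floats. The sketch below illustrates that representation without any third-party library. The helper names `to_sparse` and `sparse_dot` are hypothetical and are not part of fastembed or Haystack.

```python
from typing import Dict, List, Union

SparseDict = Dict[str, Union[List[int], List[float]]]


def to_sparse(dense: List[float]) -> SparseDict:
    """Keep only the non-zero entries of a dense vector (hypothetical helper)."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    return {"indices": indices, "values": [dense[i] for i in indices]}


def sparse_dot(a: SparseDict, b: SparseDict) -> float:
    """Dot product of two sparse vectors; only indices present in both contribute."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Toy vectors, not real SPLADE output.
query = to_sparse([0.0, 1.5, 0.0, 0.0, 2.0])
doc = to_sparse([0.5, 2.0, 0.0, 0.0, 0.0])
print(query)
print(sparse_dot(query, doc))
```

Here only index 1 is shared between the two toy vectors, so the score is `1.5 * 2.0 = 3.0`. Real sparse models produce much longer vocabularies with few non-zero entries, which is why storing only indices and values is attractive.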
2 changes: 1 addition & 1 deletion integrations/fastembed/pyproject.toml
@@ -26,7 +26,7 @@ classifiers = [
]
 dependencies = [
   "haystack-ai",
-  "fastembed>=0.2",
+  "fastembed>=0.2.4",
 ]

[project.urls]
@@ -2,6 +2,13 @@
#
# SPDX-License-Identifier: Apache-2.0
 from .fastembed_document_embedder import FastembedDocumentEmbedder
+from .fastembed_sparse_document_embedder import FastembedSparseDocumentEmbedder
+from .fastembed_sparse_text_embedder import FastembedSparseTextEmbedder
 from .fastembed_text_embedder import FastembedTextEmbedder
 
-__all__ = ["FastembedDocumentEmbedder", "FastembedTextEmbedder"]
+__all__ = [
+    "FastembedDocumentEmbedder",
+    "FastembedTextEmbedder",
+    "FastembedSparseDocumentEmbedder",
+    "FastembedSparseTextEmbedder",
+]
@@ -1,8 +1,9 @@
-from typing import ClassVar, Dict, List, Optional
+from typing import ClassVar, Dict, List, Optional, Union

from tqdm import tqdm

from fastembed import TextEmbedding
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding


class _FastembedEmbeddingBackendFactory:
@@ -50,3 +51,52 @@ def embed(self, data: List[str], progress_bar=True, **kwargs) -> List[List[float
):
embeddings.append(np_array.tolist())
return embeddings


class _FastembedSparseEmbeddingBackendFactory:
    """
    Factory class to create instances of fastembed sparse embedding backends.
    """

    _instances: ClassVar[Dict[str, "_FastembedSparseEmbeddingBackend"]] = {}

    @staticmethod
    def get_embedding_backend(
        model_name: str,
        cache_dir: Optional[str] = None,
        threads: Optional[int] = None,
    ):
        embedding_backend_id = f"{model_name}{cache_dir}{threads}"

        if embedding_backend_id in _FastembedSparseEmbeddingBackendFactory._instances:
            return _FastembedSparseEmbeddingBackendFactory._instances[embedding_backend_id]

        embedding_backend = _FastembedSparseEmbeddingBackend(
            model_name=model_name, cache_dir=cache_dir, threads=threads
        )
        _FastembedSparseEmbeddingBackendFactory._instances[embedding_backend_id] = embedding_backend
        return embedding_backend


class _FastembedSparseEmbeddingBackend:
    """
    Class to manage fastembed sparse embeddings.
    """

    def __init__(
        self,
        model_name: str,
        cache_dir: Optional[str] = None,
        threads: Optional[int] = None,
    ):
        self.model = SparseTextEmbedding(model_name=model_name, cache_dir=cache_dir, threads=threads)

    def embed(self, data: List[str], **kwargs) -> List[Dict[str, Union[List[int], List[float]]]]:
        # The embed method returns an Iterable[SparseEmbedding], so we convert it to a list of dictionaries.
        # Each dict contains an `indices` key holding a list of ints and a `values` key holding a list of floats.
        sparse_embeddings = [sparse_embedding.as_object() for sparse_embedding in self.model.embed(data, **kwargs)]
        for embedding in sparse_embeddings:
            embedding["indices"] = embedding["indices"].tolist()
            embedding["values"] = embedding["values"].tolist()

        return sparse_embeddings
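
The factory above caches one backend per `(model_name, cache_dir, threads)` combination, so several components configured identically share a single loaded model. A minimal sketch of that caching behaviour follows, assuming a stand-in `DummyBackend` class in place of the real model. `DummyBackend` and `DummyBackendFactory` are hypothetical names used only for illustration.

```python
from typing import ClassVar, Dict, Optional


class DummyBackend:
    """Stand-in for an expensive-to-load model (hypothetical, not fastembed)."""

    load_count = 0  # counts how many times a "model" was actually constructed

    def __init__(self, model_name: str, cache_dir: Optional[str] = None, threads: Optional[int] = None):
        DummyBackend.load_count += 1
        self.model_name = model_name


class DummyBackendFactory:
    """Mirrors the caching logic of the factory in this diff."""

    _instances: ClassVar[Dict[str, DummyBackend]] = {}

    @staticmethod
    def get_embedding_backend(
        model_name: str, cache_dir: Optional[str] = None, threads: Optional[int] = None
    ) -> DummyBackend:
        # Same key construction as in the diff: the three settings concatenated.
        backend_id = f"{model_name}{cache_dir}{threads}"
        if backend_id not in DummyBackendFactory._instances:
            DummyBackendFactory._instances[backend_id] = DummyBackend(model_name, cache_dir, threads)
        return DummyBackendFactory._instances[backend_id]


first = DummyBackendFactory.get_embedding_backend("model-a")
second = DummyBackendFactory.get_embedding_backend("model-a")
third = DummyBackendFactory.get_embedding_backend("model-a", threads=2)
print(first is second, first is third, DummyBackend.load_count)
```

The first two calls return the same object, while changing `threads` yields a new key and therefore a second instance. One design consequence worth keeping in mind: because the key is a plain string concatenation, the cache lives for the lifetime of the process and is never evicted.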
@@ -0,0 +1,170 @@
from typing import Any, Dict, List, Optional

from haystack import Document, component, default_to_dict

from .embedding_backend.fastembed_backend import _FastembedSparseEmbeddingBackendFactory


@component
class FastembedSparseDocumentEmbedder:
    """
    FastembedSparseDocumentEmbedder computes Document embeddings using Fastembed sparse models.

    Usage example:
    ```python
    # To use this component, install the "fastembed-haystack" package.
    # pip install fastembed-haystack

    from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder
    from haystack.dataclasses import Document

    doc_embedder = FastembedSparseDocumentEmbedder(
        model="prithvida/Splade_PP_en_v1",
        batch_size=256,
    )

    doc_embedder.warm_up()

    # Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
    document_list = [
        Document(
            content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
            meta={
                "pubid": "25,445,628",
                "long_answer": "yes",
            },
        ),
        Document(
            content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
            meta={
                "pubid": "25,445,712",
                "long_answer": "yes",
            },
        ),
    ]

    result = doc_embedder.run(document_list)
    print(f"Document Text: {result['documents'][0].content}")
    print(f"Document Embedding: {result['documents'][0].sparse_embedding}")
    # The sparse embedding is a dict with `indices` and `values`, so count its non-zero entries.
    print(f"Non-zero Entries: {len(result['documents'][0].sparse_embedding['indices'])}")
    ```
    """  # noqa: E501

    def __init__(
        self,
        model: str = "prithvida/Splade_PP_en_v1",
        cache_dir: Optional[str] = None,
        threads: Optional[int] = None,
        prefix: str = "",
        suffix: str = "",
        batch_size: int = 256,
        progress_bar: bool = True,
        parallel: Optional[int] = None,
        meta_fields_to_embed: Optional[List[str]] = None,
        embedding_separator: str = "\n",
    ):
        """
        Create a FastembedSparseDocumentEmbedder component.

        :param model: Local path or name of the model in Hugging Face's model hub,
            such as `prithvida/Splade_PP_en_v1`.
        :param cache_dir: The path to the cache directory.
            Can be set using the `FASTEMBED_CACHE_PATH` env variable.
            Defaults to `fastembed_cache` in the system's temp directory.
        :param threads: The number of threads a single onnxruntime session can use. Defaults to None.
        :param prefix: A string to add to the beginning of each text.
        :param suffix: A string to add to the end of each text.
        :param batch_size: Number of strings to encode at once.
        :param progress_bar: If True, displays a progress bar during embedding.
        :param parallel:
            If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
            If 0, use all available cores.
            If None, don't use data-parallel processing, use default onnxruntime threading instead.
        :param meta_fields_to_embed: List of meta fields that should be embedded along with the Document content.
        :param embedding_separator: Separator used to concatenate the meta fields to the Document content.
        """

        self.model_name = model
        self.cache_dir = cache_dir
        self.threads = threads
        self.prefix = prefix
        self.suffix = suffix
        self.batch_size = batch_size
        self.progress_bar = progress_bar
        self.parallel = parallel
        self.meta_fields_to_embed = meta_fields_to_embed or []
        self.embedding_separator = embedding_separator

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes the component to a dictionary.
        :returns:
            Dictionary with serialized data.
        """
        return default_to_dict(
            self,
            model=self.model_name,
            cache_dir=self.cache_dir,
            threads=self.threads,
            prefix=self.prefix,
            suffix=self.suffix,
            batch_size=self.batch_size,
            progress_bar=self.progress_bar,
            parallel=self.parallel,
            meta_fields_to_embed=self.meta_fields_to_embed,
            embedding_separator=self.embedding_separator,
        )

    def warm_up(self):
        """
        Initializes the component.
        """
        if not hasattr(self, "embedding_backend"):
            self.embedding_backend = _FastembedSparseEmbeddingBackendFactory.get_embedding_backend(
                model_name=self.model_name, cache_dir=self.cache_dir, threads=self.threads
            )

    def _prepare_texts_to_embed(self, documents: List[Document]) -> List[str]:
        texts_to_embed = []
        for doc in documents:
            meta_values_to_embed = [
                str(doc.meta[key]) for key in self.meta_fields_to_embed if key in doc.meta and doc.meta[key] is not None
            ]
            text_to_embed = (
                self.prefix + self.embedding_separator.join([*meta_values_to_embed, doc.content or ""]) + self.suffix
            )
            texts_to_embed.append(text_to_embed)
        return texts_to_embed

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        """
        Embeds a list of Documents.

        :param documents: List of Documents to embed.
        :returns: A dictionary with the following keys:
            - `documents`: List of Documents, each with its `sparse_embedding` field set to the computed sparse embedding.
        """
        if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)):
            msg = (
                "FastembedSparseDocumentEmbedder expects a list of Documents as input. "
                "In case you want to embed a string, please use the FastembedSparseTextEmbedder."
            )
            raise TypeError(msg)
        if not hasattr(self, "embedding_backend"):
            msg = "The embedding model has not been loaded. Please call warm_up() before running."
            raise RuntimeError(msg)

        texts_to_embed = self._prepare_texts_to_embed(documents=documents)
        embeddings = self.embedding_backend.embed(
            texts_to_embed,
            batch_size=self.batch_size,
            show_progress_bar=self.progress_bar,
            parallel=self.parallel,
        )

        for doc, emb in zip(documents, embeddings):
            doc.sparse_embedding = emb
        return {"documents": documents}
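
To make the text-preparation step easier to review, here is a rough standalone sketch of what `_prepare_texts_to_embed` does: selected meta values are stringified, joined with the document content using `embedding_separator`, and wrapped in `prefix` and `suffix`. It uses plain dicts instead of Haystack `Document` objects, and the function name `prepare_texts` is hypothetical.

```python
from typing import Any, Dict, List


def prepare_texts(
    documents: List[Dict[str, Any]],
    meta_fields_to_embed: List[str],
    embedding_separator: str = "\n",
    prefix: str = "",
    suffix: str = "",
) -> List[str]:
    """Build one string per document, mirroring the logic in the diff (dict-based stand-in)."""
    texts = []
    for doc in documents:
        meta = doc.get("meta", {})
        # Skip meta fields that are missing or explicitly None.
        meta_values = [str(meta[key]) for key in meta_fields_to_embed if meta.get(key) is not None]
        # A missing content falls back to the empty string.
        body = embedding_separator.join([*meta_values, doc.get("content") or ""])
        texts.append(prefix + body + suffix)
    return texts


docs = [
    {"content": "fastembed is supported by and maintained by Qdrant.", "meta": {"long_answer": "no"}},
    {"content": None, "meta": {"long_answer": None}},
]
texts = prepare_texts(docs, meta_fields_to_embed=["long_answer"], embedding_separator=" | ")
print(texts)
```

The second document shows the edge case: with no content and no usable meta value, the text to embed is an empty string, which the real component would still pass on to the model.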