FastEmbed: Sparse Embedding crash if empty chunk #918

Closed
lambda-science opened this issue Jul 23, 2024 · 8 comments
Labels
bug Something isn't working integration:fastembed

Comments

@lambda-science
Contributor

lambda-science commented Jul 23, 2024

Describe the bug
Today while embedding a huge document I got this error:

Calculating sparse embeddings:  18%|█▊        | 1152/6362 [00:00<00:03, 1304.11it/s]
2024-07-23T11:52:27.357005349Z 07/23/2024 11:52:27 AM Traceback (most recent call last):
2024-07-23T11:52:27.357065849Z   File "/code/haystack_api/tasks.py", line 53, in index_file_task
2024-07-23T11:52:27.357073663Z     result = indexing_pipeline.run(
2024-07-23T11:52:27.357076959Z              ^^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357079875Z   File "/usr/local/lib/python3.11/site-packages/haystack/core/pipeline/pipeline.py", line 249, in run
2024-07-23T11:52:27.357082950Z     res: Dict[str, Any] = self._run_component(name, last_inputs[name])
2024-07-23T11:52:27.357085876Z                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357088961Z   File "/usr/local/lib/python3.11/site-packages/haystack/core/pipeline/pipeline.py", line 76, in _run_component
2024-07-23T11:52:27.357091997Z     res: Dict[str, Any] = instance.run(**inputs)
2024-07-23T11:52:27.357094852Z                           ^^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357097707Z   File "/usr/local/lib/python3.11/site-packages/haystack_integrations/components/embedders/fastembed/fastembed_sparse_document_embedder.py", line 159, in run
2024-07-23T11:52:27.357100763Z     embeddings = self.embedding_backend.embed(
2024-07-23T11:52:27.357103578Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357106403Z   File "/usr/local/lib/python3.11/site-packages/haystack_integrations/components/embedders/fastembed/embedding_backend/fastembed_backend.py", line 112, in embed
2024-07-23T11:52:27.357109459Z     for sparse_embedding in tqdm(
2024-07-23T11:52:27.357112294Z   File "/usr/local/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
2024-07-23T11:52:27.357115229Z     for obj in iterable:
2024-07-23T11:52:27.357118014Z   File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/sparse_text_embedding.py", line 96, in embed
2024-07-23T11:52:27.357143822Z     yield from self.model.embed(documents, batch_size, parallel, **kwargs)
2024-07-23T11:52:27.357146857Z   File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/bm25.py", line 173, in embed
2024-07-23T11:52:27.357149782Z     yield from self._embed_documents(
2024-07-23T11:52:27.357152638Z   File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/bm25.py", line 134, in _embed_documents
2024-07-23T11:52:27.357155633Z     yield from self.raw_embed(batch)
2024-07-23T11:52:27.357158448Z                ^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357161263Z   File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/bm25.py", line 205, in raw_embed
2024-07-23T11:52:27.357164249Z     embeddings.append(SparseEmbedding.from_dict(token_id2value))
2024-07-23T11:52:27.357167114Z                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357169969Z   File "/usr/local/lib/python3.11/site-packages/fastembed/sparse/sparse_embedding_base.py", line 25, in from_dict
2024-07-23T11:52:27.357173866Z     indices, values = zip(*data.items())
2024-07-23T11:52:27.357176762Z     ^^^^^^^^^^^^^^^
2024-07-23T11:52:27.357179587Z ValueError: not enough values to unpack (expected 2, got 0)
2024-07-23T11:52:27.357182552Z 

Basically sometimes my chunking can be empty (who knows why) which leads to a crash of the component.
I already noticed this behaviour when migrating my data to FastEmbed sparse embeddings, and worked around it like this:

            try:
                for sparse_embedding in sparse_embeddings_iterable:
                    sparse_embeddings.append(sparse_embedding.as_object())
            except ValueError:
                # Raised by FastEmbed for an empty chunk: fall back to an empty sparse embedding
                sparse_embeddings.append({"indices": [], "values": []})

This adds an empty sparse embedding in that case. For this component, though, I'm not sure how to do the same.
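
For the component case, one possible approach is to guard each text individually, so that a single failing chunk does not discard the embeddings that come after it. This is only a minimal sketch: `embed_with_fallback` and the `embed_one` callable are hypothetical names, not part of Haystack or FastEmbed, and `embed_one` is assumed to wrap whatever backend is in use and return a dict with `indices` and `values`.

```python
from typing import Callable, Dict, List

SparseDict = Dict[str, list]


def embed_with_fallback(
    texts: List[str],
    embed_one: Callable[[str], SparseDict],  # hypothetical wrapper around the real backend
) -> List[SparseDict]:
    """Embed texts one by one, substituting an empty sparse vector on ValueError."""
    results: List[SparseDict] = []
    for text in texts:
        try:
            results.append(embed_one(text))
        except ValueError:
            # Same failure mode as in the traceback: nothing to unpack for this text.
            results.append({"indices": [], "values": []})
    return results
```

The output stays aligned with the input, one entry per text. The trade-off is that calling the backend once per text gives up batching, so this is likely slower on large documents.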

To Reproduce

from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder
query_sparse_embedder = FastembedSparseTextEmbedder(model="Qdrant/bm25")
query_sparse_embedder.warm_up()
query_sparse_embedder.run("")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\haystack_integrations\components\embedders\fastembed\fastembed_sparse_text_embedder.py", line 112, in run   
    embedding = self.embedding_backend.embed(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\haystack_integrations\components\embedders\fastembed\embedding_backend\fastembed_backend.py", line 112, in embed
    for sparse_embedding in tqdm(
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\tqdm\std.py", line 1181, in __iter__
    for obj in iterable:
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\fastembed\sparse\sparse_text_embedding.py", line 96, in embed
    yield from self.model.embed(documents, batch_size, parallel, **kwargs)
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\fastembed\sparse\bm25.py", line 173, in embed
    yield from self._embed_documents(
    yield from self.raw_embed(batch)
               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\fastembed\sparse\bm25.py", line 205, in raw_embed
    embeddings.append(SparseEmbedding.from_dict(token_id2value))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmeyer\code-project\project-pythia\.venv\Lib\site-packages\fastembed\sparse\sparse_embedding_base.py", line 25, in from_dict
    indices, values = zip(*data.items())
    ^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 0)
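
The last frame points at the likely cause: unpacking `zip(*data.items())` into two names fails when `data` is empty, because `zip()` called with no arguments yields nothing. The sketch below reproduces that in plain Python and shows one possible guard. `split_items` is a hypothetical helper written for illustration, not FastEmbed's actual code.

```python
from typing import Dict, List, Tuple


def split_items(data: Dict[int, float]) -> Tuple[List[int], List[float]]:
    """Split a token-id -> weight mapping into parallel lists, tolerating an empty mapping."""
    if not data:
        # zip(*{}.items()) is just zip(), which is empty, so unpacking into two names would fail.
        return [], []
    indices, values = zip(*data.items())
    return list(indices), list(values)


# Reproduce the original failure without the guard.
error_message = ""
try:
    indices, values = zip(*{}.items())
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```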

Describe your environment

  • OS: Linux (Docker)
  • Haystack version: 2.3.0
  • Integration version: 1.2.0
@lambda-science lambda-science added the bug Something isn't working label Jul 23, 2024
@lambda-science
Contributor Author

Maybe this should be fixed at the root, in the FastEmbed package itself (@Anush008? 👀)

@lambda-science
Contributor Author

Note: I guess this also happens when the document content is only whitespace, for example (due to some splitter behaviour).
In my opinion, the component shouldn't raise an error in this case but should add an empty sparse vector instead.
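
One way to get that behaviour without touching the embedder is to set blank chunks aside before embedding and put empty sparse vectors back in their positions afterwards. A minimal sketch, assuming a hypothetical `embed_batch` callable that takes a list of non-blank strings and returns one dict per string; both function names are illustrative and not part of Haystack or FastEmbed.

```python
from typing import Callable, Dict, List

SparseDict = Dict[str, list]


def embed_skipping_blank(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[SparseDict]],  # hypothetical batch embedder
) -> List[SparseDict]:
    """Embed only non-blank texts and return empty sparse vectors for the blank ones."""
    # Positions of texts that contain something other than whitespace.
    keep = [i for i, text in enumerate(texts) if text.strip()]
    embedded = embed_batch([texts[i] for i in keep]) if keep else []

    # Start with an empty sparse vector everywhere, then fill in the real embeddings.
    results: List[SparseDict] = [{"indices": [], "values": []} for _ in texts]
    for position, embedding in zip(keep, embedded):
        results[position] = embedding
    return results
```

Unlike a per-text try/except, this keeps a single batched call to the backend. It only covers empty and whitespace-only input, though, not other inputs that might produce no tokens.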

@Anush008
Contributor

Hi @lambda-science.

Can you share a reproducible snippet that uses FastEmbed directly? I tried the following:

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="Qdrant/bm25")

next(model.embed(" ")).as_object()
next(model.embed("")).as_object()

I got empty vectors as expected.

@lambda-science
Contributor Author

lambda-science commented Jul 23, 2024

So it probably comes from the Haystack integration. Can you try this:

from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder
query_sparse_embedder = FastembedSparseTextEmbedder(model="Qdrant/bm25")
query_sparse_embedder.warm_up()
query_sparse_embedder.run("")

@lambda-science
Contributor Author

Crashes: query_sparse_embedder.run("")
Crashes: query_sparse_embedder.run("a")
Works: query_sparse_embedder.run("ah"), which returns
{'sparse_embedding': SparseEmbedding(indices=[2026917151], values=[1.6877434821696136])}

@Anush008
Contributor

That works too. It returns:

{'sparse_embedding': SparseEmbedding(indices=[], values=[])}

@lambda-science
Contributor Author

Huh?
Very surprising. I will try upgrading my packages. Right now I have:

fastembed==0.3.1
fastembed-haystack==1.2.0
haystack-ai==2.3.0

@lambda-science
Contributor Author

@Anush008 fastembed==0.3.4 solved it! Thank you very much, and sorry for bothering you! (Tbh this bug was not mentioned in the 0.3.4 patch notes, ahahah.)
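
To check which versions are installed before and after upgrading, something like the following may help. It uses only the standard library. The package names are the ones listed in this thread. The helper names are made up for this example, and the simple parser is an assumption that only handles plain numeric versions such as `0.3.4`.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted numeric version like '0.3.4' (no pre-release suffixes)."""
    return tuple(int(part) for part in version.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(version: str, minimum: str) -> bool:
    """Compare two plain numeric versions component by component."""
    return parse_version(version) >= parse_version(minimum)


for name in ("fastembed", "fastembed-haystack", "haystack-ai"):
    print(name, installed_version(name))
```

In this thread the crash was seen with fastembed 0.3.1 and not with 0.3.4, so `is_at_least(installed, "0.3.4")` is a reasonable check here. Which release in between actually contains the fix is not stated.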
