The filters are not working with metadata that contain a space. #616

Closed
PAHXO opened this issue Mar 23, 2024 · 5 comments · Fixed by #639

@PAHXO

PAHXO commented Mar 23, 2024

Greetings.

The filters of the Elasticsearch retrievers (BM25, embedding, and the filter retriever) don't select string metadata that has a space in it.

from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

docs = [
    Document(content="Python is a popular programming language",
            meta={"about": "Python language", "language": "english"}),

    Document(content="python ist eine beliebte Programmiersprache",
             meta={"about": "Python language", "language": "german"}),
]

document_store = ElasticsearchDocumentStore(hosts=something)
document_store.write_documents(docs, policy=DuplicatePolicy.OVERWRITE)
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.about", "operator": "in", "value": ["Python language"]},
        {"field": "meta.language", "operator": "in", "value": ["english"]}
    ]}
retriever = FilterRetriever(document_store)
result = retriever.run(filters={"field": "about", "operator": "==", "value": "Python language"})

# This does not work, because "Python language" has a space in it.
@PAHXO PAHXO added the bug Something isn't working label Mar 23, 2024
@julian-risch julian-risch added the P1 label Apr 2, 2024
@julian-risch julian-risch self-assigned this Apr 2, 2024
@julian-risch
Member

julian-risch commented Apr 2, 2024

Hello @PAHXO, thank you for reporting this issue. After a first look, my understanding is that the issue is caused by the integration using a term query instead of a match query on a text field under the hood. The unexpected behavior is not limited to metadata with whitespace. For example, I have two documents with "Python" as metadata, but a filter for "Python" does not retrieve them; only "python" does. Elasticsearch does not analyze the search value of a term query, only that of a match query. That means if the meta field was analyzed at index time (in this example, lowercased), filters can unexpectedly fail to match.

return {"match": {field: {"query": value, "minimum_should_match": "100%"}}}
return {"term": {field: value}}

I'll continue looking into it.

from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

docs = [
    Document(content="Python is a popular programming language",
            meta={"about": "Python", "language": "english"}),

    Document(content="python ist eine beliebte Programmiersprache",
             meta={"about": "Python", "language": "german"}),
]

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")
document_store.write_documents(docs, policy=DuplicatePolicy.OVERWRITE)
retriever = FilterRetriever(document_store)
result = retriever.run(filters={"field": "about", "operator": "==", "value": "Python"})
print(result["documents"])  # no document retrieved
result = retriever.run(filters={"field": "about", "operator": "==", "value": "python"})
print(result["documents"])  # both documents retrieved

For reference, here is the relevant page of the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html#term-query-notes
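
To make the term-versus-match difference concrete, here is a minimal, self-contained sketch that does not talk to Elasticsearch at all. It only imitates the idea: `analyze` is a rough stand-in for the standard analyzer (split into words, lowercase), and `build_index`, `term_query`, and `match_query_all_terms` are hypothetical helper names invented for this illustration, not functions of Elasticsearch or of the Haystack integration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyzer: split into word tokens and lowercase them."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(values):
    """Map every analyzed token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        for token in analyze(value):
            index[token].add(doc_id)
    return index


def term_query(index, value):
    """Look the value up verbatim, without analyzing it first."""
    return set(index.get(value, set()))


def match_query_all_terms(index, value, doc_count):
    """Analyze the value first, then require every resulting token to be present."""
    tokens = analyze(value)
    if not tokens:
        return set()
    hits = set(range(doc_count))
    for token in tokens:
        hits &= index.get(token, set())
    return hits


# Two documents whose metadata value was analyzed into the tokens "python" and "language".
values = ["Python language", "Python language"]
index = build_index(values)

term_hits_spaced = term_query(index, "Python language")  # no token equals the whole string
term_hits_cased = term_query(index, "Python")            # indexed token is lowercase
term_hits_lower = term_query(index, "python")            # matches the stored token
match_hits = match_query_all_terms(index, "Python language", len(values))

print(term_hits_spaced, term_hits_cased, term_hits_lower, match_hits)
```

In this toy model the verbatim lookups for "Python language" and "Python" find nothing, while "python" and the analyzed lookup find both documents, which mirrors the behavior reported in this thread.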

@julian-risch
Member

One way to fix the behavior would be to adjust the mapping that you use for your Elasticsearch index so that the metadata in the about field is not analyzed.
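
As a rough sketch of what such a mapping could look like: Elasticsearch does not analyze fields of type `keyword`, so exact-value filters on them behave as expected. The helper names below are hypothetical, and the flat field names `about`, `language`, and `content` are assumptions based on the examples in this thread. How the mapping is passed to the index depends on your setup and is not shown here.

```python
import json


def keyword_mapping(field_names):
    """Build an explicit mapping that stores the given fields as non-analyzed keywords."""
    return {"properties": {name: {"type": "keyword"} for name in field_names}}


def keyword_dynamic_mapping():
    """Map every new string field to keyword, while keeping `content` as analyzed text."""
    return {
        "properties": {
            # Assumption: the document text lives in `content` and should stay searchable.
            "content": {"type": "text"},
        },
        "dynamic_templates": [
            {
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
    }


explicit = keyword_mapping(["about", "language"])
dynamic = keyword_dynamic_mapping()

print(json.dumps(explicit, indent=2))
print(json.dumps(dynamic, indent=2))
```

The explicit variant only covers fields known in advance; the dynamic-template variant covers any string metadata field that appears later, at the cost of full-text search on those fields.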

@julian-risch
Member

@PAHXO I opened a PR that fixes this issue: #639
Thanks again for bringing this issue to our attention. 👍

@julian-risch
Member

@PAHXO The bug is fixed now and there is a new release of elasticsearch-haystack on PyPI: https://pypi.org/project/elasticsearch-haystack/
Running pip install --upgrade elasticsearch-haystack should do the trick.

@PAHXO
Author

PAHXO commented Apr 3, 2024

I'll be sure to try it as soon as I can! Thanks for taking the time to look into the issue, @julian-risch 🫡
