-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: MongoDB Atlas keyword search #1200
Conversation
I'm encountering a peculiar configuration error in CI when running this line of code in the test: connection: MongoClient = MongoClient(
os.environ["MONGO_CONNECTION_STRING"],
driver=DriverInfo(name="MongoDBAtlasHaystackIntegration")
) The error is as follows: /home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/synchronous/mongo_client.py:797: in __init__
seeds.update(uri_parser.split_hosts(entity, port))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hosts = '', default_port = 27017
def split_hosts(hosts: str, default_port: Optional[int] = DEFAULT_PORT) -> list[_Address]:
"""Takes a string of the form host1[:port],host2[:port]...
Splits it into (host, port) tuples. If [:port] isn't present, the
default_port is used.
...
"""
nodes = []
for entity in hosts.split(","):
if not entity:
> raise ConfigurationError("Empty host (or extra comma in host list).")
E pymongo.errors.ConfigurationError: Empty host (or extra comma in host list).
/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/uri_parser.py:392: ConfigurationError This issue occurs only on Mac/Linux CI but not on Windows. Curiously, I cannot reproduce the failure locally on my Mac. I'll continue investigating the root cause of this inconsistency in the CI environment. EDIT: |
Hey @kanenorman, thanks for the contribution and sorry for the delay... I will try to have a look in the next few days! |
@kanenorman my colleague @mpangrazzi and I will take over the review of this PR. This PR already looks good. Before we dive into details, would you please:
Can we somehow use both text-search and vector indices? If so an example of both retrieval setups for our users would be great. If this is possible perhaps @mpangrazzi and I can work on this example. Regarding CI - you are right. Let me see internally what we can do here short of migrating your branch to our main fork and two of us doing the final polishing, checks and the final integration. |
@vblagoje - Thank you! I'll address the two bullet points you mentioned above. Also,
For this, are you requesting to see a hybrid retrieval scenario where we demonstrate:
|
Yes @kanenorman - it would be nice for the community to see this example and for all of us to easily verify we nailed this one. 🙏 |
@vblagoje - I've added For the full-text search test in
docs = [
Document(
content="The quick brown fox chased the dog",
meta={"meta_field": "right_value"}
),
Document(
content="The fox was brown",
meta={"meta_field": "right_value"}
),
Document(content="The lazy dog"),
Document(content="fox fox fox"),
]
document_store.write_documents(docs)
{"mappingType":"equivalent","synonyms":["fox","reynard"]} Please let me know if you need any additional details or assistance setting up the test environment. |
@kanenorman thanks for your work! 🙏 I'll review with @vblagoje soon! |
Hi @kanenorman! We've reviewed the code and opened this (on our branches we have no secrets issue and tests are running). We updated test_fulltext_retrieval.py according to your comment, but it seems that some tests are still failing (see here). Are they running correctly on your side? How have you created the |
Hi, @mpangrazzi. I apologize for not including this in my previous comment. This is the index that I created where {
"mappings": {
"dynamic": false,
"fields": {
"content": [
{
"type": "string"
}
]
}
},
"synonyms": [
{
"analyzer": "lucene.standard",
"name": "synonym_mapping",
"source": {
"collection": "fox_synonyms"
}
}
]
} |
@kanenorman Thanks for your answer! We have everything working correctly here. We'll do a final review soon. |
@kanenorman Closing this since #1228 has been merged! |
Related Issues
Proposed Changes:
Added
MongoDBAtlasDocumentStore._fulltext_retrieval()
and the correspondingMongoDBAtlasFullTextRetriever
component, enabling users to leverage MongoDB Atlas's full-text search capabilities.How did you test it?
full_text_search_index
MongoDBAtlasFullTextRetriever
component totest_retriever.py
I did not implement a test for full-text retrieval. Similar to
test_embedding_retrieval.py
, it would likely be more secure and maintainable for a member of the Haystack team to write the test for a collection created in Haystack's MongoDB Atlas cluster, specifically in a file calledtest_fulltext_retrieval.py.
. (I would also be willing to write test if that is preferred).Notes for the reviewer
In
MongoDBAtlasDocumentStore._fulltext_retrieval()
, filters are currently passed directly to the$match
stage, similar to how Langchain implements this behavior. While this approach works, it is not optimal for performance, as outlined in the MongoDB Atlas documentation. A more efficient solution would involve adding afilter
clause to thecompound
operator. However, this requires additional work since the filter syntax used in the$match
stage is not directly compatible with the$filter
clause. Implementing this would require writing a new helper function,_normalize_compound_filter
, to convert Haystack filter syntax into the format understood by MongoDB’s compound filter. Given that this would significantly expand the size & scope of this PR, I believe it is more appropriate to address this enhancement in a separate PR.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.