feat: MongoDB Atlas keyword search #1200

kanenorman · 2024-11-19T01:08:39Z

Related Issues

fixes Full-text MongoDBAtlas Retriever #872
please review in favor of feat: Support Mongodb full text search #1140

Proposed Changes:

Added MongoDBAtlasDocumentStore._fulltext_retrieval() and the corresponding MongoDBAtlasFullTextRetriever component, enabling users to leverage MongoDB Atlas's full-text search capabilities.

How did you test it?

Updated existing test to handle new init parameter full_text_search_index
Added test for MongoDBAtlasFullTextRetriever component to test_retriever.py

I did not implement a test for full-text retrieval. Similar to test_embedding_retrieval.py, it would likely be more secure and maintainable for a member of the Haystack team to write the test for a collection created in Haystack's MongoDB Atlas cluster, specifically in a file called test_fulltext_retrieval.py.. (I would also be willing to write test if that is preferred).

Notes for the reviewer

In MongoDBAtlasDocumentStore._fulltext_retrieval(), filters are currently passed directly to the $match stage, similar to how Langchain implements this behavior. While this approach works, it is not optimal for performance, as outlined in the MongoDB Atlas documentation. A more efficient solution would involve adding a filter clause to the compound operator. However, this requires additional work since the filter syntax used in the $match stage is not directly compatible with the $filter clause. Implementing this would require writing a new helper function, _normalize_compound_filter, to convert Haystack filter syntax into the format understood by MongoDB’s compound filter. Given that this would significantly expand the size & scope of this PR, I believe it is more appropriate to address this enhancement in a separate PR.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

kanenorman · 2024-11-19T05:20:55Z

I'm encountering a peculiar configuration error in CI when running this line of code in the test:

connection: MongoClient = MongoClient(
    os.environ["MONGO_CONNECTION_STRING"], 
    driver=DriverInfo(name="MongoDBAtlasHaystackIntegration")
)

The error is as follows:

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/synchronous/mongo_client.py:797: in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hosts = '', default_port = 27017

    def split_hosts(hosts: str, default_port: Optional[int] = DEFAULT_PORT) -> list[_Address]:
        """Takes a string of the form host1[:port],host2[:port]...
        Splits it into (host, port) tuples. If [:port] isn't present, the
        default_port is used.
        ...
        """
        nodes = []
        for entity in hosts.split(","):
            if not entity:
>               raise ConfigurationError("Empty host (or extra comma in host list).")
E               pymongo.errors.ConfigurationError: Empty host (or extra comma in host list).

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/uri_parser.py:392: ConfigurationError

This issue occurs only on Mac/Linux CI but not on Windows. Curiously, I cannot reproduce the failure locally on my Mac. I'll continue investigating the root cause of this inconsistency in the CI environment.

EDIT:
@mpangrazzi. Realized I won't have access to {{ secrets.MONGO_CONNECTION_STRING }} in the CI Workflow from a public fork.

anakin87 · 2024-11-21T09:51:21Z

Hey @kanenorman, thanks for the contribution and sorry for the delay...

I will try to have a look in the next few days!

vblagoje · 2024-11-25T13:12:35Z

@kanenorman my colleague @mpangrazzi and I will take over the review of this PR.

This PR already looks good. Before we dive into details, would you please:

add test_fulltext_retrieval.py as you suggested
if possible a new example for this feature in examples dir

Can we somehow use both text-search and vector indices? If so an example of both retrieval setups for our users would be great. If this is possible perhaps @mpangrazzi and I can work on this example.

Regarding CI - you are right. Let me see internally what we can do here short of migrating your branch to our main fork and two of us doing the final polishing, checks and the final integration.

kanenorman · 2024-11-25T22:49:06Z

@vblagoje - Thank you! I'll address the two bullet points you mentioned above.

Also,

Can we somehow use both text-search and vector indices? If so, an example of both retrieval setups for our users would be great. If this is possible, perhaps @mpangrazzi and I can work on this example.

For this, are you requesting to see a hybrid retrieval scenario where we demonstrate:

MongoDBAtlasFullTextRetriever utilizing MongoDBAtlasDocumentStore.full_text_search_index
MongoDBAtlasEmbeddingRetriever utilizing MongoDBAtlasDocumentStore.vector_search_index
Similar to the pgvector hybrid example

vblagoje · 2024-11-26T14:38:10Z

Yes @kanenorman - it would be nice for the community to see this example and for all of us to easily verify we nailed this one. 🙏

kanenorman · 2024-12-01T03:25:41Z

@vblagoje - I've added integrations/mongodb_atlas/examples/hybrid_retrieval.py to demonstrate how MongoDBAtlasFullTextRetriever and MongoDBAtlasEmbeddingRetriever can be used together to create a hybrid retrieval pipeline.

For the full-text search test in test_fulltext_retrieval.py, I completed the following setup:

Created the full_text_index in MongoDB atlas UI
Inserted test documents into the document store:

docs = [
    Document(
        content="The quick brown fox chased the dog",
        meta={"meta_field": "right_value"}
    ),
    Document(
        content="The fox was brown",
        meta={"meta_field": "right_value"}
     ),
    Document(content="The lazy dog"),
    Document(content="fox fox fox"),
]
document_store.write_documents(docs)

Added a synonym mapping for testing:

{"mappingType":"equivalent","synonyms":["fox","reynard"]}

Please let me know if you need any additional details or assistance setting up the test environment.

mpangrazzi · 2024-12-02T08:37:09Z

@kanenorman thanks for your work! 🙏 I'll review with @vblagoje soon!

CLAassistant · 2024-12-03T16:10:56Z

All committers have signed the CLA.

mpangrazzi · 2024-12-03T16:50:46Z

Hi @kanenorman! We've reviewed the code and opened this (on our branches we have no secrets issue and tests are running).

We updated test_fulltext_retrieval.py according to your comment, but it seems that some tests are still failing (see here).

Are they running correctly on your side? How have you created the full_text_index? Thanks!

kanenorman · 2024-12-03T17:43:30Z

Hi, @mpangrazzi. I apologize for not including this in my previous comment.

This is the index that I created where fox_synonyms is a collection containing the synonym mapping I mentioned in the previous comment. The test passed on my local machine when I ran with these configurations. please let me know if there are further issues.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "content": [
        {
          "type": "string"
        }
      ]
    }
  },
  "synonyms": [
    {
      "analyzer": "lucene.standard",
      "name": "synonym_mapping",
      "source": {
        "collection": "fox_synonyms"
      }
    }
  ]
}

mpangrazzi · 2024-12-04T10:36:01Z

@kanenorman Thanks for your answer! We have everything working correctly here. We'll do a final review soon.

mpangrazzi · 2024-12-05T13:22:55Z

@kanenorman Closing this since #1228 has been merged!

kanenorman added 4 commits November 18, 2024 16:10

feat: add full-text search capability

56e4783

feat: add full-text retriever

50c41a6

docs: update docs for mongodb atlas indexes

fcec092

docs: update usage example for document store

9e73b9e

kanenorman requested a review from a team as a code owner November 19, 2024 01:08

kanenorman requested review from mpangrazzi and removed request for a team November 19, 2024 01:08

github-actions bot added integration:mongodb-atlas type:documentation Improvements or additions to documentation labels Nov 19, 2024

kanenorman changed the title ~~Mongodb keyword search~~ feat: MongoDB Atlas keyword search Nov 19, 2024

Merge branch 'deepset-ai:main' into mongodb-keyword-search

aa4ae2b

anakin87 self-requested a review November 21, 2024 09:50

anakin87 assigned anakin87 and unassigned anakin87 Nov 21, 2024

julian-risch mentioned this pull request Nov 21, 2024

feat: Support Mongodb full text search #1140

Closed

vblagoje requested review from vblagoje and removed request for anakin87 November 21, 2024 14:14

Merge branch 'main' into mongodb-keyword-search

aaa59ae

kanenorman and others added 5 commits November 28, 2024 16:40

feat: update embedding retrieval example

bbf68c0

feat: add hybrid retrieval example

aa583e5

fix: correct typo for parameter name

9394e23

test: add full-text retrieval test

f20437a

Merge branch 'main' into mongodb-keyword-search

d00f57c

test: add test for full-text aggregation pipeline

ac475f3

tested examples ; minor refactor adding prints

2b3fad2

mpangrazzi mentioned this pull request Dec 3, 2024

Mongodb keyword search #1227

Closed

mpangrazzi added 2 commits December 3, 2024 09:49

fix lint

0da8899

fix test

3fba143

mpangrazzi mentioned this pull request Dec 3, 2024

Mongodb keyword search #1228

Merged

mpangrazzi closed this Dec 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: MongoDB Atlas keyword search #1200

feat: MongoDB Atlas keyword search #1200

kanenorman commented Nov 19, 2024 •

edited

Loading

kanenorman commented Nov 19, 2024 •

edited

Loading

anakin87 commented Nov 21, 2024

vblagoje commented Nov 25, 2024

kanenorman commented Nov 25, 2024

vblagoje commented Nov 26, 2024 •

edited

Loading

kanenorman commented Dec 1, 2024

mpangrazzi commented Dec 2, 2024

CLAassistant commented Dec 3, 2024 •

edited

Loading

mpangrazzi commented Dec 3, 2024 •

edited

Loading

kanenorman commented Dec 3, 2024 •

edited

Loading

mpangrazzi commented Dec 4, 2024

mpangrazzi commented Dec 5, 2024

feat: MongoDB Atlas keyword search #1200

feat: MongoDB Atlas keyword search #1200

Conversation

kanenorman commented Nov 19, 2024 • edited Loading

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

kanenorman commented Nov 19, 2024 • edited Loading

anakin87 commented Nov 21, 2024

vblagoje commented Nov 25, 2024

kanenorman commented Nov 25, 2024

vblagoje commented Nov 26, 2024 • edited Loading

kanenorman commented Dec 1, 2024

mpangrazzi commented Dec 2, 2024

CLAassistant commented Dec 3, 2024 • edited Loading

mpangrazzi commented Dec 3, 2024 • edited Loading

kanenorman commented Dec 3, 2024 • edited Loading

mpangrazzi commented Dec 4, 2024

mpangrazzi commented Dec 5, 2024

kanenorman commented Nov 19, 2024 •

edited

Loading

kanenorman commented Nov 19, 2024 •

edited

Loading

vblagoje commented Nov 26, 2024 •

edited

Loading

CLAassistant commented Dec 3, 2024 •

edited

Loading

mpangrazzi commented Dec 3, 2024 •

edited

Loading

kanenorman commented Dec 3, 2024 •

edited

Loading