Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: fix semantic answer bug in AzureSearch vector store #18938

Merged
merged 1 commit into from
Mar 26, 2024
Merged

community: fix semantic answer bug in AzureSearch vector store #18938

merged 1 commit into from
Mar 26, 2024

Conversation

Skar0
Copy link
Contributor

@Skar0 Skar0 commented Mar 11, 2024

  • Description: The semantic_hybrid_search_with_score_and_rerank method of AzureSearch contains a hardcoded field name "metadata" for the document metadata in the Azure AI Search Index. Adding such a field is optional when creating an Azure AI Search Index, as other snippets from AzureSearch test for the existence of this field before trying to access it. Furthermore, the metadata field name shouldn't be hardcoded as "metadata" and use the FIELDS_METADATA variable that defines this field name instead. In the current implementation, any index without a metadata field named "metadata" will yield an error if a semantic answer is returned by the search in semantic_hybrid_search_with_score_and_rerank.

  • Issue: Azure AI Search, metadata field is required and hardcoded in langchain community #18731

  • Prior fix to this bug: This bug was fixed in this PR community: Replaced hardcoded "metadata" with FIELDS_METADATA variable in semantic_hybrid_search_with_score_and_rerank #15642 by adding a check for the existence of the metadata field named FIELDS_METADATA and retrieving a value for the key called "key" in that metadata if it exists. If the field named FIELDS_METADATA was not present, an empty string was returned. This fix was removed in this PR community: update AzureSearch class to work with azure-search-documents=11.4.0 #15659 (see ed1ffca).
    @lz-chen: could you confirm this wasn't intentional?

  • New fix to this bug: I believe there was an oversight in the logic of the fix from #1564 which I explain below.
    The semantic_hybrid_search_with_score_and_rerank method creates a dictionary semantic_answers_dict with semantic answers returned by the search as follows.

    # Get Semantic Answers
    semantic_answers = results.get_answers() or []
    semantic_answers_dict: Dict = {}
    for semantic_answer in semantic_answers:
    semantic_answers_dict[semantic_answer.key] = {
    "text": semantic_answer.text,
    "highlights": semantic_answer.highlights,
    }

    The keys in this dictionary are the unique document ids in the index, if I understand the documentation of semantic answers in Azure AI Search correctly. When the method transforms a search result into a Document object, an "answer" key is added to the document's metadata. The value for this "answer" key should be the semantic answer returned by the search from this document, if such an answer is returned. The match between a Document object and the semantic answers returned by the search should be done through the unique document id, which is used as a key for the semantic_answers_dict dictionary. This id is defined in the search result's field named FIELDS_ID. I added a check to avoid any error in case no field named FIELDS_ID exists in a search result (which shouldn't happen in theory).
    A benefit of this approach is that this fix should work whether or not the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix?
@raunakshrivastava7 do you agree with the fix?

Thanks for the help!

…ield named 'key' from metadata in semantic_hybrid_search_with_score
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 11, 2024
Copy link

vercel bot commented Mar 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Mar 11, 2024 9:41pm

@levalencia
Copy link
Contributor

not sure when I will have time to test it, but just to be sure, which version was this PR released?

@Skar0
Copy link
Contributor Author

Skar0 commented Mar 18, 2024

Do you mean which azure-search-documents version?

@Skar0
Copy link
Contributor Author

Skar0 commented Mar 18, 2024

@efriis should I perform any other actions to validate the changes in this PR? I'm not sure how to find people to review such a detail (although it's a real bug).

@baskaryan baskaryan merged commit a6cbb75 into langchain-ai:master Mar 26, 2024
59 checks passed
gkorland pushed a commit to FalkorDB/langchain that referenced this pull request Mar 30, 2024
…langchain-ai#18938)

- **Description:** The `semantic_hybrid_search_with_score_and_rerank`
method of `AzureSearch` contains a hardcoded field name "metadata" for
the document metadata in the Azure AI Search Index. Adding such a field
is optional when creating an Azure AI Search Index, as other snippets
from `AzureSearch` test for the existence of this field before trying to
access it. Furthermore, the metadata field name shouldn't be hardcoded
as "metadata" and use the `FIELDS_METADATA` variable that defines this
field name instead. In the current implementation, any index without a
metadata field named "metadata" will yield an error if a semantic answer
is returned by the search in
`semantic_hybrid_search_with_score_and_rerank`.

- **Issue:** langchain-ai#18731

- **Prior fix to this bug:** This bug was fixed in this PR
langchain-ai#15642 by adding a check
for the existence of the metadata field named `FIELDS_METADATA` and
retrieving a value for the key called "key" in that metadata if it
exists. If the field named `FIELDS_METADATA` was not present, an empty
string was returned. This fix was removed in this PR
langchain-ai#15659 (see
langchain-ai@ed1ffca).
@lz-chen: could you confirm this wasn't intentional? 

- **New fix to this bug:** I believe there was an oversight in the logic
of the fix from
[langchain-ai#1564](langchain-ai#15642) which I
explain below.
The `semantic_hybrid_search_with_score_and_rerank` method creates a
dictionary `semantic_answers_dict` with semantic answers returned by the
search as follows.

https://github.com/langchain-ai/langchain/blob/5c2f7e6b2b474248af63a5f0f726b1414c5467c8/libs/community/langchain_community/vectorstores/azuresearch.py#L574-L581
The keys in this dictionary are the unique document ids in the index, if
I understand the [documentation of semantic
answers](https://learn.microsoft.com/en-us/azure/search/semantic-answers)
in Azure AI Search correctly. When the method transforms a search result
into a `Document` object, an "answer" key is added to the document's
metadata. The value for this "answer" key should be the semantic answer
returned by the search from this document, if such an answer is
returned. The match between a `Document` object and the semantic answers
returned by the search should be done through the unique document id,
which is used as a key for the `semantic_answers_dict` dictionary. This
id is defined in the search result's field named `FIELDS_ID`. I added a
check to avoid any error in case no field named `FIELDS_ID` exists in a
search result (which shouldn't happen in theory).
A benefit of this approach is that this fix should work whether or not
the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix?
@raunakshrivastava7 do you agree with the fix?

Thanks for the help!
chrispy-snps pushed a commit to chrispy-snps/langchain that referenced this pull request Mar 30, 2024
…langchain-ai#18938)

- **Description:** The `semantic_hybrid_search_with_score_and_rerank`
method of `AzureSearch` contains a hardcoded field name "metadata" for
the document metadata in the Azure AI Search Index. Adding such a field
is optional when creating an Azure AI Search Index, as other snippets
from `AzureSearch` test for the existence of this field before trying to
access it. Furthermore, the metadata field name shouldn't be hardcoded
as "metadata" and use the `FIELDS_METADATA` variable that defines this
field name instead. In the current implementation, any index without a
metadata field named "metadata" will yield an error if a semantic answer
is returned by the search in
`semantic_hybrid_search_with_score_and_rerank`.

- **Issue:** langchain-ai#18731

- **Prior fix to this bug:** This bug was fixed in this PR
langchain-ai#15642 by adding a check
for the existence of the metadata field named `FIELDS_METADATA` and
retrieving a value for the key called "key" in that metadata if it
exists. If the field named `FIELDS_METADATA` was not present, an empty
string was returned. This fix was removed in this PR
langchain-ai#15659 (see
langchain-ai@ed1ffca).
@lz-chen: could you confirm this wasn't intentional? 

- **New fix to this bug:** I believe there was an oversight in the logic
of the fix from
[langchain-ai#1564](langchain-ai#15642) which I
explain below.
The `semantic_hybrid_search_with_score_and_rerank` method creates a
dictionary `semantic_answers_dict` with semantic answers returned by the
search as follows.

https://github.com/langchain-ai/langchain/blob/5c2f7e6b2b474248af63a5f0f726b1414c5467c8/libs/community/langchain_community/vectorstores/azuresearch.py#L574-L581
The keys in this dictionary are the unique document ids in the index, if
I understand the [documentation of semantic
answers](https://learn.microsoft.com/en-us/azure/search/semantic-answers)
in Azure AI Search correctly. When the method transforms a search result
into a `Document` object, an "answer" key is added to the document's
metadata. The value for this "answer" key should be the semantic answer
returned by the search from this document, if such an answer is
returned. The match between a `Document` object and the semantic answers
returned by the search should be done through the unique document id,
which is used as a key for the `semantic_answers_dict` dictionary. This
id is defined in the search result's field named `FIELDS_ID`. I added a
check to avoid any error in case no field named `FIELDS_ID` exists in a
search result (which shouldn't happen in theory).
A benefit of this approach is that this fix should work whether or not
the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix?
@raunakshrivastava7 do you agree with the fix?

Thanks for the help!
hinthornw pushed a commit that referenced this pull request Apr 26, 2024
…#18938)

- **Description:** The `semantic_hybrid_search_with_score_and_rerank`
method of `AzureSearch` contains a hardcoded field name "metadata" for
the document metadata in the Azure AI Search Index. Adding such a field
is optional when creating an Azure AI Search Index, as other snippets
from `AzureSearch` test for the existence of this field before trying to
access it. Furthermore, the metadata field name shouldn't be hardcoded
as "metadata" and use the `FIELDS_METADATA` variable that defines this
field name instead. In the current implementation, any index without a
metadata field named "metadata" will yield an error if a semantic answer
is returned by the search in
`semantic_hybrid_search_with_score_and_rerank`.

- **Issue:** #18731

- **Prior fix to this bug:** This bug was fixed in this PR
#15642 by adding a check
for the existence of the metadata field named `FIELDS_METADATA` and
retrieving a value for the key called "key" in that metadata if it
exists. If the field named `FIELDS_METADATA` was not present, an empty
string was returned. This fix was removed in this PR
#15659 (see
ed1ffca).
@lz-chen: could you confirm this wasn't intentional? 

- **New fix to this bug:** I believe there was an oversight in the logic
of the fix from
[#1564](#15642) which I
explain below.
The `semantic_hybrid_search_with_score_and_rerank` method creates a
dictionary `semantic_answers_dict` with semantic answers returned by the
search as follows.

https://github.com/langchain-ai/langchain/blob/5c2f7e6b2b474248af63a5f0f726b1414c5467c8/libs/community/langchain_community/vectorstores/azuresearch.py#L574-L581
The keys in this dictionary are the unique document ids in the index, if
I understand the [documentation of semantic
answers](https://learn.microsoft.com/en-us/azure/search/semantic-answers)
in Azure AI Search correctly. When the method transforms a search result
into a `Document` object, an "answer" key is added to the document's
metadata. The value for this "answer" key should be the semantic answer
returned by the search from this document, if such an answer is
returned. The match between a `Document` object and the semantic answers
returned by the search should be done through the unique document id,
which is used as a key for the `semantic_answers_dict` dictionary. This
id is defined in the search result's field named `FIELDS_ID`. I added a
check to avoid any error in case no field named `FIELDS_ID` exists in a
search result (which shouldn't happen in theory).
A benefit of this approach is that this fix should work whether or not
the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix?
@raunakshrivastava7 do you agree with the fix?

Thanks for the help!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants