Azure AI Search, metadata field is required and hardcoded in langchain community #18731

levalencia · 2024-03-07T11:19:46Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Custom Retriever Code

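# Code from: https://redis.com/blog/build-ecommerce-chatbot-with-redis/
import os

from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.vectorstores import VectorStore


class UserRetriever(BaseRetriever):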

    """
    UserRetriever class extends BaseRetriever and is designed for retrieving relevant documents
    based on a user query using hybrid similarity search with a VectorStore.

    Attributes:
    - vectorstore (VectorStore): The VectorStore instance used for similarity search.
    - username (str): The username associated with the documents, used for personalized retrieval.

    Methods:
    - clean_metadata(self, doc): Cleans the metadata of a document, extracting relevant information for display.
    - get_relevant_documents(self, query): Retrieves relevant documents based on a user query using hybrid similarity search.

    Example:
    retriever = UserRetriever(vectorstore=vector_store, username="john_doe")
    relevant_docs = retriever.get_relevant_documents("How does photosynthesis work?")
    for doc in relevant_docs:
        print(doc.metadata["Title"], doc.page_content)
    """

    vectorstore: VectorStore
    username: str

    def clean_metadata(self, doc):
        """
        Cleans the metadata of a document.

        Parameters:
            doc (object): The document object.

        Returns:
            dict: A dictionary containing the cleaned metadata.

        """
        metadata = doc.metadata

        return {
            "file_id": metadata["title"], 
            "source": metadata["title"] + "_page=" + str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "page_number": str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "document_title": metadata["document_title_result"] 
        }

               
    def get_relevant_documents(self, query):
        """
        Retrieves relevant documents based on a given query.

        Args:
            query (str): The query to search for relevant documents.

        Returns:
            list: A list of relevant documents.

        """
        docs = []
        is_match_filter = ""
        load_dotenv()
        admins = os.getenv('ADMINS', '')
        admins_list = admins.split(',')
        is_admin = self.username.split('@')[0] in admins_list

        os.environ["AZURESEARCH_FIELDS_ID"] = "chunk_id"
        os.environ["AZURESEARCH_FIELDS_CONTENT"] = "chunk"
        os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
        # os.environ["AZURESEARCH_FIELDS_TAG"] = "metadata"

        if not is_admin:
            is_match_filter = f"search.ismatch('{self.username.split('@')[0]}', 'usernames_result')"

        for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):
            cleaned_metadata = self.clean_metadata(doc)
            docs.append(Document(
                page_content=doc.page_content,
                metadata=cleaned_metadata))
            
        print("\n\n----------------DOCUMENTS RETRIEVED------------------\n\n", docs)

        return docs
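
The filter logic in `get_relevant_documents` can be checked in isolation, without a search service. This is a minimal sketch, assuming the same comma-separated `ADMINS` convention and the `usernames_result` field used above. `build_user_filter` is a hypothetical helper name, not part of LangChain. Doubling single quotes is the standard OData escape for string literals.

```python
def build_user_filter(username: str, admins: str, field: str = "usernames_result") -> str:
    """Return an OData filter restricting results to `username`, or "" for admins.

    Hypothetical helper for illustration; not part of LangChain.
    """
    short_name = username.split("@")[0]
    admin_names = [name.strip() for name in admins.split(",") if name.strip()]
    if short_name in admin_names:
        # Admins see every document, so no filter is applied.
        return ""
    # OData string literals escape a single quote by doubling it.
    escaped = short_name.replace("'", "''")
    return f"search.ismatch('{escaped}', '{field}')"


print(build_user_filter("john_doe@example.com", "alice,bob"))
print(repr(build_user_filter("alice@example.com", "alice,bob")))
```

Keeping the filter construction in a small function makes it easy to unit test the admin bypass separately from the vector store call.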

Set up the LangChain chain and LLM

        chat = AzureChatOpenAI(
            azure_endpoint=SHD_AZURE_OPENAI_ENDPOINT,
            openai_api_version="2023-03-15-preview",
            deployment_name=POL_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
            openai_api_key=SHD_OPENAI_KEY,
            openai_api_type="Azure",
            model_name=POL_OPENAI_GPT_MODEL_NAME,
            streaming=True,
            callbacks=[ChainStreamHandler(g)],  # Set ChainStreamHandler as callback
            temperature=0)
        
        # Define system and human message prompts
        messages = [
            SystemMessagePromptTemplate.from_template(ANSWER_PROMPT),
            HumanMessagePromptTemplate.from_template("{question} Please answer in html format"),
        ]
        
        # Set up embeddings, vector store, chat prompt, retriever, memory, and chain
        embeddings = setup_embeddings()
        vector_store = setup_vector_store(embeddings)
        chat_prompt = ChatPromptTemplate.from_messages(messages)
        retriever = UserRetriever(vectorstore=vector_store, username=username)
        memory = setup_memory()
        #memory.save_context(chat_history)
        chain = ConversationalRetrievalChain.from_llm(chat, 
            retriever=retriever, 
            memory=memory, 
            verbose=False, 
            combine_docs_chain_kwargs={
                "prompt": chat_prompt, 
                "document_prompt": PromptTemplate(
                    template=DOCUMENT_PROMPT,
                    input_variables=["page_content", "source"]
                )
            }
        )

My fields

Error Message and Stack Trace (if applicable)

Exception has occurred: KeyError
'metadata'

The error is thrown on this line:

for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):

When I dug into the LangChain code, I found this:

docs = [
            (
                Document(
                    page_content=result.pop(FIELDS_CONTENT),
                    metadata={
                        **(
                            json.loads(result[FIELDS_METADATA])
                            if FIELDS_METADATA in result
                            else {
                                k: v
                                for k, v in result.items()
                                if k != FIELDS_CONTENT_VECTOR
                            }
                        ),
                        **{
                            "captions": {
                                "text": result.get("@search.captions", [{}])[0].text,
                                "highlights": result.get("@search.captions", [{}])[
                                    0
                                ].highlights,
                            }
                            if result.get("@search.captions")
                            else {},
                            "answers": semantic_answers_dict.get(
                                json.loads(result["metadata"]).get("key"),
                                "",
                            ),
                        },
                    },
                ),

As you can see in the last lines, it tries to read a metadata field from the search results, which we don't have because our index is customized with our own fields.

I am blaming this line:

langchain/libs/community/langchain_community/vectorstores/azuresearch.py

Line 607 in ced5e7b

json.loads(result["metadata"]).get("key"),

@Skar0, I'm not sure whether this is really a bug or whether I missed something in the documentation.
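
The failure can be reproduced without Azure by treating a search hit as a plain dict. This is a sketch under assumptions: the field names in `custom_result` follow the environment variables shown earlier, `FIELDS_METADATA` is assumed to hold the default name `"metadata"`, and both function names are hypothetical. It contrasts the hardcoded lookup with a guarded one.

```python
import json

FIELDS_METADATA = "metadata"  # assumed default metadata field name


def answer_key_hardcoded(result: dict):
    # Mirrors the failing expression: raises KeyError when the hit has no "metadata" field.
    return json.loads(result["metadata"]).get("key")


def answer_key_guarded(result: dict):
    # Only parse the metadata field when the index actually returns it.
    if FIELDS_METADATA in result:
        return json.loads(result[FIELDS_METADATA]).get("key")
    return None


# A hit from a custom index that has no "metadata" field.
custom_result = {"chunk_id": "doc1_pages_0", "chunk": "some text", "title": "doc1"}

try:
    answer_key_hardcoded(custom_result)
except KeyError as exc:
    print("KeyError:", exc)

print(answer_key_guarded(custom_result))
```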

Description

I am trying to use LangChain with Azure OpenAI, Azure Search as the vector store, and a custom retriever. I don't have a metadata field in my index.

This worked in a previous project with azure-search-documents==11.4.b09, but in the new project I am using azure-search-documents==11.4.0.

System Info

langchain==0.1.7
langchain-community==0.0.20
langchain-core==0.1.23
langchain-openai==0.0.6
langchainhub==0.1.14

Skar0 · 2024-03-11T21:41:56Z

Hello @levalencia 😃

I have taken a look at the code and did some tests with my own index, and it indeed seems like the error you are encountering is due to the following line.

langchain/libs/community/langchain_community/vectorstores/azuresearch.py

Line 607 in ced5e7b

json.loads(result["metadata"]).get("key"),

I have created a PR #18938 with a bit more context on what the bug is, where it comes from, and how I (hopefully) fixed it. It would be nice if you can test and confirm!

thelazydogsback · 2024-03-13T15:57:40Z

I'm running into a similar issue as well.
I also have multiple metadata fields in the index - langchain should not make the assumption that there is only one metadata field, nor hard-code any names.
I expect something like this to work if all of the following fields are in my index:

Document(
    page_content="this is the text",
    Title="DocTitle",
    Category="Foo",
    MoreMeta1={"x": 1, "y": 2},
    MoreMeta2={"z": 1, "q": 2},
)

However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:

The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type

Should I open a new related issue for this?

thelazydogsback · 2024-03-13T16:52:19Z

The PR you reference is changing from 'metadata' to FIELDS_ID.
I'm pretty new here, but shouldn't this be FIELDS_TAG?

Skar0 · 2024-03-13T18:13:50Z

> However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:
> The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type
> Should I open a new related issue for this?

Do you create the index using the AzureSearch object? If so, I think a "metadata" field is created by default in the index definition. You can, however, decide to define the index yourself.

thelazydogsback · 2024-03-13T18:51:43Z

Thanks for the reply.
No, I create the index in a separate pipeline outside of the python code.
I don't have (nor want) one particular privileged field called "metadata" (nor only one field I can override in an env var) - there are several fields in the index which hold different types of metadata that I'd like to populate and search on separately.
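
On the retrieval side, the fallback branch quoted earlier in the thread already treats every non-vector field of a hit as metadata. A sketch of that idea in isolation, using plain dicts instead of the SDK's result objects, follows. `split_result` is a hypothetical helper, and the field names are assumptions taken from the environment variables in the original report and the example fields above.

```python
from typing import Any, Dict, Tuple

CONTENT_FIELD = "chunk"  # assumed content field (AZURESEARCH_FIELDS_CONTENT in the report)
VECTOR_FIELD = "vector"  # assumed vector field (AZURESEARCH_FIELDS_CONTENT_VECTOR in the report)


def split_result(result: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Split a raw search hit into (page_content, metadata).

    Every field except the content, the vector, and the service's own
    "@search.*" annotations is kept as a separate metadata entry.
    """
    metadata = {
        key: value
        for key, value in result.items()
        if key not in (CONTENT_FIELD, VECTOR_FIELD) and not key.startswith("@search.")
    }
    return result[CONTENT_FIELD], metadata


hit = {
    "chunk": "this is the text",
    "vector": [0.1, 0.2],
    "Title": "DocTitle",
    "Category": "Foo",
    "MoreMeta1": {"x": 1, "y": 2},
    "@search.score": 0.9,
}
content, metadata = split_result(hit)
print(content, metadata)
```

This only covers reading from the index. It does not address the `add_texts` error described above, which concerns how documents are written.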

paychex-ssmithrand · 2024-03-25T16:12:50Z

Also encountering this issue - and have the same set of requirements as @thelazydogsback

@lz-chen

…#18938)

- **Description:** The `semantic_hybrid_search_with_score_and_rerank` method of `AzureSearch` contains a hardcoded field name "metadata" for the document metadata in the Azure AI Search Index. Adding such a field is optional when creating an Azure AI Search Index, as other snippets from `AzureSearch` test for the existence of this field before trying to access it. Furthermore, the metadata field name should not be hardcoded as "metadata"; the code should use the `FIELDS_METADATA` variable that defines this field name instead. In the current implementation, any index without a metadata field named "metadata" will yield an error if a semantic answer is returned by the search in `semantic_hybrid_search_with_score_and_rerank`.
- **Issue:** #18731
- **Prior fix to this bug:** This bug was fixed in PR #15642 by adding a check for the existence of the metadata field named `FIELDS_METADATA` and retrieving a value for the key called "key" in that metadata if it exists. If the field named `FIELDS_METADATA` was not present, an empty string was returned. This fix was removed in PR #15659 (see ed1ffca). @lz-chen: could you confirm this wasn't intentional?
- **New fix to this bug:** I believe there was an oversight in the logic of the fix from [#15642](#15642), which I explain below.

The `semantic_hybrid_search_with_score_and_rerank` method creates a dictionary `semantic_answers_dict` with semantic answers returned by the search as follows.

https://github.com/langchain-ai/langchain/blob/5c2f7e6b2b474248af63a5f0f726b1414c5467c8/libs/community/langchain_community/vectorstores/azuresearch.py#L574-L581

The keys in this dictionary are the unique document ids in the index, if I understand the [documentation of semantic answers](https://learn.microsoft.com/en-us/azure/search/semantic-answers) in Azure AI Search correctly. When the method transforms a search result into a `Document` object, an "answer" key is added to the document's metadata. The value for this "answer" key should be the semantic answer returned by the search from this document, if such an answer is returned.

The match between a `Document` object and the semantic answers returned by the search should be done through the unique document id, which is used as a key for the `semantic_answers_dict` dictionary. This id is defined in the search result's field named `FIELDS_ID`. I added a check to avoid any error in case no field named `FIELDS_ID` exists in a search result (which shouldn't happen in theory). A benefit of this approach is that this fix should work whether or not the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix? @raunakshrivastava7 do you agree with the fix? Thanks for the help!
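
The id-based matching described above can be sketched with plain dicts standing in for the SDK's result and answer objects. This is an illustration of the idea, not the merged code: `attach_answers` is a hypothetical helper, the `"key"` and `"text"` entries are assumed shapes for a semantic answer, and `FIELDS_ID` is assumed to be `"chunk_id"` as in the original report.

```python
from typing import Any, Dict, List

FIELDS_ID = "chunk_id"  # assumption: the index key field, as configured in the original report


def attach_answers(
    results: List[Dict[str, Any]], semantic_answers: List[Dict[str, str]]
) -> List[Dict[str, Any]]:
    """Attach each semantic answer to the hit whose id matches the answer's key."""
    # Index the answers by the document key they were extracted from.
    answers_by_key = {answer["key"]: answer["text"] for answer in semantic_answers}
    return [
        # A hit without an id field, or without a matching answer, gets "".
        {**hit, "answers": answers_by_key.get(hit.get(FIELDS_ID, ""), "")}
        for hit in results
    ]


hits = [
    {"chunk_id": "doc1_pages_0", "chunk": "first chunk"},
    {"chunk_id": "doc1_pages_1", "chunk": "second chunk"},
    {"chunk": "hit without an id field"},
]
answers = [{"key": "doc1_pages_1", "text": "an extracted answer"}]

for hit in attach_answers(hits, answers):
    print(hit.get(FIELDS_ID), repr(hit["answers"]))
```

Because the lookup never touches a metadata field, the same code path works for indexes with or without one.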

dosubot bot added the labels "Ɑ: vector store", "🔌: openai", and "🤖:bug" on Mar 7, 2024

Skar0 mentioned this issue on Mar 11, 2024

community: fix semantic answer bug in AzureSearch vector store #18938 (Merged)

dosubot bot added the "stale" label on Jun 24, 2024

dosubot bot closed this as not planned on Jul 1, 2024

dosubot bot removed the "stale" label on Jul 1, 2024
