Azure AI Search, metadata field is required and hardcoded in langchain community #18731

Closed
5 tasks done
levalencia opened this issue Mar 7, 2024 · 6 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: openai Primarily related to OpenAI integrations Ɑ: vector store Related to vector store module

Comments

@levalencia
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Custom Retriever Code

# Code from: https://redis.com/blog/build-ecommerce-chatbot-with-redis/
class UserRetriever(BaseRetriever):

    """
    UserRetriever extends BaseRetriever and is designed for retrieving relevant documents
    based on a user query using hybrid similarity search with a VectorStore.

    Attributes:
    - vectorstore (VectorStore): The VectorStore instance used for similarity search.
    - username (str): The username associated with the documents, used for personalized retrieval.

    Methods:
    - clean_metadata(self, doc): Cleans the metadata of a document, extracting relevant information for display.
    - get_relevant_documents(self, query): Retrieves relevant documents based on a user query using hybrid similarity search.

    Example:
    retriever = UserRetriever(vectorstore=vector_store, username="john_doe")
    relevant_docs = retriever.get_relevant_documents("How does photosynthesis work?")
    for doc in relevant_docs:
        print(doc.metadata["Title"], doc.page_content)
    """

    vectorstore: VectorStore
    username: str

    def clean_metadata(self, doc):
        """
        Cleans the metadata of a document.

        Parameters:
            doc (object): The document object.

        Returns:
            dict: A dictionary containing the cleaned metadata.

        """
        metadata = doc.metadata

        return {
            "file_id": metadata["title"], 
            "source": metadata["title"] + "_page=" + str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "page_number": str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "document_title": metadata["document_title_result"] 
        }

               
    def get_relevant_documents(self, query):
        """
        Retrieves relevant documents based on a given query.

        Args:
            query (str): The query to search for relevant documents.

        Returns:
            list: A list of relevant documents.

        """
        docs = []
        is_match_filter = ""
        load_dotenv()
        admins = os.getenv('ADMINS', '')
        admins_list = admins.split(',')
        is_admin = self.username.split('@')[0] in admins_list

        os.environ["AZURESEARCH_FIELDS_ID"] = "chunk_id"
        os.environ["AZURESEARCH_FIELDS_CONTENT"] = "chunk"
        os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
        #os.environ["AZURESEARCH_FIELDS_TAG"] = "metadata"

        if not is_admin:
            is_match_filter = f"search.ismatch('{self.username.split('@')[0]}', 'usernames_result')"

        for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):
            cleaned_metadata = self.clean_metadata(doc)
            docs.append(Document(
                page_content=doc.page_content,
                metadata=cleaned_metadata))
            
        print("\n\n----------------DOCUMENTS RETRIEVED------------------\n\n", docs)

        return docs

Setup of the LangChain chain and LLM

        chat = AzureChatOpenAI(
            azure_endpoint=SHD_AZURE_OPENAI_ENDPOINT,
            openai_api_version="2023-03-15-preview",
            deployment_name=POL_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
            openai_api_key=SHD_OPENAI_KEY,
            openai_api_type="Azure",
            model_name=POL_OPENAI_GPT_MODEL_NAME,
            streaming=True,
            callbacks=[ChainStreamHandler(g)],  # Set ChainStreamHandler as callback
            temperature=0)
        
        # Define system and human message prompts
        messages = [
            SystemMessagePromptTemplate.from_template(ANSWER_PROMPT),
            HumanMessagePromptTemplate.from_template("{question} Please answer in html format"),
        ]
        
        # Set up embeddings, vector store, chat prompt, retriever, memory, and chain
        embeddings = setup_embeddings()
        vector_store = setup_vector_store(embeddings)
        chat_prompt = ChatPromptTemplate.from_messages(messages)
        retriever = UserRetriever(vectorstore=vector_store, username=username)
        memory = setup_memory()
        #memory.save_context(chat_history)
        chain = ConversationalRetrievalChain.from_llm(chat, 
            retriever=retriever, 
            memory=memory, 
            verbose=False, 
            combine_docs_chain_kwargs={
                "prompt": chat_prompt, 
                "document_prompt": PromptTemplate(
                    template=DOCUMENT_PROMPT,
                    input_variables=["page_content", "source"]
                )
            }
        )

My fields

(screenshot of the index fields, not reproduced here)

Error Message and Stack Trace (if applicable)

Exception has occurred: KeyError
'metadata'

The error is thrown on this line:

for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):

When I dug into the LangChain code, I found this:

docs = [
            (
                Document(
                    page_content=result.pop(FIELDS_CONTENT),
                    metadata={
                        **(
                            json.loads(result[FIELDS_METADATA])
                            if FIELDS_METADATA in result
                            else {
                                k: v
                                for k, v in result.items()
                                if k != FIELDS_CONTENT_VECTOR
                            }
                        ),
                        **{
                            "captions": {
                                "text": result.get("@search.captions", [{}])[0].text,
                                "highlights": result.get("@search.captions", [{}])[
                                    0
                                ].highlights,
                            }
                            if result.get("@search.captions")
                            else {},
                            "answers": semantic_answers_dict.get(
                                json.loads(result["metadata"]).get("key"),
                                "",
                            ),
                        },
                    },
                ),

As you can see in the last line, it's trying to find a metadata field in the search results, which we don't have because our index is customized with our own fields.

I am blaming this line:

json.loads(result["metadata"]).get("key"),

@Skar0, I'm not sure whether this is really a bug or whether I missed something in the documentation.
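
The failure can be reproduced without Azure. The sketch below is a minimal stand-in, not the library code: `result` and `semantic_answers_dict` are hypothetical values shaped like a search hit from an index with custom fields and no `metadata` field, and `lookup_answer_hardcoded` is an illustrative helper that copies the shape of the blamed expression.

```python
import json

# Hypothetical search hit from an index with custom fields and no "metadata" field.
result = {"chunk_id": "doc1_pages_0", "chunk": "some text", "title": "doc1"}
# Hypothetical semantic answers, keyed by document key.
semantic_answers_dict = {"doc1_pages_0": {"text": "an answer", "highlights": ""}}


def lookup_answer_hardcoded(result, answers):
    # Same shape as the blamed expression: result["metadata"] is indexed
    # unconditionally, so a hit without that field raises KeyError.
    return answers.get(json.loads(result["metadata"]).get("key"), "")


try:
    lookup_answer_hardcoded(result, semantic_answers_dict)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('metadata')
```

The first half of the quoted snippet guards the access with `if FIELDS_METADATA in result`, while the `"answers"` lookup does not, which would explain why only that expression fails.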

Description

I am trying to use LangChain with Azure OpenAI and Azure AI Search as the vector store, together with a custom retriever. I don't have a metadata field in my index.

This was working in a previous project with azure-search-documents==11.4.b09,
but in a new project I am trying azure-search-documents==11.4.0.

System Info

langchain==0.1.7
langchain-community==0.0.20
langchain-core==0.1.23
langchain-openai==0.0.6
langchainhub==0.1.14

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: openai Primarily related to OpenAI integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 7, 2024
@Skar0
Contributor

Skar0 commented Mar 11, 2024

Hello @levalencia 😃

I have taken a look at the code and did some tests with my own index, and it indeed seems like the error you are encountering is due to the following line.

json.loads(result["metadata"]).get("key"),

I have created a PR #18938 with a bit more context on what the bug is, where it comes from, and how I (hopefully) fixed it. It would be nice if you can test and confirm!

@thelazydogsback

thelazydogsback commented Mar 13, 2024

I'm running into a similar issue as well.
I also have multiple metadata fields in the index - langchain should not make the assumption that there is only one metadata field, nor hard-code any names.
I expect something like this to work if all of the following fields are in my index:

Document( page_content = "this is the text",
    Title = "DocTitle",
    Category = "Foo",
    MoreMeta1 = {"x": 1, "y": 2},
    MoreMeta2 = {"z": 1, "q": 2},
)

However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:

The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type

Should I open a new related issue for this?

@thelazydogsback

The PR you reference is changing from 'metadata' to FIELDS_ID.
I'm pretty new here, but shouldn't this be FIELDS_TAG?

@Skar0
Contributor

Skar0 commented Mar 13, 2024

> However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:
>
> The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type
>
> Should I open a new related issue for this?

Do you create the index using the AzureSearch object? If so, I think a "metadata" field is created by default in the index definition. You can, however, decide to define an index yourself.

@thelazydogsback

thelazydogsback commented Mar 13, 2024

Thanks for the reply.
No, I create the index in a separate pipeline outside of the python code.
I don't have (nor want) one particular privileged field called "metadata" (nor only one field I can override in an env var) - there are several fields in the index which hold different types of metadata that I'd like to populate and search on separately.
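
On the read side, the fallback branch quoted earlier in this issue already treats every field other than the vector as metadata. A minimal sketch of that idea, independent of LangChain, is shown here. The helper name `fields_to_metadata` and the field names `chunk` and `vector` are assumptions for illustration (they match the env vars set in the original report), and the sketch does not address the `add_texts` / `add_documents` error.

```python
def fields_to_metadata(result, content_field="chunk", vector_field="vector"):
    """Collect every field except the content and the vector as metadata."""
    return {
        key: value
        for key, value in result.items()
        if key not in (content_field, vector_field)
    }


# Hypothetical search hit with several independent metadata fields.
hit = {
    "chunk": "this is the text",
    "vector": [0.1, 0.2],
    "Title": "DocTitle",
    "Category": "Foo",
    "MoreMeta1": {"x": 1, "y": 2},
}

metadata = fields_to_metadata(hit)
print(metadata)
```

With this approach no single field is privileged: each index field keeps its own name in the resulting metadata dictionary.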

@paychex-ssmithrand

Also encountering this issue - and have the same set of requirements as @thelazydogsback

baskaryan pushed a commit that referenced this issue Mar 26, 2024
…#18938)

- **Description:** The `semantic_hybrid_search_with_score_and_rerank`
method of `AzureSearch` contains a hardcoded field name "metadata" for
the document metadata in the Azure AI Search Index. Adding such a field
is optional when creating an Azure AI Search Index, as other snippets
from `AzureSearch` test for the existence of this field before trying to
access it. Furthermore, the metadata field name shouldn't be hardcoded
as "metadata"; the code should use the `FIELDS_METADATA` variable that
defines this field name instead. In the current implementation, any index without a
metadata field named "metadata" will yield an error if a semantic answer
is returned by the search in
`semantic_hybrid_search_with_score_and_rerank`.

- **Issue:** #18731

- **Prior fix to this bug:** This bug was fixed in this PR
#15642 by adding a check
for the existence of the metadata field named `FIELDS_METADATA` and
retrieving a value for the key called "key" in that metadata if it
exists. If the field named `FIELDS_METADATA` was not present, an empty
string was returned. This fix was removed in this PR
#15659 (see
ed1ffca).
@lz-chen: could you confirm this wasn't intentional? 

- **New fix to this bug:** I believe there was an oversight in the logic
of the fix from
[#1564](#15642) which I
explain below.
The `semantic_hybrid_search_with_score_and_rerank` method creates a
dictionary `semantic_answers_dict` with semantic answers returned by the
search as follows.

https://github.com/langchain-ai/langchain/blob/5c2f7e6b2b474248af63a5f0f726b1414c5467c8/libs/community/langchain_community/vectorstores/azuresearch.py#L574-L581
The keys in this dictionary are the unique document ids in the index, if
I understand the [documentation of semantic
answers](https://learn.microsoft.com/en-us/azure/search/semantic-answers)
in Azure AI Search correctly. When the method transforms a search result
into a `Document` object, an "answer" key is added to the document's
metadata. The value for this "answer" key should be the semantic answer
returned by the search from this document, if such an answer is
returned. The match between a `Document` object and the semantic answers
returned by the search should be done through the unique document id,
which is used as a key for the `semantic_answers_dict` dictionary. This
id is defined in the search result's field named `FIELDS_ID`. I added a
check to avoid any error in case no field named `FIELDS_ID` exists in a
search result (which shouldn't happen in theory).
A benefit of this approach is that this fix should work whether or not
the Azure AI Search Index contains a metadata field.

@levalencia could you confirm my analysis and test the fix?
@raunakshrivastava7 do you agree with the fix?

Thanks for the help!
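
The matching logic described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `answer_for_result` is a hypothetical helper, `FIELDS_ID` is set to `"chunk_id"` to match the env var in the original report, and the sample hits are made up.

```python
FIELDS_ID = "chunk_id"  # assumed id field name, as set via AZURESEARCH_FIELDS_ID in the report


def answer_for_result(result, semantic_answers_dict, id_field=FIELDS_ID):
    """Return the semantic answer for a search hit, or "" if there is none."""
    # result.get() guards against a hit that lacks the id field.
    doc_id = result.get(id_field, "")
    return semantic_answers_dict.get(doc_id, "")


# Hypothetical semantic answers, keyed by unique document id.
answers = {"doc1_pages_0": {"text": "an answer", "highlights": ""}}

hit_with_answer = {"chunk_id": "doc1_pages_0", "chunk": "some text"}
hit_without_answer = {"chunk_id": "doc1_pages_1", "chunk": "other text"}
hit_without_id = {"chunk": "no id here"}

print(answer_for_result(hit_with_answer, answers))
print(answer_for_result(hit_without_answer, answers))
print(answer_for_result(hit_without_id, answers))
```

None of the three lookups touches a metadata field, so the sketch behaves the same whether or not the index defines one.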
gkorland pushed a commit to FalkorDB/langchain that referenced this issue Mar 30, 2024
…langchain-ai#18938)

chrispy-snps pushed a commit to chrispy-snps/langchain that referenced this issue Mar 30, 2024
…langchain-ai#18938)

hinthornw pushed a commit that referenced this issue Apr 26, 2024
…#18938)

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jun 24, 2024
@dosubot dosubot bot closed this as not planned Jul 1, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 1, 2024