
feat: metadata extractor based on a LLM #92

Merged: 56 commits merged into main from add-metadata-extractor on Sep 30, 2024.
The diff below shows the changes from 8 of the 56 commits.

Commits
c0d128c
initial import
davidsbatista Sep 13, 2024
83ee863
adding tests
davidsbatista Sep 13, 2024
8687071
adding docstrings
davidsbatista Sep 13, 2024
0c25951
handlint liting
davidsbatista Sep 13, 2024
2569584
fixing tests
davidsbatista Sep 13, 2024
598904d
improving live run test
davidsbatista Sep 13, 2024
eb0c893
fixing docstring
davidsbatista Sep 13, 2024
435260f
fixing tests
davidsbatista Sep 13, 2024
77a5808
fixing tests
davidsbatista Sep 13, 2024
c239c6c
fixing tests
davidsbatista Sep 13, 2024
35b948d
fixing tests
davidsbatista Sep 13, 2024
c561664
PR reviews/comments
davidsbatista Sep 17, 2024
110ab02
linting
davidsbatista Sep 17, 2024
1615139
fixing some tests
davidsbatista Sep 17, 2024
c7bf4c8
using renamed util function
davidsbatista Sep 17, 2024
a03b0cc
adding dependencies for tests
davidsbatista Sep 18, 2024
71655c9
fixing generators dependencies
davidsbatista Sep 18, 2024
117f4fb
fixing types
davidsbatista Sep 18, 2024
8150f7b
reverting function name until PR is merged on haystack core
davidsbatista Sep 18, 2024
3d64426
reverting function name until PR is merged on haystack core
davidsbatista Sep 18, 2024
c221fe4
fixing serialization tests
davidsbatista Sep 18, 2024
504e832
adding pydocs
davidsbatista Sep 18, 2024
83345ce
Update docs/pydoc/config/extractors_api.yml
davidsbatista Sep 18, 2024
bae4b32
Update docs/pydoc/config/extractors_api.yml
davidsbatista Sep 18, 2024
e4fbd0c
refactoring handling the supported LLMs
davidsbatista Sep 20, 2024
30d2879
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 20, 2024
5241057
missing comma in init
davidsbatista Sep 20, 2024
80bb0b7
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 24, 2024
db3ee37
fixing README
davidsbatista Sep 24, 2024
d6deda3
fixing README
davidsbatista Sep 24, 2024
4f42b33
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 24, 2024
502a45c
chaning sede approach, saving all the related LLM params
davidsbatista Sep 25, 2024
0ed8692
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 25, 2024
40f784d
reverting example notebooks
davidsbatista Sep 25, 2024
cc14b44
forcing OpenAI model version in tests
davidsbatista Sep 25, 2024
185fa90
disabling too-many-arguments for class
davidsbatista Sep 25, 2024
f6679f7
Update haystack_experimental/components/extractors/llm_metadata_extra…
davidsbatista Sep 26, 2024
99e9a9e
adding check prompt to init
davidsbatista Sep 26, 2024
74ff3b6
updating tests
davidsbatista Sep 26, 2024
2e74dc4
Update README.md
davidsbatista Sep 26, 2024
ba51ca1
Update haystack_experimental/components/extractors/llm_metadata_extra…
davidsbatista Sep 26, 2024
3e4b28e
Update haystack_experimental/components/extractors/llm_metadata_extra…
davidsbatista Sep 26, 2024
237eb72
Update haystack_experimental/components/extractors/llm_metadata_extra…
davidsbatista Sep 26, 2024
13030b7
fixes
davidsbatista Sep 26, 2024
46af362
fixing linted files
davidsbatista Sep 26, 2024
a826f11
fixing
davidsbatista Sep 26, 2024
b999b5b
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 26, 2024
423f489
removing tuples from the output two aligned lists
davidsbatista Sep 27, 2024
194a3fd
Merge branch 'main' into add-metadata-extractor
davidsbatista Sep 27, 2024
0da5082
removed unused import
davidsbatista Sep 27, 2024
a9fa803
chaning errors to a dictionary
davidsbatista Sep 27, 2024
1973bd3
...
davidsbatista Sep 27, 2024
359e1aa
fixing LLMProvider
davidsbatista Sep 27, 2024
693a09c
Update haystack_experimental/components/extractors/llm_metadata_extra…
davidsbatista Sep 30, 2024
8cfb58f
more fixes
davidsbatista Sep 30, 2024
bf6adf5
fixing linting issue
davidsbatista Sep 30, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -48,7 +48,7 @@ The latest version of the package contains the following experiments:
| [`ChatMessageRetriever`][6] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss](https://github.com/deepset-ai/haystack-experimental/discussions/75) |
| [`InMemoryChatMessageStore`][7] | Memory Store | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss](https://github.com/deepset-ai/haystack-experimental/discussions/75) |
| [`Auto-Merging Retriever`][8] & [`HierarchicalDocumentSplitter`][9]| Document Splitting & Retrieval Technique | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/auto_merging_retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss](https://github.com/deepset-ai/haystack-experimental/discussions/78) |
-| [`LLMetadataExtractor`][13] | Metadata extraction with LLM | Dezember 2025 | None | | |
+| [`LLMetadataExtractor`][13] | Metadata extraction with LLM | December 2024 | None | | |

[1]: https://github.com/deepset-ai/haystack-experimental/tree/main/haystack_experimental/evaluation/harness
[2]: https://github.com/deepset-ai/haystack-experimental/tree/main/haystack_experimental/components/tools/openai
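For orientation, here is a minimal usage sketch of the component this PR adds. It is assembled from names visible in the diff below (`prompt`, `expected_keys`, `input_text`, `generator_api`, and the `documents`/`errors` outputs); the import path, prompt text, and document content are illustrative assumptions, not taken from a release:

```python
from haystack import Document
from haystack_experimental.components.extractors import LLMMetadataExtractor, LLMProvider

# The prompt must contain the variable named by `input_text`, otherwise
# __init__ raises a ValueError (see _check_prompt in the diff below).
NER_PROMPT = 'Extract the named entities as JSON with the key "entities". Text: {{ input_text }}'

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    expected_keys=["entities"],        # keys the LLM's JSON answer must contain
    input_text="input_text",           # prompt variable filled with each document's content
    generator_api=LLMProvider.OPENAI,  # needs OPENAI_API_KEY set in the environment
)

docs = [Document(content="deepset was founded in 2018 in Berlin.")]
result = extractor.run(documents=docs)

# "documents" carries the enriched documents; "errors" pairs document IDs
# with an error message (or None on success), per the docstring in this PR.
print(result["documents"][0].meta)
print(result["errors"])
```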
haystack_experimental/components/extractors/llm_metadata_extractor.py

@@ -3,12 +3,10 @@
 # SPDX-License-Identifier: Apache-2.0

 import json
-import logging
 from enum import Enum
 from typing import Any, Dict, List, Optional, Tuple, Union
-from warnings import warn

-from haystack import Document, component, default_from_dict, default_to_dict
+from haystack import Document, component, default_from_dict, default_to_dict, logging
 from haystack.components.builders import PromptBuilder
 from haystack.components.generators import AzureOpenAIGenerator, OpenAIGenerator
 from haystack.lazy_imports import LazyImport
@@ -49,18 +47,6 @@ def from_str(string: str) -> "LLMProvider":
             raise ValueError(msg)
         return provider

-    @classmethod
-    def from_dict(cls, data: str) -> "LLMProvider":
-        """
-        Deserializes the component from a dictionary.
-
-        :param data:
-            Dictionary with serialized data.
-        :returns:
-            An instance of the component.
-        """
-        return cls.from_str(data)


 @component
 class LLMMetadataExtractor:
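The removed `from_dict` was a thin alias for `from_str`, which remains the single entry point for resolving provider names. A sketch of the expected behaviour (the exact accepted strings and the import path are assumptions):

```python
from haystack_experimental.components.extractors import LLMProvider

provider = LLMProvider.from_str("openai")  # presumably resolves to LLMProvider.OPENAI
assert provider == LLMProvider.OPENAI

# Unknown names raise a ValueError, per the `raise ValueError(msg)` branch above.
try:
    LLMProvider.from_str("no-such-provider")
except ValueError as e:
    print(e)
```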
@@ -157,14 +143,14 @@ def __init__( # pylint: disable=R0917
         self.generator_api = generator_api
         self.generator_api_params = generator_api_params or {}
         self.llm_provider = self._init_generator(generator_api, self.generator_api_params)
+        self._check_prompt()

     def _check_prompt(self):
         if self.input_text not in self.prompt:
-            raise ValueError(f"{self.input_text} must be in the prompt.")
+            raise ValueError(f"Input text '{self.input_text}' must be in the prompt.")

     @staticmethod
-    def _init_generator(generator_api: LLMProvider, generator_api_params: Optional[Dict[str, Any]]):
+    def _init_generator(
+        generator_api: LLMProvider,
+        generator_api_params: Optional[Dict[str, Any]]
+    ) -> Union[OpenAIGenerator, AzureOpenAIGenerator, AmazonBedrockGenerator, VertexAIGeminiGenerator]:
         """
         Initialize the chat generator based on the specified API provider and parameters.
         """
@@ -200,18 +186,16 @@ def is_valid_json_and_has_expected_keys(self, expected: List[str], received: str
         try:
             parsed_output = json.loads(received)
         except json.JSONDecodeError:
-            msg = "Response from LLM evaluator is not a valid JSON."
+            msg = "Response from LLM is not a valid JSON."
             if self.raise_on_failure:
                 raise ValueError(msg)
-            warn(msg)
+            logger.warning(msg)
             return False

         if not all(output in parsed_output for output in expected):
-            msg = f"Expected response from LLM evaluator to be JSON with keys {expected}, got {received}."
+            msg = f"Expected response from LLM to be a JSON with keys {expected}, got {received}."
             if self.raise_on_failure:
                 raise ValueError(msg)
-            warn(msg)
+            logger.warning(msg)
             return False
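With the switch to `logger.warning`, validation failures stay non-fatal by default. A sketch of the expected behaviour with the default `raise_on_failure=False`, using constructor arguments taken from the tests below (import path assumed, and constructing the extractor needs OPENAI_API_KEY set):

```python
from haystack_experimental.components.extractors import LLMMetadataExtractor, LLMProvider

extractor = LLMMetadataExtractor(
    prompt="prompt {{test}}",
    expected_keys=["key1", "key2"],
    input_text="test",
    generator_api=LLMProvider.OPENAI,
)

# Well-formed JSON containing the expected keys -> True
assert extractor.is_valid_json_and_has_expected_keys(
    expected=["key1", "key2"], received='{"key1": 1, "key2": 2}'
)

# Malformed JSON (or missing keys) -> a warning is logged and False is
# returned instead of raising, because raise_on_failure defaults to False.
assert not extractor.is_valid_json_and_has_expected_keys(
    expected=["key1", "key2"], received="not json"
)
```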

@@ -257,13 +241,15 @@ def from_dict(cls, data: Dict[str, Any]) -> "LLMMetadataExtractor":


     @component.output_types(documents=List[Document], errors=List[Tuple[str,Any]])
-    def run(self, documents: List[Document]) -> Dict[str, Union[List[Document], List[Tuple[str, Any]]]]:
+    def run(self, documents: List[Document]) -> Dict[str, Any]:
         """
         Extract metadata from documents using a Language Model.

         :param documents: List of documents to extract metadata from.
         :returns:
-            A dictionary with the key "documents_meta" containing the documents with extracted metadata.
+            A dictionary with the keys:
+            - "documents": List of documents with extracted metadata.
+            - "errors": List of tuples with document ID and error message or None if successful.
         """
         errors = []
         for document in documents:
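Continuing the usage sketch from the README section above, downstream code can consume both outputs. The tuple shape follows the docstring in this hunk; whether successful documents also appear in "errors" with a None message is an assumption based on that wording:

```python
result = extractor.run(documents=docs)  # extractor and docs as in the earlier sketch

for doc in result["documents"]:
    print(doc.id, doc.meta)  # meta now includes the LLM-extracted keys

for doc_id, error in result["errors"]:
    if error is not None:
        print(f"Extraction failed for document {doc_id}: {error}")
```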
7 changes: 5 additions & 2 deletions test/components/extractors/test_llm_metadata_extractor.py
@@ -11,7 +11,8 @@

 class TestLLMMetadataExtractor:

-    def test_init_default(self):
+    def test_init_default(self, monkeypatch):
+        monkeypatch.setenv("OPENAI_API_KEY", "test-api-key")
         extractor = LLMMetadataExtractor(
             prompt="prompt {{test}}",
             expected_keys=["key1", "key2"],
@@ -24,7 +25,8 @@ def test_init_default(self):
         assert extractor.raise_on_failure is False
         assert extractor.input_text == "test"

-    def test_init_with_parameters(self):
+    def test_init_with_parameters(self, monkeypatch):
+        monkeypatch.setenv("OPENAI_API_KEY", "test-api-key")
         extractor = LLMMetadataExtractor(
             prompt="prompt {{test}}",
             expected_keys=["key1", "key2"],
@@ -101,6 +103,7 @@ def test_from_dict(self, monkeypatch):
         assert extractor.prompt == "some prompt that was used with the LLM {{test}}"
         assert extractor.generator_api == LLMProvider.OPENAI

+    @pytest.mark.integration
     @pytest.mark.skipif(
         not os.environ.get("OPENAI_API_KEY", None),
         reason="Export an env var called OPENAI_API_KEY containing the OpenAI API key to run this test.",
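The switch to `monkeypatch.setenv` lets the constructor tests run without a real key: pytest sets the variable for the duration of the test and restores the environment on teardown. The same pattern in isolation, with a hypothetical helper standing in for any code that reads the key:

```python
import os


def read_openai_key() -> str:
    # Hypothetical helper for illustration; any code reading the env var works.
    return os.environ["OPENAI_API_KEY"]


def test_reads_key_from_env(monkeypatch):
    # setenv is scoped to this test; pytest undoes the change afterwards.
    monkeypatch.setenv("OPENAI_API_KEY", "test-api-key")
    assert read_openai_key() == "test-api-key"
```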