feat: metadata extractor based on a LLM #92

davidsbatista · 2024-09-13T17:43:49Z

Related Issues

fixes #5700 and #5702

Proposed Changes:

This PR adds a new component that uses an LLM and a prompt to extract metadata from a Document. It is meant to be used within a pipeline during indexing. This was a request from @sjrl: have a look, you can pull this branch and test it.
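
For readers who have not pulled the branch, the general idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `Doc`, `extract_metadata`, `PROMPT`, and `fake_llm` are hypothetical stand-ins, and the sketch assumes the LLM replies with a JSON object.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline Document (not the Haystack class)."""
    content: str
    meta: Dict[str, object] = field(default_factory=dict)


# Hypothetical prompt template; "{content}" is replaced by the document text.
PROMPT = "Extract the title of the text below and reply with a JSON object.\nText:\n{content}"


def extract_metadata(
    docs: List[Doc],
    prompt: str,
    llm: Callable[[str], str],
    expected_keys: List[str],
) -> Tuple[List[Doc], List[Doc]]:
    """Ask the LLM for metadata per document and merge the expected keys into doc.meta.

    Returns (enriched documents, documents whose reply could not be parsed).
    """
    enriched: List[Doc] = []
    failed: List[Doc] = []
    for doc in docs:
        reply = llm(prompt.replace("{content}", doc.content))
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            failed.append(doc)
            continue
        if not isinstance(parsed, dict):
            failed.append(doc)
            continue
        # Keep only the keys the caller asked for.
        doc.meta.update({k: parsed[k] for k in expected_keys if k in parsed})
        enriched.append(doc)
    return enriched, failed


def fake_llm(prompt: str) -> str:
    """Pretend model for illustration: returns the first line of the text as the title."""
    text = prompt.split("Text:\n", 1)[1]
    lines = text.splitlines()
    return json.dumps({"title": lines[0] if lines else ""})


docs = [Doc("Annual Report 2023\nRevenue grew.")]
enriched, failed = extract_metadata(docs, PROMPT, fake_llm, ["title"])
print(enriched[0].meta)
```

In a real pipeline the `llm` callable would wrap an actual generator; the sketch only shows the prompt-in, metadata-out shape of the component.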

How did you test it?

added unit tests and integration tests
manual verification

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-09-13T19:51:24Z

Pull Request Test Coverage Report for Build 11104881204

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-1.7%) to 88.558%

Totals
Change from base Build 11070052105:	-1.7%
Covered Lines:	1486
Relevant Lines:	1678

💛 - Coveralls

julian-risch · 2024-09-17T07:58:05Z

@davidsbatista We should list the new component in the catalog as part of this PR: https://github.com/deepset-ai/haystack-experimental?tab=readme-ov-file#experiments-catalog

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl · 2024-09-17T08:45:58Z

Hey @davidsbatista thanks for your work on this!

One idea that came to mind, and that this component doesn't quite solve yet (at least not elegantly), is how to extract global information about a Document. For example, let's say I want to extract the title from a really long PDF (e.g. more than 200 pages) and store this title in all future chunks I create from this PDF file. With the way this component works I would need to do something like:

File -> PDFConverter -> LLMMetadataExtractor -> DocumentSplitter -> DocumentWriter

But since the PDF is so long, there is a good chance that it will exceed the context window of the LLM. In reality, all I need to do is send the first few pages of the file to find the title, so this version also feels inefficient in terms of latency and cost.

So I was wondering if we could brainstorm a way to handle this type of use case as well? Maybe we could simply allow a user to specify a page range at init time, so we don't need to send the whole Document to the LLM?
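
One way to picture the page-range idea, purely as a rough sketch: cut the text down to the requested pages before it reaches the LLM. The helper name `select_pages` and the `PAGE_BREAK` constant are hypothetical, and the sketch assumes the converter marks page boundaries with form-feed characters, which may not hold for every converter.

```python
PAGE_BREAK = "\f"  # assumption: pages are separated by form feeds in the converted text


def select_pages(text: str, first: int, last: int) -> str:
    """Return only pages first..last (1-based, inclusive) of a page-delimited text."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    pages = text.split(PAGE_BREAK)
    # Slicing tolerates a `last` beyond the end of the document.
    return PAGE_BREAK.join(pages[first - 1:last])


long_text = PAGE_BREAK.join(["Title page", "Contents", "Chapter 1", "Chapter 2"])
# Only the first two pages would be sent to the LLM to find the title.
snippet = select_pages(long_text, 1, 2)
print(repr(snippet))
```

The title extracted from the snippet could then be copied into the metadata of every chunk produced by the splitter, so the full 200-page text never has to fit in the context window.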

davidsbatista · 2024-09-26T08:40:29Z

@shadeMe we would be happy to merge it. Do you still want to review it?

sjrl

Thanks for the work on this!

README.md

haystack_experimental/components/extractors/llm_metadata_extractor.py

test/components/extractors/test_llm_metadata_extractor.py

shadeMe

Looks like you've missed a comment above.

haystack_experimental/components/extractors/llm_metadata_extractor.py

davidsbatista added 11 commits September 13, 2024 17:41

initial import

c0d128c

adding tests

83ee863

adding docstrings

8687071

handlint liting

0c25951

fixing tests

2569584

improving live run test

598904d

fixing docstring

eb0c893

fixing tests

435260f

fixing tests

77a5808

fixing tests

c239c6c

fixing tests

35b948d

davidsbatista marked this pull request as ready for review September 13, 2024 19:51

davidsbatista requested a review from a team as a code owner September 13, 2024 19:52

davidsbatista requested review from anakin87, shadeMe and sjrl and removed request for a team September 13, 2024 19:52

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

sjrl reviewed Sep 17, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

PR reviews/comments

c561664

davidsbatista requested a review from a team as a code owner September 17, 2024 15:39

davidsbatista requested review from dfokina and removed request for a team September 17, 2024 15:39

davidsbatista requested review from tstadel and removed request for anakin87 and tstadel September 26, 2024 08:35

sjrl approved these changes Sep 26, 2024

shadeMe suggested changes Sep 26, 2024

davidsbatista and others added 8 commits September 26, 2024 16:52

Update README.md

2e74dc4

Co-authored-by: Madeesh Kannan <[email protected]>

Update haystack_experimental/components/extractors/llm_metadata_extractor.py

ba51ca1

Co-authored-by: Madeesh Kannan <[email protected]>

Update haystack_experimental/components/extractors/llm_metadata_extractor.py

3e4b28e

Co-authored-by: Madeesh Kannan <[email protected]>

Update haystack_experimental/components/extractors/llm_metadata_extractor.py

237eb72

Co-authored-by: Madeesh Kannan <[email protected]>

fixes

13030b7

fixing linted files

46af362

fixing

a826f11

Merge branch 'main' into add-metadata-extractor

b999b5b

davidsbatista requested a review from shadeMe September 26, 2024 15:19

shadeMe suggested changes Sep 27, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

davidsbatista added 6 commits September 27, 2024 14:08

removing tuples from the output two aligned lists

423f489

Merge branch 'main' into add-metadata-extractor

194a3fd

removed unused import

0da5082

chaning errors to a dictionary

a9fa803

...

1973bd3

fixing LLMProvider

359e1aa

shadeMe approved these changes Sep 30, 2024

haystack_experimental/components/extractors/llm_metadata_extractor.py

davidsbatista and others added 3 commits September 30, 2024 12:53

Update haystack_experimental/components/extractors/llm_metadata_extractor.py

693a09c

Co-authored-by: Madeesh Kannan <[email protected]>

more fixes

8cfb58f

fixing linting issue

bf6adf5

davidsbatista merged commit 50ce5fd into main Sep 30, 2024
10 checks passed

davidsbatista deleted the add-metadata-extractor branch September 30, 2024 11:22

anakin87 mentioned this pull request Oct 1, 2024

AmazonBedrockGenerator is not imported lazily #109

Closed

anakin87 mentioned this pull request Oct 28, 2024

MetadataBuilder deepset-ai/haystack#5702

Closed
