feat: metadata extractor based on a LLM #92
Pull Request Test Coverage Report for Build 11104881204 (Coveralls)
@davidsbatista We should list the new component in the catalog as part of this PR: https://github.com/deepset-ai/haystack-experimental?tab=readme-ov-file#experiments-catalog
haystack_experimental/components/extractors/llm_metadata_extractor.py
Hey @davidsbatista, thanks for your work on this! One use case this component doesn't yet handle, at least not elegantly, is extracting global information about a Document. For example, say I want to extract the title from a really long PDF (more than 200 pages) and store it in every chunk I later create from that file. With the way this component works, I would need something like: File -> PDFConverter -> LLMMetadataExtractor -> DocumentSplitter -> DocumentWriter. With such a long PDF, there is a good chance the Document will exceed the LLM's context window. In reality I only need to send the first few pages to find the title, so this setup is also inefficient in latency and cost. Could we brainstorm a way to handle this type of use case as well? Maybe we could simply let the user specify a page range at init time, so we don't need to send the whole Document to the LLM.
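To make the page-range idea concrete, here is a minimal sketch in plain Python. It is not the component's actual API: `select_pages`, `propagate_meta`, and the `page_range` parameter are hypothetical names, and it assumes the converter marks page boundaries in the Document content with a form feed character (`\f`).

```python
from typing import Dict, List, Optional, Tuple

# Assumption: the PDF converter separates pages with a form feed character.
PAGE_BREAK = "\f"


def select_pages(text: str, page_range: Optional[Tuple[int, int]] = None) -> str:
    """Return only the pages inside the 1-based, inclusive page_range.

    With page_range=None the full text is returned unchanged.
    """
    if page_range is None:
        return text
    start, end = page_range
    if start < 1 or end < start:
        raise ValueError("page_range must satisfy 1 <= start <= end")
    pages = text.split(PAGE_BREAK)
    # Slicing past the last page is safe: it simply stops at the end.
    return PAGE_BREAK.join(pages[start - 1:end])


def propagate_meta(doc_meta: Dict[str, str], chunks: List[str]) -> List[Dict[str, object]]:
    """Copy document-level metadata (e.g. the extracted title) onto every chunk."""
    return [{"content": chunk, "meta": dict(doc_meta)} for chunk in chunks]


if __name__ == "__main__":
    full_text = "Title page\fAbstract\fChapter 1\fChapter 2"
    # Only this excerpt would be sent to the LLM, not the whole Document.
    excerpt = select_pages(full_text, page_range=(1, 2))
    print(repr(excerpt))
```

The extractor would send only the excerpt to the LLM, and the resulting metadata would then be copied to every chunk produced by the splitter, so the context window and cost depend on the page range rather than on the length of the PDF.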
@shadeMe we would be happy to merge it. Do you still want to review it?
Thanks for the work on this!
Co-authored-by: Madeesh Kannan <[email protected]>
…ctor.py Co-authored-by: Madeesh Kannan <[email protected]>
Looks like you've missed a comment above.
Related Issues
Proposed Changes:
How did you test it?
Checklist
Conventional commit types for the PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.