feat: metadata extractor based on a LLM #92

Merged
merged 56 commits into from
Sep 30, 2024

Conversation

davidsbatista
Contributor

@davidsbatista davidsbatista commented Sep 13, 2024

Related Issues

Proposed Changes:

  • Adds a new component that relies on an LLM and a prompt to extract metadata from a Document. It is meant to be used within a pipeline during indexing. This was a request from @sjrl; have a look, you can pull this branch and test it.

How did you test it?

  • added unit tests and integration tests
  • manual verification

Checklist

@coveralls

coveralls commented Sep 13, 2024

Pull Request Test Coverage Report for Build 11104881204

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.7%) to 88.558%

Totals Coverage Status
Change from base Build 11070052105: -1.7%
Covered Lines: 1486
Relevant Lines: 1678

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review September 13, 2024 19:51
@davidsbatista davidsbatista requested a review from a team as a code owner September 13, 2024 19:52
@davidsbatista davidsbatista requested review from anakin87, shadeMe and sjrl and removed request for a team September 13, 2024 19:52
@julian-risch
Member

@davidsbatista We should list the new component in the catalog as part of this PR: https://github.com/deepset-ai/haystack-experimental?tab=readme-ov-file#experiments-catalog

@sjrl
Collaborator

sjrl commented Sep 17, 2024

Hey @davidsbatista, thanks for your work on this!

One use case that came to mind, which this component doesn't yet handle (at least not elegantly), is extracting global information about a Document. For example, let's say I want to extract the title from a really long PDF (e.g. more than 200 pages) and store this title in all future chunks I create from this PDF file. With the way this component works, I would need to do something like:

File -> PDFConverter -> LLMMetadataExtractor -> DocumentSplitter -> DocumentWriter

But since the PDF is so long, there is a good chance that it will exceed the context window of the LLM. In reality, all I need to do is send the first few pages of the file to find the title, so this version also feels inefficient in terms of latency and cost.

So I was wondering if we could brainstorm a way to handle this type of use case as well? Maybe we could simply allow the user to specify a page range at init time, so we don't need to send the whole Document to the LLM?
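
A minimal sketch of what the page-range idea could look like, independent of the component in this PR. It assumes the converter separates pages in the Document text with a form feed character (`"\f"`); the function name `select_pages` and its parameters are hypothetical and are not part of the merged API.

```python
from typing import Optional


def select_pages(
    text: str,
    first_page: int = 1,
    last_page: Optional[int] = None,
    separator: str = "\f",  # assumption: pages are delimited by a form feed
) -> str:
    """Return only the pages in [first_page, last_page] (1-based, inclusive).

    Hypothetical helper: trims the text before it is placed in the LLM prompt,
    so that a very long file does not have to be sent in full.
    """
    if first_page < 1:
        raise ValueError("first_page must be >= 1")

    pages = text.split(separator)

    # With no upper bound, keep everything from first_page to the end.
    if last_page is None:
        last_page = len(pages)
    if last_page < first_page:
        raise ValueError("last_page must be >= first_page")

    # Slicing tolerates a last_page beyond the actual number of pages.
    return separator.join(pages[first_page - 1 : last_page])


# Usage: only the first two pages of a three-page text would reach the LLM.
full_text = "Title page\fTable of contents\fChapter 1"
prompt_text = select_pages(full_text, first_page=1, last_page=2)
```

With something like this applied before the LLM call, the metadata extracted from the first pages (e.g. the title) could still be stored on the full Document and inherited by the chunks produced by the splitter.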

@davidsbatista davidsbatista requested a review from a team as a code owner September 17, 2024 15:39
@davidsbatista davidsbatista requested review from dfokina and removed request for a team September 17, 2024 15:39
@davidsbatista davidsbatista requested review from tstadel and removed request for anakin87 and tstadel September 26, 2024 08:35
@davidsbatista
Contributor Author

@shadeMe we would be happy to merge it, do you still want to have a review?

Collaborator

@sjrl sjrl left a comment

Thanks for the work on this!

Contributor

@shadeMe shadeMe left a comment

Looks like you've missed a comment above.

@davidsbatista davidsbatista merged commit 50ce5fd into main Sep 30, 2024
10 checks passed
@davidsbatista davidsbatista deleted the add-metadata-extractor branch September 30, 2024 11:22
Successfully merging this pull request may close these issues.

DocumentsBuilder
7 participants