UnstructuredFileConverter: support custom metadata dict #241

Closed
lambda-science opened this issue Jan 18, 2024 · 3 comments · Fixed by #242
Labels: feature request (Ideas to improve an integration)

Comments

@lambda-science
Contributor

Is your feature request related to a problem? Please describe.
Converters such as PyPDFToDocument take a list of sources and an optional meta list of dicts as input, e.g. indexing_pipeline.run({"PDFToTextConverter": {"sources": file_paths, "meta": file_metas}})
Currently UnstructuredFileConverter only supports the paths parameter, e.g. indexing_pipeline.run({"UnstructuredFileConverter": {"paths": file_paths}}); adding a meta list of dicts results in a crash. It also automatically extracts its own metadata (which is good, but I would like to extend it).

Describe the solution you'd like
Modify the UnstructuredFileConverter to accept an optional meta list of dicts whose entries are added to the resulting Documents.
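
A minimal sketch of the requested behaviour, assuming nothing about how #242 actually implements it. `Doc`, `fake_partition`, and `convert_with_meta` are hypothetical stand-ins, not Haystack or Unstructured APIs; the only point is that each file's user-supplied dict is merged into the metadata of every chunk produced from that file.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document (not the Haystack class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def fake_partition(path: str) -> List[Doc]:
    """Hypothetical extractor: pretends every file yields two chunks with extracted meta."""
    return [
        Doc(content=f"{path} chunk {i}", meta={"file_path": path, "chunk": i})
        for i in range(2)
    ]


def convert_with_meta(
    paths: List[str], meta: Optional[List[Dict[str, Any]]] = None
) -> List[Doc]:
    """Merge one user meta dict per input path into every chunk of that path."""
    if meta is None:
        meta = [{} for _ in paths]
    if len(meta) != len(paths):
        raise ValueError("meta must contain exactly one dict per path")
    docs: List[Doc] = []
    for path, user_meta in zip(paths, meta):
        for doc in fake_partition(path):
            # Assumption: user-supplied keys win over extracted ones on conflict.
            doc.meta = {**doc.meta, **user_meta}
            docs.append(doc)
    return docs


docs = convert_with_meta(
    ["a.pdf", "b.pdf"], meta=[{"owner": "alice"}, {"owner": "bob"}]
)
```

Here two input files produce four documents, and each one carries both the extracted keys and the `owner` key of the file it came from.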

Describe alternatives you've considered
Maybe a custom component, something like a MetadataWritter, that adds the meta dicts after the converter and before the writer. But that could get messy: UnstructuredFileConverter chunks documents, so the meta list of dicts (which has the same length as your file_paths) will not have the same length as the converter's output documents, and you then have to merge two dicts because a meta dict already exists on each document.
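
The awkwardness of that alternative can be sketched as follows. This is a hypothetical post-processing step, not an existing component: because the converter emits more documents than there are input files, the per-file dicts cannot simply be zipped with the output and have to be looked up by some key. The sketch assumes each document's extracted meta carries a `file_path` key, which is an assumption made for illustration, and it uses plain dicts in place of Document objects.

```python
from typing import Any, Dict, List


def add_meta_after_conversion(
    docs: List[Dict[str, Any]],
    file_paths: List[str],
    file_metas: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge per-file meta into already-chunked documents (plain dicts here)."""
    # There is one meta dict per file, so index them by path to find the
    # right one for each chunk.
    meta_by_path = dict(zip(file_paths, file_metas))
    merged = []
    for doc in docs:
        extra = meta_by_path.get(doc["meta"].get("file_path"), {})
        # Second merge step: the chunk already has extracted meta of its own.
        merged.append({**doc, "meta": {**doc["meta"], **extra}})
    return merged


chunks = [
    {"content": "intro", "meta": {"file_path": "a.pdf", "page": 1}},
    {"content": "body", "meta": {"file_path": "a.pdf", "page": 2}},
    {"content": "notes", "meta": {"file_path": "b.pdf", "page": 1}},
]
result = add_meta_after_conversion(
    chunks, ["a.pdf", "b.pdf"], [{"lang": "en"}, {"lang": "fr"}]
)
```

The lookup only works if the converter's output records which file each chunk came from, which is part of why doing the merge inside the converter is simpler.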

@lambda-science lambda-science added the feature request Ideas to improve an integration label Jan 18, 2024
@anakin87
Member

This sounds like a reasonable feature request!

Unfortunately, the implementation might be non-trivial because the UnstructuredFileConverter also accepts directories as paths and automatically searches for files.

However, if you want to open a PR that covers the simplest use cases, feel free to do so.
Otherwise, we'll take a look in the future...
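
To illustrate why directories complicate things: once a directory is expanded, the number of files no longer matches the number of entries in `paths`, so a one-dict-per-path meta list has no obvious alignment. One possible policy, shown here purely as an assumption and not necessarily what the converter does, is for every file found under a directory to inherit that directory's dict. `expand_with_meta` is a hypothetical helper.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple


def expand_with_meta(
    paths: List[str], meta: List[Dict[str, Any]]
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Expand directories into files, pairing each file with its source path's meta."""
    if len(paths) != len(meta):
        raise ValueError("meta must contain exactly one dict per path")
    pairs: List[Tuple[Path, Dict[str, Any]]] = []
    for raw, entry_meta in zip(paths, meta):
        path = Path(raw)
        if path.is_dir():
            # Assumed policy: files in a directory inherit the directory's meta.
            files = sorted(p for p in path.rglob("*") if p.is_file())
        else:
            files = [path]
        # Copy the dict so later per-file edits do not leak between files.
        pairs.extend((f, dict(entry_meta)) for f in files)
    return pairs
```

Given one directory and one file as `paths`, plus two meta dicts, the result has one pair per discovered file rather than two.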

@lambda-science
Contributor Author

> This sounds like a reasonable feature request!
>
> Unfortunately, the implementation might be non-trivial because the UnstructuredFileConverter also accepts directories as paths and automatically searches for files.
>
> However, if you want to open a PR that covers the simplest use cases, feel free to do so. Otherwise, we'll take a look in the future...

Working on it right now, I just have to fix a test, but it looks very doable :)

@lambda-science
Contributor Author

Proposed MR: #242
