UnstructuredFileConverter: support custom metadata dict #241

lambda-science · 2024-01-18T10:58:52Z

Is your feature request related to a problem? Please describe.
Converters such as PyPDFToDocument take file paths as input plus an optional meta list of dicts, for example: indexing_pipeline.run({"PDFToTextConverter": {"sources": file_paths, "meta": file_metas}})
Currently UnstructuredFileConverter only supports the paths parameter, for example: indexing_pipeline.run({"UnstructuredFileConverter": {"paths": file_paths}}). Adding a meta list of dicts results in a crash. It also automatically extracts its own meta (which is good, but I would like to extend it).

Describe the solution you'd like
Modify the UnstructuredFileConverter to support an optional meta list of dicts whose entries are added to the meta of the resulting Documents.
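
To make the request concrete, here is a minimal sketch of the merging behaviour I have in mind. It is plain Python and not the converter's actual code: the function name attach_user_meta and the shape of its inputs (one list of extracted chunk metas per file) are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional


def attach_user_meta(
    extracted_per_file: List[List[Dict[str, Any]]],
    user_metas: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Merge one user meta dict per file into the meta of every chunk of that file.

    Hypothetical helper: `extracted_per_file[i]` holds the auto-extracted meta
    dicts of all chunks produced from the i-th input file.
    """
    if user_metas is None:
        user_metas = [{} for _ in extracted_per_file]
    if len(user_metas) != len(extracted_per_file):
        raise ValueError("Expected exactly one meta dict per input file")

    merged: List[Dict[str, Any]] = []
    for chunk_metas, user_meta in zip(extracted_per_file, user_metas):
        for chunk_meta in chunk_metas:
            # Later keys win, so user-supplied values override extracted ones on conflict.
            merged.append({**chunk_meta, **user_meta})
    return merged
```

Because the user meta is applied per file before the chunks are flattened, the length mismatch between input files and output documents does not matter. Whether user keys should override extracted keys on conflict is a design choice; the sketch lets the user win.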

Describe alternatives you've considered
Maybe a custom component, something like a MetadataWritter, that adds the meta dicts after the converter and before the writer. But that could get messy: UnstructuredFileConverter chunks documents, so the meta list of dicts, which has the same length as your file_paths, will not have the same length as the documents output by UnstructuredFileConverter. You would then also have to merge two dicts, because a meta dict already exists in each document.


anakin87 · 2024-01-18T11:09:56Z

This sounds like a reasonable feature request!

Unfortunately, the implementation might be non-trivial because the UnstructuredFileConverter also accepts directories as paths and automatically searches for files.

However, if you want to open a PR that covers the simplest use cases, feel free to do so.
Otherwise, we'll take a look in the future...
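
One possible way to cover the simplest use cases, sketched under assumptions: accept a per-path list of meta dicts only when no path is a directory, and otherwise allow a single dict shared by every discovered file. The function name resolve_paths_and_meta and its return shape are hypothetical and are not taken from the converter.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union


def resolve_paths_and_meta(
    paths: List[Union[str, Path]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Pair every file to convert with the user meta that should be attached to it."""
    resolved = [Path(p) for p in paths]
    has_directory = any(p.is_dir() for p in resolved)

    if isinstance(meta, list):
        # A list cannot be aligned with files discovered inside a directory.
        if has_directory:
            raise ValueError("A list of meta dicts is not supported when paths contain directories")
        if len(meta) != len(resolved):
            raise ValueError("The meta list must have the same length as paths")
        return list(zip(resolved, meta))

    shared = meta or {}
    files: List[Path] = []
    for p in resolved:
        if p.is_dir():
            # Search the directory recursively; sorting keeps the order deterministic.
            files.extend(sorted(f for f in p.rglob("*") if f.is_file()))
        else:
            files.append(p)
    # Copy the shared dict so later per-document edits do not leak between files.
    return [(f, dict(shared)) for f in files]
```

This sidesteps the alignment problem instead of solving it: directories remain usable, but only with metadata that applies to all files.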

lambda-science · 2024-01-18T12:20:59Z

Working on it right now, I just have to fix a test, but it looks very doable :)

lambda-science · 2024-01-18T12:36:02Z

Proposed MR: #242

lambda-science added the "feature request" label (Ideas to improve an integration) Jan 18, 2024

lambda-science mentioned this issue Jan 18, 2024

Feat: UnstructuredFileConverter meta field #242

Merged

masci assigned anakin87 Jan 22, 2024

anakin87 closed this as completed in #242 Jan 23, 2024
