Is your feature request related to a problem? Please describe.
Converters such as PyPDFToDocument take file paths as input, plus an optional meta list of dicts, for example: indexing_pipeline.run({"PDFToTextConverter": {"sources": file_paths, "meta": file_metas}})
Currently, UnstructuredFileConverter only supports the paths parameter, for example: indexing_pipeline.run({"UnstructuredFileConverter": {"paths": file_paths}}). Adding a meta list of dicts results in a crash. The converter also automatically extracts its own meta (which is good, but I would like to extend it).
Describe the solution you'd like
Modify UnstructuredFileConverter to accept an optional meta list of dicts whose entries are added to the resulting Documents.
Describe alternatives you've considered
Maybe a custom component, something like MetadataWritter, that could add a meta dict after the converter and before the writer. But it could get messy: UnstructuredFileConverter chunks documents, so the meta list of dicts, which has the same length as your file_paths list, will not have the same length as the list of documents output by UnstructuredFileConverter. You would then also have to merge two dicts, because a meta dict already exists in each doc.
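To illustrate the length mismatch described above, here is a minimal sketch of what such a post-converter merge step might look like. It is plain Python, not Haystack code: the Doc class, the merge_meta_by_source function, and the "file_path" meta key are all hypothetical names chosen for illustration. It assumes each chunk records the file it came from in its own meta, and it lets user-supplied keys win on conflict, which is a design choice rather than anything the converter prescribes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document chunk; not the Haystack class."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def merge_meta_by_source(
    docs: List[Doc],
    paths: List[str],
    metas: List[Dict[str, Any]],
    source_key: str = "file_path",  # assumed key under which a chunk stores its source
) -> List[Doc]:
    """Attach one meta dict per input file to every chunk produced from that file."""
    if len(paths) != len(metas):
        raise ValueError("paths and metas must have the same length")

    # Index the user-supplied meta by file path so chunks can look it up.
    meta_by_path = {str(p): m for p, m in zip(paths, metas)}

    merged = []
    for doc in docs:
        extra = meta_by_path.get(str(doc.meta.get(source_key)), {})
        # Keep the converter's own meta; user-supplied keys override on conflict.
        merged.append(Doc(content=doc.content, meta={**doc.meta, **extra}))
    return merged


# Two files, three chunks: the meta list has length 2, the document list length 3.
chunks = [
    Doc("intro", {"file_path": "a.pdf", "page_number": 1}),
    Doc("body", {"file_path": "a.pdf", "page_number": 2}),
    Doc("notes", {"file_path": "b.pdf", "page_number": 1}),
]
result = merge_meta_by_source(
    chunks, ["a.pdf", "b.pdf"], [{"author": "Ann"}, {"author": "Bob"}]
)
```

Keying on the source path avoids relying on list positions, which no longer line up once one file yields several chunks.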
Unfortunately, the implementation might be non-trivial because the UnstructuredFileConverter also accepts directories as paths and automatically searches for files.
However, if you want to open a PR that covers the simplest use cases, feel free to do so.
Otherwise, we'll take a look in the future...
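To make the directory concern concrete, here is a minimal sketch of one possible policy, written in plain Python rather than against the converter itself. The function name expand_paths_with_meta is hypothetical, and both the recursive search and the rule that a directory's meta applies to every file found inside it are assumptions made for illustration, not the converter's documented behavior.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union


def expand_paths_with_meta(
    paths: List[Union[str, Path]],
    metas: Optional[List[Dict[str, Any]]] = None,
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Pair every concrete file with a meta dict, expanding directories."""
    if metas is None:
        metas = [{} for _ in paths]
    if len(paths) != len(metas):
        raise ValueError("paths and metas must have the same length")

    pairs: List[Tuple[Path, Dict[str, Any]]] = []
    for raw_path, meta in zip(paths, metas):
        path = Path(raw_path)
        if path.is_dir():
            # Assumed policy: every file under the directory inherits its meta.
            for file in sorted(p for p in path.rglob("*") if p.is_file()):
                pairs.append((file, dict(meta)))
        else:
            pairs.append((path, dict(meta)))
    return pairs
```

Each file gets its own copy of the meta dict, so later per-document changes do not leak between files that came from the same directory.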
Working on it right now, I just have to fix a test, but it looks very doable :)