feat: Support bytestream in Unstructured API #1082
base: main
@vblagoje, I tried to address your request in this PR.
@alperkaya can you please sign the CLA? Otherwise we can't merge this. :)
done ;)
...tions/unstructured/src/haystack_integrations/components/converters/unstructured/converter.py
for filepath, metadata in tqdm(
    zip(all_filepaths, meta_list), desc="Converting files to Haystack Documents", disable=not self.progress_bar
    zip(all_filepaths, meta_list[:len(all_filepaths)]), desc="Converting files to Haystack Documents"
This is wrong.
If you have a combination of paths and ByteStream in your sources and a list for meta, you risk assigning the wrong meta to the wrong Document.
You can verify with something like this:
from pathlib import Path

from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.unstructured import (
    UnstructuredFileConverter,
)

converter = UnstructuredFileConverter()
sources = [
    "README.md",
    ByteStream(data=b"content", meta={"file_path": "some_file.md"}),
    Path(__file__),
    ByteStream(data=b"content", meta={"file_path": "yet_another_file.md"}),
    ByteStream(data=b"content", meta={"file_path": "my_file.md"}),
]
meta = [
    {"type": "str"},
    {"type": "ByteStream"},
    {"type": "Path"},
    {"type": "ByteStream"},
    {"type": "ByteStream"},
]
res = converter.run(sources=sources, meta=meta)
Also, I notice that some meta fields are completely lost.
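The misalignment described above can be reproduced without the converter. The following is a minimal sketch, assuming the paths are filtered out of `sources` first and then zipped with `meta_list[:len(all_filepaths)]`, as in the diff line under review. `FakeByteStream`, `pair_by_slicing`, and `pair_by_position` are hypothetical names used only for illustration; they are not part of Haystack or of this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple, Union


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for ByteStream, so the sketch has no dependencies."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


Source = Union[str, FakeByteStream]
Meta = Dict[str, Any]


def pair_by_slicing(sources: List[Source], meta: List[Meta]) -> List[Tuple[str, Meta]]:
    """Filter the paths first, then slice meta to the same length (the pattern in the diff)."""
    paths = [s for s in sources if isinstance(s, str)]
    return list(zip(paths, meta[: len(paths)]))


def pair_by_position(sources: List[Source], meta: List[Meta]) -> List[Tuple[str, Meta]]:
    """Zip sources with meta first and filter afterwards, so each path keeps its own meta."""
    return [(s, m) for s, m in zip(sources, meta) if isinstance(s, str)]


sources = ["README.md", FakeByteStream(b"content"), "notes.md"]
meta = [{"type": "str"}, {"type": "ByteStream"}, {"type": "second str"}]

sliced = pair_by_slicing(sources, meta)
positional = pair_by_position(sources, meta)

# With slicing, "notes.md" receives the meta intended for the ByteStream.
print(sliced[1])
# With positional pairing, "notes.md" keeps its own meta.
print(positional[1])
```

Pairing before filtering is only one way to keep the association intact; the point is that the index of a meta entry has to follow the index of its source in the original list.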
This implementation has quite a few problems. I would suggest taking a look at the existing converters we have in core.
https://github.com/deepset-ai/haystack/tree/main/haystack/components/converters
Hi @silvanocerza, I tried to address your comment by checking the existing converters. This version covers these cases without losing meta fields. Case 1: Files with Meta as None
This is still not working as expected. I strongly suggest you copy the implementation from core.
Hi, in the core repo, the solution handles either file paths or ByteStreams. In this PR, however, I'm handling not just file paths and ByteStreams but also directories, where I need to fetch all files without entering subfolders. Given these additional requirements, could we explore how to modify the core solution to meet this use case, or would an alternative approach be better suited here?
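One possible way to reconcile directories with per-source meta is to expand every source into `(item, meta)` pairs before any conversion happens. The sketch below is an assumption about how that could look, not the core implementation and not the code in this PR. `expand_sources` is a hypothetical helper, raw `bytes` stands in for a ByteStream, and a directory is expanded to its direct child files only, each receiving a copy of the directory's meta.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple, Union

Source = Union[str, Path, bytes]  # bytes stands in for a ByteStream in this sketch
Meta = Dict[str, Any]


def expand_sources(sources: List[Source], meta: List[Meta]) -> List[Tuple[Source, Meta]]:
    """Return (item, meta) pairs, expanding directories one level deep.

    Files found directly inside a directory inherit a copy of that
    directory's meta; subfolders are not entered.
    """
    pairs: List[Tuple[Source, Meta]] = []
    for source, source_meta in zip(sources, meta):
        if isinstance(source, (str, Path)) and Path(source).is_dir():
            # iterdir() lists direct children only, so there is no recursion.
            for child in sorted(Path(source).iterdir()):
                if child.is_file():
                    pairs.append((child, dict(source_meta)))
        else:
            # Plain files and in-memory streams pass through unchanged.
            pairs.append((source, dict(source_meta)))
    return pairs
```

Because each pair is built while the source and its meta are still aligned by position, later filtering by type (paths versus streams) cannot shift one list against the other.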
Related Issues
ByteStream input in run method #1075
Proposed Changes:
The Unstructured API can now also be called with ByteStream, in addition to Path.
How did you test it?
Added extra unit tests
Notes for the reviewer
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.