-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add File type to preview package #5873
Conversation
A bit of a radical take here: how is this different than sending around the byte steam as it is? My question comes because:
The way I interpreted the original issue was to find a way to make paths and byte streams easy to handle in an agnostic way, without having to care whether we were dealing with a file or with a byte stream. This PR however doesn't cover this situation, so I'm unsure what it wants to accomplish. |
Pull Request Test Coverage Report for Build 6302613835
💛 - Coveralls |
Metadata mainly, if the stream comes from reading a file, that could be easily tracked. |
But the dataclass has no trace of metadata being supported... It's about adding it later? |
Yes, I mention it in the PR body but I see it backfired, I should've just added it |
🤦♀️ Sorry for that! Ok with metadata makes a ton more sense. One more question comes up though: we would be able to send metadata around with bytes, but not with paths. Is the plan to add something similar for paths later, or to include Paths in this dataclass somehow? From the description it sounds like this will be "reserved" for bytestreams only. And imho it can make signatures a bit confusing if they accept both bytestreams and paths, because you will need to send metadata separately but for the paths only... |
I don't think we need content type for this, @component.output_types(json=Optional[Blob], pdf=Optional[Blob])
def run(self, url: str):
try:
response = self._get_response(url)
content_type = self._get_content_type(response)
if content_type == "application/pdf"
return {"pdf": Blob.from_bytes(response.content)} |
Right right. Now I remember. Very cool 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this and try it out further by creating more examples! 👍 I don't see a reason to block this. It's also easy to revert. The example you shared helped me to understand how you envision this to be used. Let's have more of those rather than theoretical discussions. Looking forward to the feedback from @vblagoje and his experiments with WebRetriever. Metadata would be great to add next so that we can keep track of the source where preprocessed Documents originated from.
Ok let me add metadata support and we can merge, still easy to revert / evolve! |
Apologies for the delayed feedback; I wanted to play with this class in demos so that the feedback is relevant and hopefully constructive. In general, I like Blob, but I wanted to make a few suggestions: 1. Rename
|
Or perhaps ByteStream 🤷♀️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid, perhaps rename the file and add a single test per "new" method as well?
All seems good but I haven't tried setting mime-type (content-type) in metadata. Because we'll need a mime-type being set somehow and accessed in #5965 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, no frozen instance issues when we assign key/value pairs in metadata. Looks gtg.
@masci one sec, what happened with |
data: bytes | ||
metadata: Dict[str, Any] = field(default_factory=dict, hash=False) | ||
|
||
def to_file(self, destination_path: Path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking this PR of course, but why we have a to_file
method and not a to_string
method? Is there a real issue or it's just that we don't expect it to be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter, to_file
was inspired by Tika working only on files, to_string
doesn't have a real application so far
Related Issues
Proposed Changes:
Introduce a new type that can be send over different components in a pipeline:
ByteStream
How did you test it?
Unit tests
Notes for the reviewer
I started from #5856 but I couldn't really make work a data class with two optional fields: path and blob. In the end, I think that was code smell and I iterated aiming at simplicity. The idea is that if you want to exchange file paths across components, you can just use
List[Path]
. On the contrary, sending around bytes does deserve a nice and practical abstraction, so I thought about going withByteStream
directly, to avoid implying this has anything to do with files. The type comes with 2 utility methods that I imagine users would need: one method to save theByteStream
to a binary file, and another one to create aByteStream
instance by reading a file on disk.I think we should also add metadata, but for the initial discussion let's focus on something small.Metadata added.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.