feat: add File type to preview package #5873

masci · 2023-09-25T14:54:01Z

Related Issues

fixes Add file abstraction #5856

Proposed Changes:

Introduce a new type that can be sent between components in a pipeline: ByteStream

How did you test it?

Unit tests

Notes for the reviewer

I started from #5856, but I couldn't really make a data class with two optional fields, path and blob, work. In the end, I think that was a code smell, so I iterated aiming at simplicity. The idea is that if you want to exchange file paths across components, you can just use List[Path]. Sending bytes around, on the other hand, does deserve a nice and practical abstraction, so I thought about going with ByteStream directly, to avoid implying this has anything to do with files. The type comes with 2 utility methods that I imagine users would need: one to save the ByteStream to a binary file, and another to create a ByteStream instance by reading a file on disk.

~~I think we should also add metadata, but for the initial discussion let's focus on something small.~~ Metadata added.
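
For readers skimming the thread, here is a minimal sketch of the type described above, not the merged implementation. The `data` and `metadata` fields and the `to_file` signature follow the diff quoted later in this conversation; the `from_file_path` name, the `frozen=True` setting, and the `file_path` metadata key are assumptions made for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass(frozen=True)  # assumption: frozen, as the FrozenInstanceError remark in the thread suggests
class ByteStream:
    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

    def to_file(self, destination_path: Path) -> None:
        # Write the raw bytes to a binary file.
        destination_path.write_bytes(self.data)

    @classmethod
    def from_file_path(cls, filepath: Path) -> "ByteStream":
        # Hypothetical name; also records the source path in metadata (an assumption).
        return cls(data=filepath.read_bytes(), metadata={"file_path": str(filepath)})


# Round trip: save a stream to disk, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "payload.bin"
    ByteStream(data=b"\x00\x01hello").to_file(target)
    restored = ByteStream.from_file_path(target)
```

Recording the source path in metadata is one way a stream read from a file could be tracked later on.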

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

ZanSara · 2023-09-25T15:10:25Z

A bit of a radical take here: how is this different than sending around the byte stream as it is?

My question comes because:

The dataclass is not attaching any additional information to the stream (like type or any information that can help routing/processing the bytes)
The two methods are very simple (arguably trivial) ways to handle the stream.
Having this abstraction requires users to be aware of it if the components using it happen to be at the start of the pipeline (which is not impossible if we consider file converters), so it may add some complexity on the user's end.

The way I interpreted the original issue was to find a way to make paths and byte streams easy to handle in an agnostic way, without having to care whether we were dealing with a file or with a byte stream. This PR however doesn't cover this situation, so I'm unsure what it wants to accomplish.

coveralls · 2023-09-25T15:13:58Z

Pull Request Test Coverage Report for Build 6302613835

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.02%) to 49.89%

Totals
Change from base Build 6302282666:	0.02%
Covered Lines:	12226
Relevant Lines:	24506

💛 - Coveralls

masci · 2023-09-25T15:15:53Z

> A bit of a radical take here: how is this different than sending around the byte stream as it is?

Metadata mainly, if the stream comes from reading a file, that could be easily tracked.

ZanSara · 2023-09-25T15:45:04Z

> Metadata mainly, if the stream comes from reading a file, that could be easily tracked.

But the dataclass has no trace of metadata being supported... It's about adding it later?

masci · 2023-09-25T15:50:05Z

> > Metadata mainly, if the stream comes from reading a file, that could be easily tracked.
>
> But the dataclass has no trace of metadata being supported... It's about adding it later?

Yes, I mentioned it in the PR body, but I see it backfired. I should've just added it.

ZanSara · 2023-09-26T08:34:23Z

🤦‍♀️ Sorry for that!

Ok, with metadata this makes a ton more sense. One more question comes up though: we would be able to send metadata around with bytes, but not with paths. Is the plan to add something similar for paths later, or to include Paths in this dataclass somehow? From the description it sounds like this will be "reserved" for bytestreams only. And imho it can make signatures a bit confusing if they accept both bytestreams and paths, because you will need to send metadata separately, but only for the paths...

vblagoje · 2023-09-26T13:23:49Z

@ZanSara @masci should we have content/mime type as a Blob property? If LinkContentFetcher is returning a Blob (without content type) how would it be routed to the appropriate ToDocument converter component?

masci · 2023-09-26T13:49:32Z

> @ZanSara @masci should we have content/mime type as a Blob property? If LinkContentFetcher is returning a Blob (without content type) how would it be routed to the appropriate ToDocument converter component?

I don't think we need content type for this: LinkContentFetcher knows the content type and would return the Blob instance in the proper output, say:

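    @component.output_types(json=Optional[Blob], pdf=Optional[Blob])
    def run(self, url: str):
        # _get_response and _get_content_type are helpers on the component
        response = self._get_response(url)
        content_type = self._get_content_type(response)
        if content_type == "application/pdf":
            # Route the payload to the output matching its content type
            return {"pdf": Blob.from_bytes(response.content)}
        # other content types and error handling omitted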

vblagoje · 2023-09-26T14:05:03Z

        if content_type == "application/pdf":
            return {"pdf": Blob.from_bytes(response.content)}

Right right. Now I remember. Very cool 🚀
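
The routing idea discussed above can be sketched without Haystack as a plain function that picks a named output from the content type. This is only an illustration: the `Blob` stand-in, the `OUTPUT_BY_CONTENT_TYPE` mapping, and the `route` function are hypothetical and not part of the PR.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Blob:
    # Stand-in for the type under discussion; it carries raw bytes only.
    data: bytes

    @classmethod
    def from_bytes(cls, data: bytes) -> "Blob":
        return cls(data=data)


# Hypothetical mapping from content type to the name of a component output.
OUTPUT_BY_CONTENT_TYPE = {"application/pdf": "pdf", "application/json": "json"}


def route(content: bytes, content_type: str) -> Dict[str, Blob]:
    # The producer knows the content type, so it selects the output itself
    # and the Blob does not need to carry a mime type for routing.
    output_name = OUTPUT_BY_CONTENT_TYPE.get(content_type)
    if output_name is None:
        return {}  # unknown content type: nothing is emitted in this sketch
    return {output_name: Blob.from_bytes(content)}
```

Downstream converters would then be connected to the `pdf` or `json` output, which is where the routing decision becomes visible in the pipeline.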

julian-risch

Let's merge this and try it out further by creating more examples! 👍 I don't see a reason to block this. It's also easy to revert. The example you shared helped me to understand how you envision this to be used. Let's have more of those rather than theoretical discussions. Looking forward to the feedback from @vblagoje and his experiments with WebRetriever. Metadata would be great to add next so that we can keep track of the source that preprocessed Documents originated from.

masci · 2023-09-28T15:15:39Z

Ok let me add metadata support and we can merge, still easy to revert / evolve!

vblagoje · 2023-09-28T15:18:39Z

Apologies for the delayed feedback; I wanted to play with this class in demos so that the feedback is relevant and hopefully constructive. In general, I like Blob, but I wanted to make a few suggestions:

1. Rename `Blob` to `BinaryStream`:

As we all know the term "Blob" is often associated with "Binary Large OBjects" in databases. While it implies the storage of binary data, it doesn't necessarily convey the operations or the handling of that data. IMHO, the name is a bit off.
BinaryStream is the best I could think of. Open to suggestions.

2. Add a `from_bytes` class method:

A method like from_bytes would allow for more flexibility in creating a BinaryStream object. Users might have raw bytes data in memory (e.g. from Response object) they want to convert into a BinaryStream without first saving it to a file. This method would facilitate that functionality.

    @classmethod
    def from_bytes(cls, data: io.BytesIO) -> "BinaryStream":
        return cls(data=data.read())

3. Add a `from_text` class method:

Providing a method to convert text into a BinaryStream object directly would be highly beneficial for users who deal with text data but want to abstract it as BinaryStream for signature consistency. Without it, we would need to provide some sort of TextStream, but as far as I understand, Python IO text stream would be binary with encoding/decoding added. I might be totally off here. Please lmk.
The optional encoding parameter would offer flexibility for different text encodings, making the class more versatile.

    @classmethod
    def from_text(cls, text: str, encoding: str = "utf-8") -> "BinaryStream":
        return cls(data=text.encode(encoding))

The introduction of BinaryStream provides a unified interface for various file converters in Haystack (and beyond), in addition to the existing str and Path. By allowing each converter to accept a BinaryStream we simplify the data handling process across different converters.

How would this work?

For converters that deal with text data, they can confidently decode the data from the BinaryStream, knowing that it encapsulates binary representations of text. For converters that handle formats like PDF, they can directly access the raw bytes from the BinaryStream without any additional decoding.
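
A rough sketch of how the two kinds of converters described above might consume the same type. `BinaryStream` mirrors the `from_text` suggestion from this comment; `text_converter` and `looks_like_pdf` are hypothetical helpers written only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryStream:
    data: bytes

    @classmethod
    def from_text(cls, text: str, encoding: str = "utf-8") -> "BinaryStream":
        return cls(data=text.encode(encoding))


def text_converter(stream: BinaryStream, encoding: str = "utf-8") -> str:
    # A text-oriented converter decodes the bytes back into a string.
    return stream.data.decode(encoding)


def looks_like_pdf(stream: BinaryStream) -> bool:
    # A binary-oriented converter reads the raw bytes directly, no decoding;
    # here it only checks the PDF header marker.
    return stream.data.startswith(b"%PDF-")


round_tripped = text_converter(BinaryStream.from_text("héllo"))
```

Note that the encoding has to be agreed on by both sides; this sketch simply defaults to UTF-8 in both places.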

vblagoje · 2023-09-28T16:36:38Z

Or perhaps ByteStream 🤷‍♀️

vblagoje

Looks solid, perhaps rename the file and add a single test per "new" method as well?

vblagoje · 2023-10-04T10:40:05Z

All seems good, but I haven't tried setting the mime-type (content-type) in metadata yet. We'll need the mime-type to be set somehow and accessed in #5965, and I'm not sure whether I'll get a FrozenInstanceError.

vblagoje

Ok, no frozen instance issues when we assign key/value pairs in metadata. Looks gtg.
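
Why this works can be shown with a small sketch, assuming the dataclass is declared with `frozen=True` (which the FrozenInstanceError concern suggests, but which is not shown in this thread). A frozen dataclass only blocks rebinding its attributes; the dict stored in `metadata` remains mutable. The `content_type` key is a hypothetical example.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)  # assumption: the real class is frozen
class ByteStream:
    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)


stream = ByteStream(data=b"%PDF-1.7")

# Mutating the dict in place is allowed: no attribute is being reassigned.
stream.metadata["content_type"] = "application/pdf"

# Rebinding the attribute itself is what a frozen dataclass rejects.
try:
    stream.metadata = {}
    rebinding_failed = False
except FrozenInstanceError:
    rebinding_failed = True
```

Because `metadata` is declared with `hash=False`, changing it in place also leaves the instance's hash unaffected.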

vblagoje · 2023-10-04T12:23:11Z

@masci one sec, what happened to the from_bytes class method? Aha, never mind, one can use init directly.

ZanSara · 2023-10-04T14:10:29Z

haystack/preview/dataclasses/byte_stream.py

+    data: bytes
+    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
+
+    def to_file(self, destination_path: Path):


Not blocking this PR of course, but why do we have a to_file method and not a to_string method? Is there a real issue, or is it just that we don't expect it to be used?

masci · 2023-10-04

The latter: to_file was inspired by Tika working only on files, and to_string doesn't have a real application so far.

add Blob type

e3be245

masci added the 2.x Related to Haystack v2.0 label Sep 25, 2023

masci requested a review from vblagoje September 25, 2023 14:54

masci requested review from a team as code owners September 25, 2023 14:54

masci requested review from dfokina and julian-risch and removed request for a team September 25, 2023 14:54

github-actions bot added topic:tests type:documentation Improvements on the docs labels Sep 25, 2023

Merge branch 'main' into massi/filetype

4a66813

julian-risch approved these changes Sep 28, 2023

review feedback

531301e

vblagoje approved these changes Oct 3, 2023

masci added 2 commits October 3, 2023 16:13

fix tests and naming

d633a1c

Update add-blob-type-2a9476a39841f54d.yaml

b4c2b8f

vblagoje mentioned this pull request Oct 4, 2023

Enhance and Rename FileExtensionRouter to FileTypeRouter in Haystack 2.0 #5965

Closed

removed unused import

4a87e59

masci requested a review from vblagoje October 4, 2023 10:25

vblagoje approved these changes Oct 4, 2023

ZanSara reviewed Oct 4, 2023

masci merged commit c2ec3f5 into main Oct 4, 2023
20 checks passed

masci deleted the massi/filetype branch October 4, 2023 15:23
