feat: add File type to preview package #5873

Merged
merged 6 commits into from
Oct 4, 2023

Conversation

masci
Contributor

@masci masci commented Sep 25, 2023

Related Issues

Proposed Changes:

Introduce a new type that can be sent between components in a pipeline: ByteStream

How did you test it?

Unit tests

Notes for the reviewer

I started from #5856, but I couldn't really make a data class with two optional fields, path and blob, work. In the end, I think that was a code smell, so I iterated aiming at simplicity. The idea is that if you want to exchange file paths across components, you can just use List[Path]. Sending bytes around, on the other hand, does deserve a nice and practical abstraction, so I thought about going with ByteStream directly, to avoid implying this has anything to do with files. The type comes with 2 utility methods that I imagine users would need: one to save the ByteStream to a binary file, and another to create a ByteStream instance by reading a file on disk.

I think we should also add metadata, but for the initial discussion let's focus on something small. Update: metadata added.
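
For readers skimming the thread, here is a minimal sketch of the type described above, not the merged code. The `data` and `metadata` fields and the `to_file` signature match the snippet quoted later in this conversation; the `frozen=True` flag and the `from_file_path` name are assumptions made for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass(frozen=True)  # assumption: frozen, as the FrozenInstanceError remark later suggests
class ByteStream:
    """Raw bytes plus free-form metadata, passed between pipeline components."""

    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

    def to_file(self, destination_path: Path) -> None:
        # Write the raw bytes to a binary file.
        with open(destination_path, "wb") as fd:
            fd.write(self.data)

    @classmethod
    def from_file_path(cls, filepath: Path) -> "ByteStream":
        # Assumed method name: build an instance from the content of a file on disk.
        with open(filepath, "rb") as fd:
            return cls(data=fd.read())


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.bin"
    ByteStream(data=b"hello").to_file(target)
    restored = ByteStream.from_file_path(target)
```

Components that only need to exchange file locations can keep using `List[Path]`; a type like this is meant for the case where the bytes themselves travel through the pipeline.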

Checklist

@masci masci added the 2.x Related to Haystack v2.0 label Sep 25, 2023
@masci masci requested a review from vblagoje September 25, 2023 14:54
@masci masci requested review from a team as code owners September 25, 2023 14:54
@masci masci requested review from dfokina and julian-risch and removed request for a team September 25, 2023 14:54
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Sep 25, 2023
@ZanSara
Contributor

ZanSara commented Sep 25, 2023

A bit of a radical take here: how is this different from sending around the byte stream as it is?

My question comes because:

  1. The dataclass is not attaching any additional information to the stream (like type or any information that can help routing/processing the bytes)
  2. The two methods are very simple (arguably trivial) ways to handle the stream.
  3. Having this abstraction requires users to be aware of it if the components using it happen to be at the start of the pipeline (which is not impossible if we consider file converters), so it may add some complexity on the user's end.

The way I interpreted the original issue was to find a way to make paths and byte streams easy to handle in an agnostic way, without having to care whether we were dealing with a file or with a byte stream. This PR however doesn't cover this situation, so I'm unsure what it wants to accomplish.

@coveralls
Collaborator

coveralls commented Sep 25, 2023

Pull Request Test Coverage Report for Build 6302613835

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 49.89%

Totals Coverage Status
Change from base Build 6302282666: 0.02%
Covered Lines: 12226
Relevant Lines: 24506

💛 - Coveralls

@masci
Contributor Author

masci commented Sep 25, 2023

A bit of a radical take here: how is this different from sending around the byte stream as it is?

Metadata mainly, if the stream comes from reading a file, that could be easily tracked.

@ZanSara
Contributor

ZanSara commented Sep 25, 2023

Metadata mainly, if the stream comes from reading a file, that could be easily tracked.

But the dataclass has no trace of metadata being supported... It's about adding it later?

@masci
Contributor Author

masci commented Sep 25, 2023

Metadata mainly, if the stream comes from reading a file, that could be easily tracked.

But the dataclass has no trace of metadata being supported... It's about adding it later?

Yes, I mention it in the PR body but I see it backfired, I should've just added it

@ZanSara
Contributor

ZanSara commented Sep 26, 2023

🤦‍♀️ Sorry for that!

OK, with metadata this makes a ton more sense. One more question comes up though: we would be able to send metadata around with bytes, but not with paths. Is the plan to add something similar for paths later, or to include Paths in this dataclass somehow? From the description it sounds like this will be "reserved" for bytestreams only. And imho it can make signatures a bit confusing if they accept both bytestreams and paths, because you will need to send metadata separately, but for the paths only...

@vblagoje
Member

vblagoje commented Sep 26, 2023

@ZanSara @masci should we have content/mime type as a Blob property? If LinkContentFetcher is returning a Blob (without content type) how would it be routed to the appropriate ToDocument converter component?

@masci
Contributor Author

masci commented Sep 26, 2023

@ZanSara @masci should we have content/mime type as a Blob property? If LinkContentFetcher is returning a Blob (without content type) how would it be routed to the appropriate ToDocument converter component?

I don't think we need content type for this: LinkContentFetcher knows the content type and would return the Blob instance in the proper output, say

    @component.output_types(json=Optional[Blob], pdf=Optional[Blob])
    def run(self, url: str):
        try:
            response = self._get_response(url)
            content_type = self._get_content_type(response)
            if content_type == "application/pdf":
                return {"pdf": Blob.from_bytes(response.content)}
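
The snippet above is cut off, so here is a small standalone sketch of the same routing idea: pick the output name from the content type and return the payload under that name. It does not use Haystack at all; `OUTPUT_BY_CONTENT_TYPE` and `route_payload` are hypothetical names, and plain `bytes` stands in for the Blob type.

```python
from typing import Dict

# Hypothetical mapping from a content type to the name of a component output.
OUTPUT_BY_CONTENT_TYPE: Dict[str, str] = {
    "application/pdf": "pdf",
    "application/json": "json",
}


def route_payload(content_type: str, payload: bytes) -> Dict[str, bytes]:
    """Return the payload under the output name matching its content type.

    Unknown content types yield an empty dict.
    """
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    output_name = OUTPUT_BY_CONTENT_TYPE.get(media_type)
    if output_name is None:
        return {}
    return {output_name: payload}
```

With this shape, the receiving converter never has to inspect the bytes: the output it is connected to already tells it what it is getting.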

@vblagoje
Member

        if content_type == "application/pdf":
            return {"pdf": Blob.from_bytes(response.content)}

Right right. Now I remember. Very cool 🚀

Member

@julian-risch julian-risch left a comment

Let's merge this and try it out further by creating more examples! 👍 I don't see a reason to block this. It's also easy to revert. The example you shared helped me to understand how you envision this to be used. Let's have more of those rather than theoretical discussions. Looking forward to the feedback from @vblagoje and his experiments with WebRetriever. Metadata would be great to add next so that we can keep track of the source where preprocessed Documents originated from.

@masci
Contributor Author

masci commented Sep 28, 2023

Ok let me add metadata support and we can merge, still easy to revert / evolve!

@vblagoje
Member

vblagoje commented Sep 28, 2023

Apologies for the delayed feedback; I wanted to play with this class in demos so that the feedback is relevant and hopefully constructive. In general, I like Blob, but I wanted to make a few suggestions:

1. Rename Blob to BinaryStream:

  • As we all know the term "Blob" is often associated with "Binary Large OBjects" in databases. While it implies the storage of binary data, it doesn't necessarily convey the operations or the handling of that data. IMHO, the name is a bit off.

  • BinaryStream is the best I could think of. Open to suggestions.

2. Add a from_bytes class method:

  • A method like from_bytes would allow for more flexibility in creating a BinaryStream object. Users might have raw bytes data in memory (e.g. from Response object) they want to convert into a BinaryStream without first saving it to a file. This method would facilitate that functionality.
@classmethod
def from_bytes(cls, data: io.BytesIO) -> "BinaryStream":
    return cls(data=data.read())

3. Add a from_text class method:

  • Providing a method to convert text into a BinaryStream object directly would be highly beneficial for users who deal with text data but want to abstract it as BinaryStream for signature consistency. Without it, we would need to provide some sort of TextStream, but as far as I understand, Python IO text stream would be binary with encoding/decoding added. I might be totally off here. Please lmk.

  • The optional encoding parameter would offer flexibility for different text encodings, making the class more versatile.

@classmethod
def from_text(cls, text: str, encoding: str = "utf-8") -> "BinaryStream":
    return cls(data=text.encode(encoding))

The introduction of BinaryStream provides a unified interface for various file converters in Haystack (and beyond), in addition to the existing str and Path. By allowing each converter to accept a BinaryStream we simplify the data handling process across different converters.

How would this work?

For converters that deal with text data, they can confidently decode the data from the BinaryStream, knowing that it encapsulates binary representations of text. For converters that handle formats like PDF, they can directly access the raw bytes from the BinaryStream without any additional decoding.
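
To make the two cases concrete, here is a rough sketch under the naming proposed in this comment. `BinaryStream.from_text` follows the suggestion above; `text_converter` and `pdf_converter` are hypothetical stand-ins for real converters, and passing the encoding as an argument is an assumption about how a text converter would know it.

```python
from dataclasses import dataclass


@dataclass
class BinaryStream:
    data: bytes

    @classmethod
    def from_text(cls, text: str, encoding: str = "utf-8") -> "BinaryStream":
        # Text is stored as its encoded bytes.
        return cls(data=text.encode(encoding))


def text_converter(stream: BinaryStream, encoding: str = "utf-8") -> str:
    # Hypothetical text converter: decodes the bytes back into a string.
    return stream.data.decode(encoding)


def pdf_converter(stream: BinaryStream) -> bytes:
    # Hypothetical PDF converter: works on the raw bytes, no decoding step.
    return stream.data
```

One thing this makes visible: the text converter has to use the same encoding that was used to build the stream, which is an argument for carrying it along with the bytes somehow.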

@vblagoje
Member

Or perhaps ByteStream 🤷‍♀️

Member

@vblagoje vblagoje left a comment

Looks solid, perhaps rename the file and add a single test per "new" method as well?

@masci masci requested a review from vblagoje October 4, 2023 10:25
@vblagoje
Member

vblagoje commented Oct 4, 2023

All seems good, but I haven't tried setting the mime type (content type) in metadata yet. We'll need the mime type to be set somehow and accessed in #5965.
Not sure whether I'll get a FrozenInstanceError.

Member

@vblagoje vblagoje left a comment

Ok, no frozen instance issues when we assign key/value pairs in metadata. Looks gtg.
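
For context, this is standard `dataclasses` behavior rather than anything specific to this PR: a frozen dataclass blocks attribute reassignment, but a dict held in a field can still be mutated in place. A minimal sketch, assuming the class is declared with `frozen=True` (the thread does not show the decorator) and using a hypothetical `content_type` key:

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)  # assumption: the decorator is not shown in this thread
class ByteStream:
    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)


stream = ByteStream(data=b"%PDF-1.7")

# Allowed: this mutates the existing dict, it does not rebind the attribute.
stream.metadata["content_type"] = "application/pdf"

# Not allowed: rebinding an attribute on a frozen instance raises.
try:
    stream.data = b"other"
    reassigned = True
except FrozenInstanceError:
    reassigned = False
```

Because `metadata` is declared with `hash=False`, it is left out of the generated `__hash__`, so instances stay hashable even though they hold a dict.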

@vblagoje
Member

vblagoje commented Oct 4, 2023

@masci one sec, what happened with from_bytes class method? Aha, nevermind, one can use init directly

data: bytes
metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

def to_file(self, destination_path: Path):
Contributor

Not blocking this PR of course, but why do we have a to_file method and not a to_string method? Is there a real issue, or is it just that we don't expect it to be used?

Contributor Author

The latter. to_file was inspired by Tika working only on files; to_string doesn't have a real application so far.

@masci masci merged commit c2ec3f5 into main Oct 4, 2023
20 checks passed
@masci masci deleted the massi/filetype branch October 4, 2023 15:23
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

Add file abstraction
6 participants