cat_file with start and end of gzipped file does not work. #512

racinmat · 2022-12-08T14:53:13Z

Reading a gzipped file via transcoding works with fs.open, but not with fs.cat_file.
Here is an example that uploads two files, one plaintext and one gzipped. Both files are read using open, and then using cat_file:

This part works:

fs = gcsfs.GCSFileSystem(project='a')
a_file = 'same_path/a_test'
a_file_gz = 'same_path/a_test.gz'
with fs.open(a_file, 'wb') as f:
    f.write(b'abcd')
with fs.open(a_file_gz, 'wb', compression='gzip', fixed_key_metadata={'content_encoding': 'gzip'}) as f:
    f.write(b'abcd')
with fs.open(a_file, 'rb') as f:
    assert f.read() == b'abcd'
with fs.open(a_file_gz, 'rb') as f:
    assert f.read() == b'abcd'
assert bytes(fs.cat_file(a_file, 1, 3)) == b'bc'

This errors out:

assert bytes(fs.cat_file(a_file_gz, 1, 3, fixed_key_metadata={'content_encoding': 'gzip'})) == b'bc'
assert bytes(fs.cat_file(a_file_gz, 1, 3)) == b'bc'

It throws this error:

self = <StreamReader e=ClientPayloadError("400, message='Can not decode content-encoding: gzip'")>
n = -1

    async def read(self, n: int = -1) -> bytes:
        if self._exception is not None:
>           raise self._exception
E           aiohttp.client_exceptions.ClientPayloadError: 400, message='Can not decode content-encoding: gzip'

C:\tools\miniconda3\envs\filesystem-py39\lib\site-packages\aiohttp\streams.py:349: ClientPayloadError

My guess is that it fails because the header is not passed in https://github.com/fsspec/gcsfs/blob/main/gcsfs/core.py#L859-L863


martindurant · 2022-12-08T14:57:45Z

with fs.open(a_file_gz, 'wb', compression='gzip', fixed_key_metadata={'content_encoding': 'gzip'}) as f:
    f.write(b'abcd')

This is not correct. The content encoding is not the same as the MIME type, which would be "application/gzip". If you want to use content encoding like this, then the appropriate compression is actually none.

I don't exactly follow what your code snippet is trying to achieve: what behaviour are you after?

racinmat · 2022-12-08T15:51:18Z

I am trying to use the gzip transcoding https://cloud.google.com/storage/docs/transcoding
The documentation there literally says Content-Encoding: gzip

The code I used properly encodes the data into gzip format. On Google Cloud it is stored in gzip-compressed form, and when I download it, it is automatically decompressed, as the documentation states. So I'm not sure why it's not correct when it's doing what it should.
When I look at the object in the bucket browser, it shows correct encoding.

martindurant · 2022-12-08T15:55:14Z

Yes, but in that case, fsspec must not attempt to decompress it, because the transport library (aiohttp) should have done it already. Also note that the size reported for the file might be wrong.

racinmat · 2022-12-08T15:58:49Z

I know, but AFAIK fsspec does not attempt to decompress it. The compression='gzip' is only used with 'wb', because GCP needs to receive the data already compressed and does not compress it on its own. During 'rb' there is no compression; I am just adding the header, because without it the code throws an error. And apparently it works, because fsspec correctly obtains the decompressed data.

racinmat · 2022-12-09T12:09:16Z

I found out that if I read the whole file, it works. The size of gzipped file is 24 bytes.

assert fs.read_range(a_file_gz, 0, 23) == b'abcd'

But when I read only part of the file, it does not work, and it looks like something tries to decode the partial data.
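
A possible workaround can be sketched in Python, assuming that a full-object read of the transcoded file returns the decompressed bytes (as the full read above did): fetch the whole object and slice it locally. The helper name `cat_range_via_full_read` and the stand-in class `FakeFS` are hypothetical and exist only so the sketch runs without GCS credentials; with gcsfs you would pass a real `GCSFileSystem` instead.

```python
def cat_range_via_full_read(fs, path, start, end):
    """Return bytes [start:end) of the decoded object at `path`.

    Hypothetical helper: it avoids a ranged request by reading the whole
    object and slicing in memory, so it only suits small files.
    """
    data = bytes(fs.cat_file(path))  # full read: no start/end passed
    return data[start:end]


class FakeFS:
    """Stand-in for a filesystem whose full read yields decoded content."""

    def cat_file(self, path, start=None, end=None, **kwargs):
        return b"abcd"


result = cat_range_via_full_read(FakeFS(), "same_path/a_test.gz", 1, 3)
print(result)
```

The trade-off is that the entire object is downloaded for every call, which is only reasonable for small files like the one in this report.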

racinmat · 2022-12-09T13:48:36Z

I found the problem: it's in the headers. I can replicate the error in curl:

curl --location --request GET 'https://storage.googleapis.com/download/storage/v1/b/our-temp/o/tmp_bong%2Fa_test.gz?alt=media' \
--header '... \
--header 'Range: bytes=1-5' \
--header 'Accept-Encoding: gzip, deflate, br'

errors out, but when using only

```bash
curl --location --request GET 'https://storage.googleapis.com/download/storage/v1/b/our-temp/o/tmp_bong%2Fa_test.gz?alt=media' \
--header '... \
--header 'Range: bytes=1-5' \
--header 'Accept-Encoding: deflate, br'
```

without the gzip, it works and returns the whole contents according to the docs.
And there is no way to pass some custom Accept-Encoding header to the underlying GET call.

martindurant · 2022-12-09T13:52:19Z

This is not unexpected. You can only get specific offsets within the bytestream after decompression; this is a limitation of gzip. I expect the server is really returning the byte range you request out of the original compressed data, but that is no longer a valid gzip stream, which causes the error.
If you save your data as gzip, you cannot expect random access to the uncompressed data.
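
The gzip limitation can be reproduced locally with the standard library alone. This is a minimal sketch, not a description of what GCS or aiohttp do internally: it assumes the server slices the stored compressed bytes, which is only the hypothesis stated above, and the inclusive range 1 to 5 mirrors the curl example.

```python
import gzip
import zlib

payload = b"abcd"
compressed = gzip.compress(payload)

# Decompressing the whole stream first and slicing afterwards works.
full_then_slice = gzip.decompress(compressed)[1:3]

# Slicing the compressed bytes first (like "Range: bytes=1-5") leaves a
# fragment without a valid gzip header, so decompression fails.
fragment = compressed[1:6]
try:
    gzip.decompress(fragment)
    fragment_decodes = True
except (OSError, EOFError, zlib.error):
    fragment_decodes = False

print(full_then_slice, fragment_decodes)
```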

racinmat · 2022-12-09T13:57:42Z

The server decompresses the data and returns the whole content. The GCP documentation I linked states that the whole file content is returned, decoded.

martindurant · 2022-12-09T14:04:37Z

Well, in the first place it says that you shouldn't ever do this; in the second that the header key will be ignored. And thirdly we have found that the documentation is incorrect. I don't think there's anything gcsfs can do about this.

racinmat changed the title ~~cat_file of gzipped file does not work.~~ cat_file with start and end of gzipped file does not work. Dec 9, 2022
