[24.1] Display binary data even if text data is expected #18502

Closed

mvdbeek
Member

@mvdbeek mvdbeek commented Jul 5, 2024

I've been going back and forth on whether we should raise an exception here. The advantage of displaying the data anyway is that you can often tell the correct datatype from the binary content (e.g. BAM files assigned as fastqsanger.gz will start with "BAM"). We could also raise a message exception that just says "this isn't text data, check your datatype" ... I don't know which is better, but this is easier.

Fixes
https://sentry.galaxyproject.org/share/issue/a8843884527f4e4089b32fd14a2f126d/:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 4: invalid start byte
  File "galaxy/web/framework/middleware/error.py", line 167, in __call__
    app_iter = self.application(environ, sr_checker)
  File "galaxy/web/framework/middleware/statsd.py", line 29, in __call__
    req = self.application(environ, start_response)
  File "/cvmfs/main.galaxyproject.org/venv/lib/python3.11/site-packages/paste/httpexceptions.py", line 635, in __call__
    return self.application(environ, start_response)
  File "galaxy/web/framework/base.py", line 174, in __call__
    return self.handle_request(request_id, path_info, environ, start_response)
  File "galaxy/web/framework/base.py", line 263, in handle_request
    body = method(trans, **kwargs)
  File "galaxy/webapps/galaxy/controllers/dataset.py", line 152, in display
    display_data, headers = data.datatype.display_data(
  File "galaxy/datatypes/sequence.py", line 785, in display_data
    "/dataset/large_file.mako", truncated_data=fh.read(max_peek_size), data=dataset
  File "<frozen codecs>", line 322, in decode

The dataset in question is a BAM file that was assigned the fastqsanger.gz datatype.
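
To illustrate the point about recognizing the real datatype from the leading bytes, here is a minimal sketch. It is not Galaxy code: `looks_like_bam` is a hypothetical helper, and it assumes the input is gzip-compatible (BAM uses BGZF, which is) and that a decompressed BAM stream starts with the magic bytes `BAM\x01`.

```python
import gzip
import io
import zlib

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM stream


def looks_like_bam(raw: bytes) -> bool:
    """Hypothetical helper: True if gzip-compressed bytes decompress to a BAM header."""
    try:
        with gzip.open(io.BytesIO(raw), "rb") as fh:
            return fh.read(len(BAM_MAGIC)) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip data, or a truncated/corrupt stream: treat as "not BAM"
        return False
```

A check like this could in principle back a more specific error message ("this looks like BAM, check your datatype"), though whether that is worth the extra code is exactly the open question in this PR.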


  cursor = f.read(1)
- while cursor and cursor != "\n":
+ while cursor and cursor != b"\n":
Contributor

Reading single chars until you hit a newline was never ideal, but with a binary file this could potentially take a really long time, couldn't it?

Member Author

Argh, yes, we're not reading from the chunk ...
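
One way to address the concern above is to scan for the newline in blocks and give up after a fixed budget, so a binary file with no newline cannot stall the request. This is only a sketch under stated assumptions: `read_to_line_end`, `max_extra` and `block_size` are hypothetical names rather than anything in Galaxy, and the handle is assumed to be seekable and opened in binary mode.

```python
import io


def read_to_line_end(fh, max_extra: int = 1024 * 1024, block_size: int = 8192) -> bytes:
    """Read up to and including the next newline, stopping after max_extra bytes."""
    collected = bytearray()
    while len(collected) < max_extra:
        block = fh.read(min(block_size, max_extra - len(collected)))
        if not block:
            break  # end of file
        newline_at = block.find(b"\n")
        if newline_at != -1:
            collected += block[: newline_at + 1]
            # Rewind so the handle sits just after the newline
            fh.seek(newline_at + 1 - len(block), io.SEEK_CUR)
            break
        collected += block
    return bytes(collected)
```

Compared with `f.read(1)` in a loop, this does one read call per block rather than one per byte, and the `max_extra` cap bounds the work when no newline is ever found.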

@wm75
Contributor

wm75 commented Jul 5, 2024

Would it perhaps be better to catch the error and handle it via a common text datatype fallback?

@mvdbeek
Member Author

mvdbeek commented Jul 5, 2024

> via a common text datatype fallback?

I'm not sure what you mean by that? Yes, it's probably better to raise an exception.

@mvdbeek mvdbeek closed this Jul 5, 2024
@wm75
Contributor

wm75 commented Jul 5, 2024

I meant catching the exception, then re-opening as binary file and providing a generic chunk of data as preview.
Raising the exception is also fine though.
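
The fallback described above could look roughly like the following. It is a sketch, not the Galaxy implementation: `preview` is a hypothetical function, it reads the bytes once instead of literally re-opening the file, and the escaped rendering is just one possible "generic chunk of data".

```python
import codecs
import io


def preview(fh, max_peek_size: int = 1_000_000) -> tuple[str, bool]:
    """Return (preview_text, is_binary) for the first max_peek_size bytes of a binary handle."""
    data = fh.read(max_peek_size)
    try:
        # final=False tolerates a multi-byte character cut off at the peek boundary
        decoder = codecs.getincrementaldecoder("utf-8")()
        return decoder.decode(data, final=False), False
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a generic escaped rendering of the bytes
        return data.decode("utf-8", errors="backslashreplace"), True


# Example: a decompressed BAM header is flagged as binary but still previewable
text, is_binary = preview(io.BytesIO(b"BAM\x01\xff\xff\xff\xff"))
```

With this approach the leading "BAM" stays visible in the preview, which keeps the diagnostic benefit mentioned in the PR description while avoiding the unhandled `UnicodeDecodeError`.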
