Fix bug in RT of parquet detection #278

Merged
5 commits merged into main from RT_kerchunk_bug
Nov 4, 2024

Conversation

norlandrhagen
Collaborator

No description provided.


codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.92%. Comparing base (fffdc2d) to head (56cb1c4).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| virtualizarr/readers/kerchunk.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #278   +/-   ##
=======================================
  Coverage   67.92%   67.92%           
=======================================
  Files          41       41           
  Lines        2516     2516           
=======================================
  Hits         1709     1709           
  Misses        807      807           
| Flag | Coverage Δ |
|---|---|
| unittests | 67.92% <0.00%> (ø) |


@norlandrhagen
Collaborator Author

Typing: #274

@@ -38,7 +38,7 @@ def open_virtual_dataset(
     fs = _FsspecFSFromFilepath(filepath=filepath, reader_options=reader_options)

     # The kerchunk .parquet storage format isn't actually a parquet, but a directory that contains named parquets for each group/variable.
-    if fs.filepath.endswith("ref.parquet"):
+    if fs.filepath.endswith(".parquet"):
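To make the effect of the one-line change above concrete, here is a minimal sketch comparing the old and new suffix checks on a few paths. The function names `old_check` and `new_check` and the example paths are made up for illustration; only the two `endswith` expressions come from the diff.

```python
def old_check(filepath: str) -> bool:
    # Condition before this PR: only matches names ending in "ref.parquet".
    return filepath.endswith("ref.parquet")


def new_check(filepath: str) -> bool:
    # Condition after this PR: matches any name ending in ".parquet".
    return filepath.endswith(".parquet")


# Hypothetical reference paths, not taken from the PR.
example_paths = ["ref.parquet", "refs.parquet", "combined.parquet", "refs.json"]

for path in example_paths:
    print(path, "old:", old_check(path), "new:", new_check(path))
```

With the old condition, a reference directory named `refs.parquet` or `combined.parquet` would not be recognised, because neither name ends in the literal string `ref.parquet`.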
Contributor

@keewis keewis Oct 29, 2024


according to fsspec/kerchunk#519 (comment), you could also check for the presence of f"{fs.filepath}/.zmetadata", which should avoid trying to open actual parquet files – the only downside is an additional request, but since we already check magic bytes elsewhere I don't think that's a concern:

Suggested change
-    if fs.filepath.endswith(".parquet"):
+    if fs.filepath.endswith(".parquet") and fs.fs.isfile(f"{fs.filepath}/.zmetadata"):

(or define isfile / exists on _FsspecFSFromFilepath and use that instead of fs.fs.isfile)

Also, do we want to support .parq, as well, or do we expect people to always end the filename with .parquet? If we do want to support .parq, you can use fs.filepath.endswith((".parquet", ".parq")) (not that I'd need that myself).

cc @martindurant
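The two ideas in this comment (requiring a `.zmetadata` file inside the directory, and passing a tuple of suffixes to `endswith`) can be sketched together as below. This is only an illustration under stated assumptions: `looks_like_kerchunk_parquet` is a hypothetical helper, not part of VirtualiZarr, and it uses `os.path.isfile` on local paths as a stand-in for the `fs.fs.isfile` call in the suggestion so that it runs without fsspec.

```python
import os
import tempfile
from typing import Callable, Tuple


def looks_like_kerchunk_parquet(
    filepath: str,
    isfile: Callable[[str], bool] = os.path.isfile,
    suffixes: Tuple[str, ...] = (".parquet",),
) -> bool:
    """Hypothetical helper: does `filepath` look like a kerchunk parquet reference directory?"""
    # Assumption: tolerate a trailing slash on directory paths.
    path = filepath.rstrip("/")
    # str.endswith accepts a tuple, so extra suffixes such as ".parq" are opt-in.
    # The .zmetadata check costs one extra lookup but rules out plain parquet files.
    return path.endswith(suffixes) and isfile(f"{path}/.zmetadata")


with tempfile.TemporaryDirectory() as tmp:
    # A directory laid out like a kerchunk parquet reference store.
    refs_dir = os.path.join(tmp, "refs.parquet")
    os.makedirs(refs_dir)
    open(os.path.join(refs_dir, ".zmetadata"), "w").close()

    # An ordinary file that merely has a .parquet suffix.
    plain_file = os.path.join(tmp, "table.parquet")
    open(plain_file, "w").close()

    is_refs_dir = looks_like_kerchunk_parquet(refs_dir)
    is_plain_file = looks_like_kerchunk_parquet(plain_file)

print("reference directory:", is_refs_dir)
print("plain parquet file:", is_plain_file)
```

Passing `isfile` as a parameter keeps the sketch independent of any particular filesystem; with fsspec, the filesystem object's own `isfile` method could be passed in its place.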

Collaborator Author

Thanks @keewis!

> Also, do we want to support .parq, as well, or do we expect people to always end the filename with .parquet? If we do want to support .parq, you can use fs.filepath.endswith((".parquet", ".parq")) (not that I'd need that myself).

I don't have really strong opinions on this. Adding more file name guessing doesn't feel great to me. I could see people naming files with a .pqt suffix since it's really just a directory name we're looking at. Maybe we just stick with .parquet and mention in the docs and/or in the ValueError?

cc @TomNicholas

Contributor

fine with me!

Member

I don't have much of an opinion. Explicit is better than guessing, but clear documentation and error message either way.

Collaborator Author

Thanks both! I added a bit to the error message.
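For readers following along, the outcome of this thread (stick with `.parquet` only, and say so in the error) might look roughly like the sketch below. The function name `pick_kerchunk_reader`, the `.json` branch, and the wording of the message are all assumptions for illustration; the actual message added in this PR is not quoted in the conversation.

```python
def pick_kerchunk_reader(filepath: str) -> str:
    """Hypothetical dispatcher: name the reader to use for a kerchunk reference path."""
    if filepath.endswith(".parquet"):
        return "parquet"
    if filepath.endswith(".json"):  # assumed second format, not shown in this PR
        return "json"
    # Spell out the naming convention so users are not left guessing about suffixes.
    raise ValueError(
        f"Could not tell which kerchunk reference format {filepath!r} uses. "
        "Expected a '.json' file or a directory whose name ends in '.parquet'; "
        "other suffixes such as '.parq' are not recognised."
    )


try:
    pick_kerchunk_reader("refs.parq")
except ValueError as err:
    error_text = str(err)
    print(error_text)
```

The point of the design is that the check stays explicit, and the error message carries the documentation burden for anyone who names the directory differently.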

Co-authored-by: Tom Nicholas <[email protected]>
@TomNicholas TomNicholas merged commit ab23caa into main Nov 4, 2024
10 of 11 checks passed
@TomNicholas TomNicholas deleted the RT_kerchunk_bug branch November 4, 2024 17:54
@norlandrhagen norlandrhagen mentioned this pull request Dec 9, 2024
7 tasks

4 participants