Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding reader_options kwargs to open_virtual_dataset. #67

Merged
merged 37 commits into from
May 14, 2024

Conversation

norlandrhagen
Copy link
Collaborator

@norlandrhagen norlandrhagen commented Mar 29, 2024

In a step to start reading remote files #61, this PR adds in reader_options to the open_virtual_dataset function.

These reader_options are passed into each Kerchunk file reader (SingleHdf5ToZarr, NetCDF3ToZarr, etc..) in read_kerchunk_references_from_file. Once open_virtual_dataset is replaced with the Xarray backend, we could pass them through: ds = xr.open_dataset(fp, engine='virtualizarr', backend_kwargs={'reader_options': {'storage_options': {'anon': True}}}).

This approach relies on the user knowing what options are available in each Kerchunk file reader.

This example should work off of this PR pointing to a public s3 bucket.

from virtualizarr import open_virtual_dataset
path = 's3://carbonplan-share/virtualizarr/local.nc'

vds = open_virtual_dataset(path)

edit: Tests are passing. Index generation and filetype guessing are now working.

@norlandrhagen norlandrhagen changed the title Adding reader_options kwargs to open_virtual_dataset. [Draft] Adding reader_options kwargs to open_virtual_dataset. Mar 29, 2024
Copy link

codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 16.66667% with 10 lines in your changes are missing coverage. Please review.

Project coverage is 74.31%. Comparing base (f226093) to head (4c6cb63).
Report is 20 commits behind head on main.

❗ Current head 4c6cb63 differs from pull request most recent head e6f047f. Consider uploading reports for the commit e6f047f to get more accurate results

Files Patch % Lines
virtualizarr/tests/test_xarray.py 25.00% 6 Missing ⚠️
virtualizarr/kerchunk.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #67       +/-   ##
===========================================
- Coverage   90.18%   74.31%   -15.87%     
===========================================
  Files          14       14               
  Lines         998      946       -52     
===========================================
- Hits          900      703      -197     
- Misses         98      243      +145     
Flag Coverage Δ
unittests 74.31% <16.66%> (-15.87%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TomNicholas
Copy link
Member

@norlandrhagen I've merged main in here because @jbusecke is using this branch for testing with CMIP6 data in #93. It would be great to get this PR into main!

@norlandrhagen norlandrhagen requested a review from TomNicholas May 3, 2024 21:30
virtualizarr/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @norlandrhagen ! I think this is a nice minimal addition to support reading from s3. I like how the fsspec stuff is generally kept separate too.

virtualizarr/utils.py Outdated Show resolved Hide resolved
virtualizarr/utils.py Outdated Show resolved Hide resolved
@@ -27,6 +28,7 @@ def open_virtual_dataset(
loadable_variables: Optional[Iterable[str]] = None,
indexes: Optional[Mapping[str, Index]] = None,
virtual_array_class=ManifestArray,
reader_options: Optional[dict] = {'storage_options':{'key':'', 'secret':'', 'anon':True}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you normally point xr.open_dataset at an S3 url, you don't need to pass reader_options do you? Can we try to follow the signature of xr.open_dataset as closely as possible? (Maybe this already is as close as we can get)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong, but I thought you had to pass in some sort of fsspec/s3fs mapper.

For me this fails:

ds = xr.open_dataset('s3://carbonplan-share/virtualizarr/local.nc')

virtualizarr/tests/test_xarray.py Outdated Show resolved Hide resolved
virtualizarr/tests/test_xarray.py Outdated Show resolved Hide resolved
@norlandrhagen
Copy link
Collaborator Author

Looks like the docs + CI builds are failing with:

Installing pip dependencies: ...working... Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/TomNicholas/xarray.git /tmp/pip-install-rmq_i38f/xarray_61220b7470e0456fa17857a7af71f04b
  WARNING: Did not find branch or tag 'concat-avoid-index-auto-creation', assuming revision or ref.
  Running command git checkout -q concat-avoid-index-auto-creation
  error: pathspec 'concat-avoid-index-auto-creation' did not match any file(s) known to git
  error: subprocess-exited-with-error

@TomNicholas should we pin another version of Xarray?

@TomNicholas
Copy link
Member

TomNicholas commented May 9, 2024

should we pin another version of Xarray?

pydata/xarray#8872 was merged yesterday so now we should be able to release xarray, remove the xarray pin in virtualizarr, then release the first version of virtualizarr!

EDIT: Tracking the xarray release pydata/xarray#9018

@norlandrhagen norlandrhagen reopened this May 9, 2024
@norlandrhagen norlandrhagen changed the title [Draft] Adding reader_options kwargs to open_virtual_dataset. Adding reader_options kwargs to open_virtual_dataset. May 9, 2024
@TomNicholas
Copy link
Member

This seems very close now @norlandrhagen ?

@norlandrhagen
Copy link
Collaborator Author

CI is passing now @TomNicholas!

@TomNicholas
Copy link
Member

TomNicholas commented May 14, 2024

Thanks @norlandrhagen ! One final request: Can we add a quick explanatory line to the docs? Something like

To open remote files as virtual datasets, pass the reader_kwargs options, e.g.

vds = open_virtual_dataset("s3://fake-bucket/file.nc", reader_kwargs={whatever would be needed})

This would go on the usage page, either as a quick entry underneath the first Opening files as virtual datasets heading, or under a new heading at the bottom.

docs/usage.md Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member

Thank you so much @norlandrhagen ! Will merge this now

@TomNicholas TomNicholas merged commit 8923b8c into main May 14, 2024
5 checks passed
@TomNicholas TomNicholas deleted the reader_options branch May 14, 2024 17:08
@jbusecke
Copy link
Contributor

Fantastic. Thanks so much. Ill refactor my code in the coming days.

@ayushnag ayushnag mentioned this pull request May 23, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants