Dask Distributed Write Fix For Zarr #1079
Conversation
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1079      +/-   ##
==========================================
- Coverage   82.70%   82.68%   -0.02%
==========================================
  Files          36       36
  Lines        5116     5128      +12
==========================================
+ Hits         4231     4240       +9
- Misses        885      888       +3
Is there a way to test this?
@flying-sheep added tests, and it now fails for h5py :D. I thought I could just go with whatever xarray did at https://github.com/pydata/xarray/pull/1179/files, but now I realize this was from 2017, so I will need to investigate a bit more closely.
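For illustration, a hedged sketch of what such a test could look like (this is not the PR's actual test code, and it assumes anndata accepts a dask array as `X`): spin up a `dask.distributed` client with worker processes, so the zarr write has to cross process boundaries, then call `write_zarr`.

```python
import dask.array as da
from dask.distributed import Client
from anndata import AnnData


def test_distributed_zarr_write(tmp_path):
    # X as a chunked dask array, so write_zarr has to store chunks computed on workers.
    adata = AnnData(X=da.random.random((100, 50), chunks=(10, 50)))
    # processes=True forces tasks (and any lock they carry) to be pickled
    # between processes, which is exactly what breaks without a serializable lock.
    with Client(processes=True, n_workers=2, threads_per_worker=1):
        adata.write_zarr(tmp_path / "adata.zarr")
```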
Hey @flying-sheep @ilan-gold, it should be ready for review once this test passes. I ended up not adding h5ad support for distributed writing and just raising an error instead. To fix that we would need to create a file manager class that tracks the locks of a file, etc. Xarray has a magic ability to serialize h5py datasets, but how it does so lives inside its file manager class: pydata/xarray#4242. There is also a module called h5pickle for this purpose, but it hasn't been updated since 2020: https://github.com/DaanVanVugt/h5pickle
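For context, a minimal sketch of the "file manager" idea mentioned above (mirroring what xarray's CachingFileManager and the h5pickle package do, not anndata's actual API; the class name is illustrative): pickle only the path and mode rather than the open h5py handle, and reopen lazily on the receiving worker.

```python
import h5py


class SerializableH5File:
    """Wrap an h5py.File so it can cross process boundaries.

    Only the path and mode are pickled; the handle is reopened lazily
    on whichever worker receives the object.
    """

    def __init__(self, path, mode="r"):
        self.path = path
        self.mode = mode
        self._file = None

    @property
    def file(self):
        # Open (or reopen) the handle on first access in this process.
        if self._file is None:
            self._file = h5py.File(self.path, self.mode)
        return self._file

    def __getstate__(self):
        # Drop the unpicklable h5py handle; keep only what is needed to reopen.
        return {"path": self.path, "mode": self.mode}

    def __setstate__(self, state):
        self.path = state["path"]
        self.mode = state["mode"]
        self._file = None
```

A real implementation would also have to track open handles and locks per file (as the comment above notes), which is what xarray's file manager adds on top of this.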
Co-authored-by: Philipp A. <[email protected]>
Hey @flying-sheep, could you perhaps merge this? I created an issue for the h5py version here: #1105. Until I create a separate PR for that, I would like to create a notebook for concat_on_disk, which uses distributed writing for its tests.
@flying-sheep One sec, taking a look with Selman.
Alright, I think we can merge, but I want to follow up. Looking through xarray, it doesn't seem like they use this kind of solution for zarr, though they do for hdf5. Here, however, we are using this solution for zarr and it does not work for hdf5. This is a bit concerning. I vaguely understand why this isn't working for us with hdf5; xarray has some handling for this, as mentioned above. I do not understand what they are doing such that they don't need this for zarr. I will go ask the xarray/zarr people to take a look and see if they can suggest a better solution.
Hi,
This issue appears when one tries to use dask distributed and then write: dask/distributed#780.
I did what xarray does for this: https://github.com/pydata/xarray/pull/1179/files.
Not sure why the lock is global, though, for now.
Ping @ivirshup @flying-sheep
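For context, a minimal sketch of the xarray-style approach referenced above (the names `write_dask_to_zarr` and `GLOBAL_WRITE_LOCK` are illustrative, not anndata's internals): pass a `dask.utils.SerializableLock` to `da.store`, since a plain `threading.Lock` cannot be pickled and shipped to distributed workers.

```python
import dask.array as da
import zarr
from dask.utils import SerializableLock

# A module-level ("global") lock, mirroring the global lock in the xarray PR.
GLOBAL_WRITE_LOCK = SerializableLock()


def write_dask_to_zarr(group: zarr.Group, key: str, value: da.Array) -> None:
    # Assumes uniform chunking; zarr wants a single chunk shape per dataset.
    dest = group.create_dataset(
        key,
        shape=value.shape,
        dtype=value.dtype,
        chunks=tuple(c[0] for c in value.chunks),
    )
    # Unlike threading.Lock, SerializableLock survives pickling, so the store
    # tasks can be serialized by a distributed scheduler.
    da.store(value, dest, lock=GLOBAL_WRITE_LOCK)
```

With a multi-process distributed scheduler, a `distributed.Lock` could be substituted for cross-process exclusion; `SerializableLock` is the lock the referenced xarray PR introduces.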