Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask Distributed Write Fix For Zarr #1079

Merged
merged 15 commits into from
Aug 25, 2023

Conversation

selmanozleyen
Copy link
Member

Hi,
This issue appears when one tries to use dask distributed then write dask/distributed#780.
I did what xarray does for this https://github.com/pydata/xarray/pull/1179/files.

Not sure why the lock is global though for now.

Ping @ivirshup @flying-sheep

@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Merging #1079 (e5cd4a3) into main (5428f16) will decrease coverage by 0.02%.
The diff coverage is 84.61%.

❗ Current head e5cd4a3 differs from pull request most recent head 75dec29. Consider uploading reports for the commit 75dec29 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1079      +/-   ##
==========================================
- Coverage   82.70%   82.68%   -0.02%     
==========================================
  Files          36       36              
  Lines        5116     5128      +12     
==========================================
+ Hits         4231     4240       +9     
- Misses        885      888       +3     
Files Changed Coverage Δ
anndata/_io/specs/methods.py 87.46% <84.61%> (-0.13%) ⬇️

... and 1 file with indirect coverage changes

@flying-sheep
Copy link
Member

is there a way to test this?

@selmanozleyen
Copy link
Member Author

@flying-sheep added tests, and it now fails for h5py :D. I thought I could just go with whatever xarray did at https://github.com/pydata/xarray/pull/1179/files. But now I realize this was from 2017 so I will need investigate a bit more closely.

@selmanozleyen selmanozleyen marked this pull request as draft July 31, 2023 09:04
@selmanozleyen selmanozleyen marked this pull request as ready for review August 22, 2023 11:55
@selmanozleyen
Copy link
Member Author

selmanozleyen commented Aug 22, 2023

Hey @flying-sheep @ilan-gold, it should be ready for review once this test passes. I ended up not adding h5ad support for distributed writing and just giving an error. To fix we would need to create a file manager class that tracks the locks of a file etc. Xarray has a magic ability to serialize h5py datasets but how it does is inside the file manager class: pydata/xarray#4242. There is also a module called h5py pickle for this purpose but it isn't updated since 2020 https://github.com/DaanVanVugt/h5pickle

pyproject.toml Outdated Show resolved Hide resolved
anndata/tests/test_dask.py Outdated Show resolved Hide resolved
@selmanozleyen selmanozleyen changed the title Dask Distributed Write Fix Dask Distributed Write Fix For Zarr Aug 25, 2023
@selmanozleyen
Copy link
Member Author

Hey @flying-sheep could you perhaps merge this? I created an issue for h5py version here: #1105. But until I create a PR for that separately, I would like to create a notebook for concat_on_disk, which uses distributed writing for tests.

@flying-sheep flying-sheep added this to the 0.9.3 milestone Aug 25, 2023
@flying-sheep flying-sheep enabled auto-merge (squash) August 25, 2023 12:04
@ivirshup ivirshup disabled auto-merge August 25, 2023 12:09
@ivirshup
Copy link
Member

ivirshup commented Aug 25, 2023

@flying-sheep One sec, taking a look with selman

@ivirshup
Copy link
Member

Alright I think we can merge. But I want to follow up.

Looking through xarray, it doesn't seem like they use this kind of solution for zarr, but do for hdf5. However here, we are using this solution for zarr and it does not work for hdf5. This is a bit disconcerting and a little concerning.

I vaguely understand why this isn't working for us with hdf5, xarray has some handling for this as mentioned above.

I do not understand what the are doing as to not need this for zarr.

I will go ask xarray/ zarr people to take a look and see if they can suggest a better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants