
Removed dask pinning #570

Merged (25 commits), Jun 25, 2024
Conversation

@LucaMarconato (Member) commented May 24, 2024

Removing the dask pinning from pyproject.toml.

This PR also tests against Python 3.12.

@LucaMarconato (Member, Author) commented May 24, 2024

Tracking the main differences, one per comment. The ticked comments are resolved.

  • I changed the various `from dask.dataframe.core import DataFrame as DaskDataFrame` occurrences to `from dask.dataframe import DataFrame as DaskDataFrame`, because the two import locations give different classes and the second one is the correct one, as shown in the screenshot below.
[screenshot comparing the classes returned by the two import locations]
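For reference, a minimal before/after of the change (the DaskDataFrame alias matches what the codebase already uses):

# before: this import location can resolve to a different class (see the screenshot above)
# from dask.dataframe.core import DataFrame as DaskDataFrame

# after: the public import location, which gives the correct collection class
from dask.dataframe import DataFrame as DaskDataFrame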

@LucaMarconato (Member, Author) commented May 25, 2024

  • Changes required in spatialdata._io._utils._get_backing_files():
    • df.dask.layers (a dict) is no longer available for dask dataframes; df.dask is now directly the dict we need. x.dask.layers remains available for dask arrays (a sketch follows at the end of this comment).
    • when searching for a "read-parquet-" operation in the dask graph, the .parquet file path needs to be extracted from the graph in a different way.

Edit. Additional comments:

  • .layers disappeared for dask dataframes even when the dask-expr backend is disabled.
  • I have refactored the function that retrieves the dask-backing files (_get_backing_files()); it is now more robust.
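A rough sketch of the difference, as a best-effort illustration (attribute availability depends on the installed dask version):

import dask.array as da
import pandas as pd
from dask.dataframe import from_pandas

darr = da.zeros((4, 4), chunks=2)
ddf = from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), npartitions=1)

# dask arrays still expose a graph with named layers
array_layer_names = list(darr.dask.layers)

# for dask dataframes .layers is gone; materializing the graph as a plain dict
# gives the task mapping directly (this also works when .dask is already a dict)
dataframe_task_keys = list(dict(ddf.dask))

print(array_layer_names)
print(dataframe_task_keys[:3])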

@LucaMarconato (Member, Author) commented May 25, 2024

Let's wait and see whether there is an upstream solution for the two open points above; I think that would fix most, if not all, of the failing tests.

@giovp (Member) commented Jun 19, 2024

Thanks @LucaMarconato! I wasn't aware of dask-expr removing the .attrs attribute; luckily, it looks like they might reconsider?

@LucaMarconato (Member, Author) commented:

In pandas .attrs is meant to stay: pandas-dev/pandas#52166 (comment).

I am working on a PR to restore .attrs in dask-expr; it is still WIP (a bunch of tests don't pass): dask/dask-expr@main...LucaMarconato:dask-expr:support_attrs.

@LucaMarconato (Member, Author) commented:

  • in _get_backing_files(), for DaskDataFrame, element.dask.items() now iterates over a mapping whose keys are tuple[str, int] instead of str as before. Fixed (see the sketch below).
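A hedged sketch of handling both key shapes (the helper name here is hypothetical, not the actual code in _get_backing_files()):

from typing import Any, Iterator

def _iter_task_names(dask_graph: Any) -> Iterator[str]:
    """Yield task/layer names from a Dask graph, tolerating both key shapes.

    Keys used to be plain strings; on recent dask versions they can be tuples
    such as ("read-parquet-<token>", 0), in which case the name is the first element.
    """
    for key in dict(dask_graph):
        yield key[0] if isinstance(key, tuple) else key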

@LucaMarconato (Member, Author) commented:

  • copy.deepcopy used to work on dask.dataframe.DataFrame; now it produces broken objects (for instance, calling print on a deepcopied object fails). The deepcopy from `from spatialdata import deepcopy` works, so I replaced the only occurrence where we were using copy.deepcopy with spatialdata.deepcopy (see the sketch below).
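A minimal sketch of the replacement; the points element construction via PointsModel.parse is just for illustration:

import pandas as pd

from spatialdata import deepcopy  # the element-aware deepcopy mentioned above
from spatialdata.models import PointsModel

points = PointsModel.parse(pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 2.0]}))

# copy.deepcopy(points) now yields a broken object (e.g. printing it fails);
# spatialdata.deepcopy keeps the Dask-backed element usable
copied = deepcopy(points)
print(copied)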

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 9 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (b8fbc5c) to head (82aba6c).
Report is 59 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/_io/_utils.py 84.21% 6 Missing ⚠️
src/spatialdata/__init__.py 80.00% 1 Missing ⚠️
src/spatialdata/_core/operations/transform.py 80.00% 1 Missing ⚠️
src/spatialdata/transformations/_utils.py 96.42% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #570      +/-   ##
==========================================
- Coverage   91.94%   91.91%   -0.04%     
==========================================
  Files          44       44              
  Lines        6608     6641      +33     
==========================================
+ Hits         6076     6104      +28     
- Misses        532      537       +5     
Files with missing lines Coverage Δ
src/spatialdata/_core/_deepcopy.py 98.38% <100.00%> (ø)
src/spatialdata/_core/_elements.py 91.95% <100.00%> (ø)
src/spatialdata/_core/centroids.py 100.00% <100.00%> (ø)
src/spatialdata/_core/data_extent.py 97.93% <100.00%> (ø)
src/spatialdata/_core/operations/_utils.py 92.30% <ø> (ø)
src/spatialdata/_core/operations/aggregate.py 94.38% <100.00%> (ø)
src/spatialdata/_core/operations/rasterize.py 90.49% <100.00%> (ø)
src/spatialdata/_core/operations/rasterize_bins.py 89.15% <100.00%> (ø)
src/spatialdata/_core/query/relational_query.py 90.82% <100.00%> (ø)
src/spatialdata/_core/query/spatial_query.py 95.17% <100.00%> (ø)
... and 7 more

... and 1 file with indirect coverage changes

@LucaMarconato marked this pull request as ready for review on June 24, 2024, 20:25
@LucaMarconato (Member, Author) commented:

Finally, this PR is ready for review! Could any of you please have a look, @giovp @melonora @kevinyamauchi?

This PR removes the pin on the dask dependency (the pin was causing installation problems).

A few important comments:

  • Dask's transition to dask-expr as the default backend for dask.dataframe.DataFrame had many implications for our codebase (listed above), leading to hundreds of failing tests. I have addressed all the problems except one (around .attrs; I discuss it below). Unfortunately I couldn't fix it, so for the moment I have disabled the dask-expr backend in __init__.py and conftest.py.
  • This means that some of the code changes in this PR only take effect when dask-expr is enabled. Nevertheless, I decided to keep them in this PR because all the tests pass anyway, and because this way fewer tests will remain to be fixed when we re-enable the dask-expr backend in the future.

@LucaMarconato (Member, Author) commented Jun 24, 2024

  • I have made _get_backing_files() more robust, in particular by adding references to the Dask codebase at the points where file paths are extracted from nodes of the Dask graph. Please let me know if you have any question on that and I can provide more details, but probably the best way is to run the code, set a breakpoint, and have a look at the Dask graph.

Further comments:

  • I found a weird bug with transformations (unrelated to this PR, but spotted during debugging); you can see the comment in the code:
    # TODO: the following line, used in place of the line before, leads to an incorrect aggregation result. Look into
  • I will open a separate issue to track it.

@LucaMarconato (Member, Author) commented Jun 24, 2024

My posts below are not needed for the review of this PR

df.attrs is not available anymore; I opened an issue here: dask/dask#11146.

Now some details regarding the behavior of .attrs that I didn't manage to address. I will write a preliminary explanation here, and I plan to give the problem a second stab in a few weeks and report it accurately to the Dask devs (edit, 1 month later: I discussed this at EuroSciPy with pandas devs; it would be good to try to fix this in pandas, but I won't have time now; edit, 5 months later: I didn't find the time for a second pass and ended up disabling the new optimizer by default. Getting back to this now because the old optimizer just got deprecated). CC @giovp @kevinyamauchi @melonora, and also @ivirshup.

  • Dask added support for .attrs with this small PR: https://github.com/dask/dask/pull/6742/files. With dask-expr the attribute has been removed (Dask 2024.5.1 removed .attrs, dask/dask#11146).
  • The reason it was removed is that in pandas the semantics of .attrs are not clearly defined, especially around copy-on-write (DEPR: attrs, pandas-dev/pandas#52166), and this made it difficult to handle in dask-expr.
  • Naively, I thought that re-adding support for it would be as simple as adding an extra attribute to a container class, but it is not, for the following reasons (pointed out by the core devs):
    • There is no stable container (called a collection in the codebase); instead, expressions are used to create collections on the fly in many parts of the code;
    • The underlying expression may change/reshuffle when .optimize() is called, and this function is called in several places, in particular in .compute().

I still thought that one could simply store the .attrs inside an expression and, when .optimize() is called, pick up any of the .attrs available (most of the time there would be just one), but this is not possible because of how the class Expr is designed, as I will now explain.

The class Expr (https://github.com/dask/dask-expr/blob/main/dask_expr/_core.py#L45), if I understood it correctly, uses a class variable _instances that stores references to expressions, indexing them by a hash (see the _name property: https://github.com/dask/dask-expr/blob/cb121cddb7fd4682a232ccfbf3927185d4f7465b/dask_expr/_core.py#L458). In dask, .attrs is not used to compute this hash, which leads to the key point:

if I initialize a Dask dataframe by calling from_pandas(), set its .attrs to something, and then call from_pandas() again on a copy of the pandas dataframe, the newly returned Dask dataframe will have the same .attrs as the first object!
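A hedged sketch of what this looks like in practice (it assumes the patched dask/dask-expr branches linked in this thread, since on released dask-expr .attrs is simply absent):

import pandas as pd
from dask.dataframe import from_pandas

pdf = pd.DataFrame({"x": [1, 2, 3]})

ddf1 = from_pandas(pdf, npartitions=1)
ddf1.attrs["transform"] = {"global": "some transformation"}

# same data, second call: the expression hash does not include .attrs, so the
# cached Expr instance (and with it the .attrs set above) can be reused for ddf2
ddf2 = from_pandas(pdf.copy(), npartitions=1)
print(ddf2.attrs)  # unexpectedly mirrors ddf1.attrs instead of being empty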

The implication is that when .optimize() is called one can't simply pick up whichever .attrs is found in the new optimized graph, because the new dask graph may have subcomponents that already existed in Expr._instances (referring to old objects). In practice, when I tried this in the spatialdata codebase I was getting dask dataframes (i.e. points elements) with transformations belonging to entirely distinct objects.

I tried a bunch of experiments; here are some of them: changes in the dask repository to include .attrs in the computation of the hashes for Expr._instances, and changes in the dask-expr repository to try to re-enable .attrs in dask-expr.

The tests that I created pass (they should live in dask, not in dask-expr, but it was convenient to have them there), but some "real-world" tests in spatialdata still don't pass, because sometimes the .attrs get shared across different objects. This should not happen, because I changed the hashes to include .attrs, but maybe in pandas the .attrs is sometimes passed by reference instead of being copied, and this leads to computing a hash with the .attrs of another object.

@LucaMarconato (Member, Author) commented:

Here is some code that shows the problem (it requires the dask and dask-expr branches that I linked above, this PR, and removing the lines of code that disable the dask-expr backend, located in conftest.py and __init__.py).

In a few weeks I will try to make a shorter code example, independent of spatialdata, so I can share it with the dask devs.

import dask.array as da
import pandas as pd
from dask.dataframe import from_pandas
from xarray import DataArray

from spatialdata import SpatialData
from spatialdata.transformations import Identity

# build a points element (a Dask dataframe) and attach a transformation via .attrs
df = pd.DataFrame({"x": [1, 2, 3, 4, 1, 2], "y": [1, 2, 3, 4, 1, 2]})
ddf = from_pandas(df, npartitions=1)
ddf.attrs["transform"] = {"transformed": Identity()}
axes = ["x", "y"]

sdata = SpatialData.init_from_elements({"ddf": ddf})
sdata_transformed = sdata.transform_to_coordinate_system("transformed")

# stack the coordinate columns into an xarray DataArray
arrays = []
for ax in axes:
    arrays.append(ddf[ax].to_dask_array(lengths=True).reshape(-1, 1))
xdata = DataArray(da.concatenate(arrays, axis=1), coords={"points": range(len(ddf)), "dim": list(axes)})
transformed = ddf.drop(columns=list(axes)).copy()

# the weird part starts here: the .attrs set on `transformed` gets lost while
# the coordinate columns are written back
transformed.attrs = {"transform": {"global": Identity()}}
for ax in axes:
    indices = xdata["dim"] == ax
    new_ax = xdata[:, indices]
    transformed[ax] = new_ax.data.flatten()

if "global" not in transformed.attrs["transform"]:
    print("bug")

@LucaMarconato (Member, Author) commented:

Finally, here is a sketch of what I think could be a solution:

  1. First, one should address the problems regarding .attrs in pandas; intuitively, one should mimic what happens with .columns (which is basically metadata living outside the dataframe body). Whenever .columns is copied/passed by reference, the same should happen for .attrs. This could have performance implications, but if .attrs is lightweight (as in our case) it should not be a problem.
  2. After this, one should include .attrs in the computation of the hash, as I did in my branch of dask linked above. Additional comment: I think I read in some pandas release notes that .attrs now needs to be JSON serializable, which means that a unique hash should always be computable.
  3. One should probably also add at least 3-4 dask operations, so that modifications of .attrs are reflected in the computational graph: 1) .attrs is set, 2) .attrs is removed, 3) .attrs is read, 4) .attrs is modified externally (.columns is immutable, .attrs is not).
  4. Finally, one should add support for the above .attrs operations in dask-expr. Again, mimicking the way dask-expr deals with .columns may be the way to go.

I think that if point 1 is addressed, the rest should be manageable. Curious to hear your comments on this.

@giovp (Member) left a comment:

Looks great @LucaMarconato, thanks! Just a question on setting the default, and whether it's OK to raise the warning every time the library is imported.

@@ -1,5 +1,17 @@
from __future__ import annotations

import dask

dask.config.set({"dataframe.query-planning": False})
Review comment (Member):

I don't get it: it's set to False here, but then once dask.dataframe is imported, it is set again to True?

@LucaMarconato (Member, Author) replied:

It's not obvious; I'll add a comment to make it clear. What can happen is that the user imports dask.dataframe before this file is imported (i.e. before dataframe.query-planning is set to False). In that case DASK_EXPR_ENABLED would be True, which we don't want, so we add this extra check.
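Something along these lines; this is only a sketch of the guard (it assumes dask.dataframe exposes the DASK_EXPR_ENABLED flag mentioned above, and the real code in __init__.py may differ):

import dask

# must run before dask.dataframe is imported anywhere in the process
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

# if the user imported dask.dataframe before spatialdata, query planning may already
# be active despite the config set above, so check the flag and warn
if getattr(dd, "DASK_EXPR_ENABLED", False):
    import warnings

    warnings.warn(
        "dask.dataframe was imported with the dask-expr (query-planning) backend enabled; "
        "import spatialdata first, or set dask.config.set({'dataframe.query-planning': False}) "
        "before importing dask.dataframe.",
        UserWarning,
        stacklevel=2,
    )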

from spatialdata.models._utils import TRANSFORM_KEY

if TRANSFORM_KEY in e.attrs:
    raise ValueError(
Review comment (Member):

this is a great check btw, I think it can close this issue: #576

@LucaMarconato (Member, Author) commented Jun 25, 2024

A comment on my latest commits: I don't particularly like the fact that _search_for_backing_files_recursively() contains some non-obvious code branches. Precisely, we have that:

  • original-from-zarr is what we look for in the dask graph when searching for the files that are "backing" the raster data.
  • read-parquet and read_parquet are the respective names that we look for when the data is stored in Parquet files. If dask-expr is not enabled we have read-parquet, otherwise read_parquet.
  • after finding read-parquet/read_parquet, the file path is stored either in creation_info or in a tuple v[1]['piece'][0]. I have linked to the dask code where this is specified.

While original-from-zarr occurs every time, the Parquet case varies with the environment: for instance, in my install (latest dask, dask-expr disabled) I fall into the read-parquet + piece case (a sketch of these branches follows below).
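For illustration, a rough sketch of these branches (function and variable names are simplified and hypothetical, not the exact implementation):

from typing import Any

def _collect_backing_files(dask_graph: Any) -> list[str]:
    """Walk a Dask graph and collect file paths for zarr- and parquet-backed data.

    Simplified sketch of the branching described above; the real function also
    recurses into nested graphs and links to the relevant dask source code.
    """
    files: list[str] = []
    for key, value in dict(dask_graph).items():
        name = key[0] if isinstance(key, tuple) else key
        if not isinstance(name, str):
            continue
        if name.startswith("original-from-zarr"):
            # raster data: the zarr store location is attached to the task
            files.append(str(value))
        elif name.startswith(("read-parquet", "read_parquet")):
            # parquet data: depending on the dask version the path is exposed via a
            # creation_info attribute or nested in a tuple as value[1]["piece"][0]
            info = getattr(value, "creation_info", None)
            if info is not None:
                files.append(str(info))
            else:
                try:
                    files.append(str(value[1]["piece"][0]))
                except (TypeError, KeyError, IndexError):
                    pass
    return files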

I don't like that it's difficult to test all these cases, since they depend on the specific versions installed, and that these patterns may change over time as new versions of dask are released.

I tried adding some tests for _search_for_backing_files_recursively() (the function that searches for the file paths in the dask graphs) to cover the various cases (each time I have to install a different dask version and pickle a file), but this approach is also not optimal because we end up with opaque pickle files.

I therefore propose the following:

  • We support only recent versions of Dask (I have set a new minimum version in the requirements).
  • We could consider asking the Dask devs to provide an API replacing get_dask_backing_files(), i.e. an API to search the graph for standard file-read operations on common formats (such as Zarr, Parquet, etc.).

Nevertheless, the currently implemented behavior is safe and ready to merge.
