
Removed dask pinning #570

Merged (25 commits), Jun 25, 2024
Conversation

@LucaMarconato (Member) commented May 24, 2024

Removing the dask pinning from pyproject.toml.

This PR also tests against Python 3.12.

@LucaMarconato (Member, Author) commented May 24, 2024

Tracking the main differences, one per comment. The ticked comments are resolved.

  • I changed the various `from dask.dataframe.core import DataFrame as DaskDataFrame` occurrences to `from dask.dataframe import DataFrame as DaskDataFrame`, because the two import locations give different classes and the second one is the correct one, as shown in the screenshot below.
[screenshot comparing the classes returned by the two import locations]
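For reference, a minimal before/after of the change (the DaskDataFrame alias matches what the codebase already uses):

# before: this import location can resolve to a different class (see the screenshot above)
# from dask.dataframe.core import DataFrame as DaskDataFrame

# after: the public import location, which gives the correct collection class
from dask.dataframe import DataFrame as DaskDataFrame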

@LucaMarconato (Member, Author) commented May 25, 2024

  • Changes required in spatialdata._io._utils._get_backing_files():
    • df.dask.layers (a dict) is no longer available for dask dataframes; df.dask is now directly the dict we need. x.dask.layers remains available for dask arrays (a sketch follows at the end of this comment).
    • when searching for a "read-parquet-" operation in the dask graph, the .parquet file path needs to be extracted from the graph in a different way.

Edit. Additional comments:

  • .layers disappeared for dask dataframes even when the dask-expr backend is disabled.
  • I have refactored the function that retrieves the dask-backing files (_get_backing_files()); it is now more robust.
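A rough sketch of the difference, as a best-effort illustration (attribute availability depends on the installed dask version):

import dask.array as da
import pandas as pd
from dask.dataframe import from_pandas

darr = da.zeros((4, 4), chunks=2)
ddf = from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), npartitions=1)

# dask arrays still expose a graph with named layers
array_layer_names = list(darr.dask.layers)

# for dask dataframes .layers is gone; materializing the graph as a plain dict
# gives the task mapping directly (this also works when .dask is already a dict)
dataframe_task_keys = list(dict(ddf.dask))

print(array_layer_names)
print(dataframe_task_keys[:3])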

@LucaMarconato (Member, Author) commented May 25, 2024

Let's wait and see whether there is an upstream solution for the two open points above; I think that would fix most, if not all, of the failing tests.

@giovp (Member) commented Jun 19, 2024

Thanks @LucaMarconato! I wasn't aware of dask-expr removing the .attrs attribute; luckily, it looks like they might reconsider?

@LucaMarconato (Member, Author) commented:

In pandas .attrs is meant to stay: pandas-dev/pandas#52166 (comment).

I am working on a PR to restore .attrs in dask-expr; it is still WIP (a bunch of tests don't pass): dask/dask-expr@main...LucaMarconato:dask-expr:support_attrs.

@LucaMarconato (Member, Author) commented:

  • in _get_backing_files(), for DaskDataFrame, element.dask.items() now iterates over a mapping whose keys are tuple[str, int] instead of str as before. Fixed (see the sketch below).
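A hedged sketch of handling both key shapes (the helper name here is hypothetical, not the actual code in _get_backing_files()):

from typing import Any, Iterator

def _iter_task_names(dask_graph: Any) -> Iterator[str]:
    """Yield task/layer names from a Dask graph, tolerating both key shapes.

    Keys used to be plain strings; on recent dask versions they can be tuples
    such as ("read-parquet-<token>", 0), in which case the name is the first element.
    """
    for key in dict(dask_graph):
        yield key[0] if isinstance(key, tuple) else key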

@LucaMarconato (Member, Author) commented:

  • copy.deepcopy used to work on dask.dataframe.DataFrame; now it produces broken objects (for instance, calling print on a deepcopied object fails). The deepcopy from `from spatialdata import deepcopy` works, so I replaced the only occurrence where we were using copy.deepcopy with spatialdata.deepcopy (see the sketch below).
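A minimal sketch of the replacement; the points element construction via PointsModel.parse is just for illustration:

import pandas as pd

from spatialdata import deepcopy  # the element-aware deepcopy mentioned above
from spatialdata.models import PointsModel

points = PointsModel.parse(pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 2.0]}))

# copy.deepcopy(points) now yields a broken object (e.g. printing it fails);
# spatialdata.deepcopy keeps the Dask-backed element usable
copied = deepcopy(points)
print(copied)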

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 9 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (b8fbc5c) to head (82aba6c).
Report is 59 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/_io/_utils.py 84.21% 6 Missing ⚠️
src/spatialdata/__init__.py 80.00% 1 Missing ⚠️
src/spatialdata/_core/operations/transform.py 80.00% 1 Missing ⚠️
src/spatialdata/transformations/_utils.py 96.42% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #570      +/-   ##
==========================================
- Coverage   91.94%   91.91%   -0.04%     
==========================================
  Files          44       44              
  Lines        6608     6641      +33     
==========================================
+ Hits         6076     6104      +28     
- Misses        532      537       +5     
Files with missing lines Coverage Δ
src/spatialdata/_core/_deepcopy.py 98.38% <100.00%> (ø)
src/spatialdata/_core/_elements.py 91.95% <100.00%> (ø)
src/spatialdata/_core/centroids.py 100.00% <100.00%> (ø)
src/spatialdata/_core/data_extent.py 97.93% <100.00%> (ø)
src/spatialdata/_core/operations/_utils.py 92.30% <ø> (ø)
src/spatialdata/_core/operations/aggregate.py 94.38% <100.00%> (ø)
src/spatialdata/_core/operations/rasterize.py 90.49% <100.00%> (ø)
src/spatialdata/_core/operations/rasterize_bins.py 89.15% <100.00%> (ø)
src/spatialdata/_core/query/relational_query.py 90.82% <100.00%> (ø)
src/spatialdata/_core/query/spatial_query.py 95.17% <100.00%> (ø)
... and 7 more

... and 1 file with indirect coverage changes

@LucaMarconato marked this pull request as ready for review on June 24, 2024, 20:25
@LucaMarconato (Member, Author) commented:

Finally, this PR is ready for review! Could any of you please have a look, @giovp @melonora @kevinyamauchi?

This PR removes the pin on the dask dependency (the pin was causing installation problems).

A few important comments:

  • Dask's transition to dask-expr as the default backend for dask.dataframe.DataFrame had many implications for our codebase (listed above), leading to hundreds of failing tests. I have addressed all the problems except one (around .attrs; I discuss it below). Unfortunately I couldn't fix it, so for the moment I have disabled the dask-expr backend in __init__.py and conftest.py.
  • This means that some of the code changes in this PR only take effect when dask-expr is enabled. Nevertheless, I decided to keep them in this PR because all the tests pass anyway, and because this way fewer tests will remain to be fixed when we re-enable the dask-expr backend in the future.

@LucaMarconato (Member, Author) commented Jun 24, 2024

  • I have made _get_backing_files() more robust, in particular by adding references to the Dask codebase at the points where file paths are extracted from nodes of the Dask graph. Please let me know if you have any question on that and I can provide more details, but probably the best way is to run the code, set a breakpoint, and have a look at the Dask graph.

Further comments:

  • I found a weird bug with transformations (unrelated to this PR, but spotted during debugging); you can see the comment in the code:
    # TODO: the following line, used in place of the line before, leads to an incorrect aggregation result. Look into
  • I will open a separate issue to track it.

@LucaMarconato (Member, Author) commented Jun 24, 2024

My posts below are not needed for the review of this PR

df.attrs is not available anymore; I opened an issue here: dask/dask#11146.

Now some details regarding the behavior of .attrs that I didn't manage to address. I will write a preliminary explanation here, and I plan to give the problem a second stab in a few weeks and report it accurately to the Dask devs (edit, 1 month later: I discussed this at EuroSciPy with pandas devs; it would be good to try to fix this in pandas, but I won't have time now; edit, 5 months later: I didn't find the time for a second pass and ended up disabling the new optimizer by default. Getting back to this now because the old optimizer just got deprecated). CC @giovp @kevinyamauchi @melonora, and also @ivirshup.

  • Dask added support for .attrs with this small PR: https://github.com/dask/dask/pull/6742/files. With dask-expr the attribute has been removed (Dask 2024.5.1 removed .attrs, dask/dask#11146).
  • The reason it was removed is that in pandas the semantics of .attrs are not clearly defined, especially around copy-on-write (DEPR: attrs, pandas-dev/pandas#52166), and this made it difficult to handle in dask-expr.
  • Naively, I thought that re-adding support for it would be as simple as adding an extra attribute to a container class, but it is not, for the following reasons (pointed out by the core devs):
    • There is no stable container (called a collection in the codebase); instead, expressions are used to create collections on the fly in many parts of the code;
    • The underlying expression may change/reshuffle when .optimize() is called, and this function is called in several places, in particular in .compute().

I still thought that one could simply store the .attrs inside an expression and, when .optimize() is called, pick up any of the .attrs available (most of the time there would be just one), but this is not possible because of how the class Expr is designed, as I will now explain.

The class Expr (https://github.com/dask/dask-expr/blob/main/dask_expr/_core.py#L45), if I understood it correctly, uses a class variable _instances that stores references to expressions, indexing them by a hash (see the _name property: https://github.com/dask/dask-expr/blob/cb121cddb7fd4682a232ccfbf3927185d4f7465b/dask_expr/_core.py#L458). In dask, .attrs is not used to compute this hash, which leads to the key point:

if I initialize a Dask dataframe by calling from_pandas(), set its .attrs to something, and then call from_pandas() again on a copy of the pandas dataframe, the newly returned Dask dataframe will have the same .attrs as the first object!
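A hedged sketch of what this looks like in practice (it assumes the patched dask/dask-expr branches linked in this thread, since on released dask-expr .attrs is simply absent):

import pandas as pd
from dask.dataframe import from_pandas

pdf = pd.DataFrame({"x": [1, 2, 3]})

ddf1 = from_pandas(pdf, npartitions=1)
ddf1.attrs["transform"] = {"global": "some transformation"}

# same data, second call: the expression hash does not include .attrs, so the
# cached Expr instance (and with it the .attrs set above) can be reused for ddf2
ddf2 = from_pandas(pdf.copy(), npartitions=1)
print(ddf2.attrs)  # unexpectedly mirrors ddf1.attrs instead of being empty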

The implication is that when .optimize() is called one can't simply pick up whichever .attrs is found in the new optimized graph, because the new dask graph may have subcomponents that already existed in Expr._instances (referring to old objects). In practice, when I tried this in the spatialdata codebase I was getting dask dataframes (i.e. points elements) with transformations belonging to entirely distinct objects.

I tried a bunch of experiments; here are some of them: changes in the dask repository to include .attrs in the computation of the hashes for Expr._instances, and changes in the dask-expr repository to try to re-enable .attrs in dask-expr.

The tests that I created pass (they should live in dask, not in dask-expr, but it was convenient to have them there), but some "real-world" tests in spatialdata still don't pass, because sometimes the .attrs get shared across different objects. This should not happen, because I changed the hashes to include .attrs, but maybe in pandas the .attrs is sometimes passed by reference instead of being copied, and this leads to computing a hash with the .attrs of another object.

@LucaMarconato (Member, Author) commented:

Here is some code that shows the problem (it requires the dask and dask-expr branches that I linked above, this PR, and removing the lines of code that disable the dask-expr backend, located in conftest.py and __init__.py).

In a few weeks I will try to make a shorter code example, independent of spatialdata, so I can share it with the dask devs.

import dask.array as da
import pandas as pd
from dask.dataframe import from_pandas
from xarray import DataArray

from spatialdata import SpatialData
from spatialdata.transformations import Identity

# build a points element (a Dask dataframe) and attach a transformation via .attrs
df = pd.DataFrame({"x": [1, 2, 3, 4, 1, 2], "y": [1, 2, 3, 4, 1, 2]})
ddf = from_pandas(df, npartitions=1)
ddf.attrs["transform"] = {"transformed": Identity()}
axes = ["x", "y"]

sdata = SpatialData.init_from_elements({"ddf": ddf})
sdata_transformed = sdata.transform_to_coordinate_system("transformed")

# stack the coordinate columns into an xarray DataArray
arrays = []
for ax in axes:
    arrays.append(ddf[ax].to_dask_array(lengths=True).reshape(-1, 1))
xdata = DataArray(da.concatenate(arrays, axis=1), coords={"points": range(len(ddf)), "dim": list(axes)})
transformed = ddf.drop(columns=list(axes)).copy()

# the weird part starts here: the .attrs set on `transformed` gets lost while
# the coordinate columns are written back
transformed.attrs = {"transform": {"global": Identity()}}
for ax in axes:
    indices = xdata["dim"] == ax
    new_ax = xdata[:, indices]
    transformed[ax] = new_ax.data.flatten()

if "global" not in transformed.attrs["transform"]:
    print("bug")

@LucaMarconato (Member, Author) commented:

Finally, here is a sketch of what I think could be a solution:

  1. First, one should address the problems regarding .attrs in pandas; intuitively, one should mimic what happens with .columns (which is basically metadata living outside the dataframe body). Whenever .columns is copied/passed by reference, the same should happen for .attrs. This could have performance implications, but if .attrs is lightweight (as in our case) it should not be a problem.
  2. After this, one should include .attrs in the computation of the hash, as I did in my branch of dask linked above. Additional comment: I think I read in some pandas release notes that .attrs now needs to be JSON serializable, which means that a unique hash should always be computable.
  3. One should probably also add at least 3-4 dask operations, so that modifications of .attrs are reflected in the computational graph: 1) .attrs is set, 2) .attrs is removed, 3) .attrs is read, 4) .attrs is modified externally (.columns is immutable, .attrs is not).
  4. Finally, one should add support for the above .attrs operations in dask-expr. Again, mimicking the way dask-expr deals with .columns may be the way to go.

I think that if point 1 is addressed, the rest should be manageable. Curious to hear your comments on this.

@giovp (Member) left a comment:

Looks great @LucaMarconato, thanks! Just a question on setting the default, and whether it's OK to raise the warning every time the library is imported.

@@ -1,5 +1,17 @@
from __future__ import annotations

import dask

dask.config.set({"dataframe.query-planning": False})
Review comment (Member):

I don't get it: it's set to False here, but then once dask.dataframe is imported, it is set again to True?

@LucaMarconato (Member, Author) replied:

It's not obvious; I'll add a comment to make it clear. What can happen is that the user imports dask.dataframe before this file is imported (i.e. before dataframe.query-planning is set to False). In that case DASK_EXPR_ENABLED would be True, which we don't want, so we add this extra check.
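Something along these lines; this is only a sketch of the guard (it assumes dask.dataframe exposes the DASK_EXPR_ENABLED flag mentioned above, and the real code in __init__.py may differ):

import dask

# must run before dask.dataframe is imported anywhere in the process
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

# if the user imported dask.dataframe before spatialdata, query planning may already
# be active despite the config set above, so check the flag and warn
if getattr(dd, "DASK_EXPR_ENABLED", False):
    import warnings

    warnings.warn(
        "dask.dataframe was imported with the dask-expr (query-planning) backend enabled; "
        "import spatialdata first, or set dask.config.set({'dataframe.query-planning': False}) "
        "before importing dask.dataframe.",
        UserWarning,
        stacklevel=2,
    )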

from spatialdata.models._utils import TRANSFORM_KEY

if TRANSFORM_KEY in e.attrs:
    raise ValueError(
Review comment (Member):

this is a great check btw, I think it can close this issue: #576

@LucaMarconato (Member, Author) commented Jun 25, 2024

A comment on my latest commits: I don't particularly like the fact that _search_for_backing_files_recursively() contains some non-obvious code branches. Precisely, we have that:

  • original-from-zarr is what we look for in the dask graph when searching for the files that are "backing" the raster data.
  • read-parquet and read_parquet are the respective names that we look for when the data is stored in Parquet files. If dask-expr is not enabled we have read-parquet, otherwise read_parquet.
  • after finding read-parquet/read_parquet, the file path is stored either in creation_info or in a tuple v[1]['piece'][0]. I have linked to the dask code where this is specified.

While original-from-zarr occurs every time, the Parquet case varies with the environment: for instance, in my install (latest dask, dask-expr disabled) I fall into the read-parquet + piece case (a sketch of these branches follows below).
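For illustration, a rough sketch of these branches (function and variable names are simplified and hypothetical, not the exact implementation):

from typing import Any

def _collect_backing_files(dask_graph: Any) -> list[str]:
    """Walk a Dask graph and collect file paths for zarr- and parquet-backed data.

    Simplified sketch of the branching described above; the real function also
    recurses into nested graphs and links to the relevant dask source code.
    """
    files: list[str] = []
    for key, value in dict(dask_graph).items():
        name = key[0] if isinstance(key, tuple) else key
        if not isinstance(name, str):
            continue
        if name.startswith("original-from-zarr"):
            # raster data: the zarr store location is attached to the task
            files.append(str(value))
        elif name.startswith(("read-parquet", "read_parquet")):
            # parquet data: depending on the dask version the path is exposed via a
            # creation_info attribute or nested in a tuple as value[1]["piece"][0]
            info = getattr(value, "creation_info", None)
            if info is not None:
                files.append(str(info))
            else:
                try:
                    files.append(str(value[1]["piece"][0]))
                except (TypeError, KeyError, IndexError):
                    pass
    return files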

I don't like that it's difficult to test all these cases, since they depend on the specific versions installed, and that these patterns may change over time as new versions of dask are released.

I tried adding some tests for _search_for_backing_files_recursively() (the function that searches for the file paths in the dask graphs) to cover the various cases (each time I have to install a different dask version and pickle a file), but this approach is also not optimal because we end up with opaque pickle files.

I therefore propose the following:

  • We support only recent versions of Dask (I have set a new minimum version in the requirements).
  • We could consider asking the Dask devs to provide an API replacing get_dask_backing_files(), i.e. an API to search the graph for standard file-read operations on common formats (such as Zarr, Parquet, etc.).

Nevertheless, the currently implemented behavior is safe and ready to merge.
