
Arrays #471

Closed
wants to merge 25 commits into from

Conversation


@mrocklin mrocklin commented Dec 6, 2023

This implements a skeleton version of dask.array. It currently includes the following:

  • from_array
  • blockwise
  • slicing
  • rechunking
  • reductions
  • random

But there are still many gaps. Some notable omissions:

  • We don't yet combine slicing and from_array, so getting a single row isn't as fast as it could be
  • blockwise/elemwise operations need to be filled out. This is easy; currently things like add are built but sub is not.
  • Same for reductions; all I've done is sum.
  • There are various cut corners. For example, in slicing I pulled out slicing by a Dask array, and in reductions I haven't handled reductions that require chunks or meta to be rewritten
  • blockwise fusion isn't implemented
  • reductions could probably use another layer in the class hierarchy and some lowering (we immediately generate some implementation-specific stuff)
  • We'll maybe need to think about combine-similar for rechunking, similar to what we did for dataframes
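Filling out the remaining operators is mostly mechanical once a generic elemwise helper exists: each dunder method just routes through it with the right operator. A toy sketch of that pattern (not the dask-expr API):

```python
import operator

# Toy sketch (not the dask-expr API): with one generic _elemwise
# helper, "sub" is the same one-liner as "add".
class Array:
    def __init__(self, value):
        self.value = value

    def _elemwise(self, op, other):
        other = other.value if isinstance(other, Array) else other
        return Array(op(self.value, other))

    def __add__(self, other):
        return self._elemwise(operator.add, other)

    def __sub__(self, other):
        return self._elemwise(operator.sub, other)

    def __mul__(self, other):
        return self._elemwise(operator.mul, other)

x = Array(10)
print((x + 2).value)  # 12
print((x - 2).value)  # 8
```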

In general though I think that the majority (60%?) of the very hard work is done.

Examples

import dask_expr.array as da
x = da.random.random((10000, 10000))
y = (2 * x + x.T)[::10]
y.pprint()
Slice: index=(slice(None, None, 10), slice(None, None, None))
  Elemwise: op=<built-in function add>
    Elemwise: op=<built-in function mul>2
      Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks='auto' args=() kwargs={}
    Transpose: axes=(1, 0)
      Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks='auto' args=() kwargs={}
y.pprint()
Elemwise: op=<built-in function add>
  Elemwise: op=<built-in function mul>2
    Slice: index=(slice(None, None, 10), slice(None, None, None))
      Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks='auto' args=() kwargs={}
  Transpose: axes=(1, 0)
    Slice: index=(slice(None, None, None), slice(None, None, 10))
      Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks='auto' args=() kwargs={}
y.sum().compute()
15004283.422871998
y = (x + 1).T.rechunk((10000, 10))
y.pprint()
Rechunk: _chunks=(10000, 10) balance=False
  Transpose: axes=(1, 0)
    Elemwise: op=<built-in function add>1
      Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks='auto' args=() kwargs={}
y.optimize().pprint()
Transpose: axes=(1, 0)
  Elemwise: op=<built-in function add>1
    Random: rng=<dask_expr.array.random.RandomState object at 0x143fc2c90> distribution='random_sample' size=(10000, 10000) chunks=(10, 10000) args=() kwargs={}
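The second pprint above shows the slice being pushed beneath the elemwise add, and, with its index axes swapped, beneath the transpose, so upstream nodes produce less data. A toy sketch of that rewrite on a minimal expression tree (not the dask-expr implementation):

```python
# Toy sketch of slice pushdown (not the dask-expr implementation):
# a Slice over an Elemwise becomes an Elemwise of sliced operands,
# and a Slice crossing a 2-d Transpose swaps its index.

class Leaf:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class Elemwise:
    def __init__(self, *args):
        self.args = list(args)
    def __repr__(self):
        return f"Elemwise({', '.join(map(repr, self.args))})"

class Transpose:
    def __init__(self, arg):
        self.arg = arg
    def __repr__(self):
        return f"Transpose({self.arg!r})"

class Slice:
    def __init__(self, arg, index):
        self.arg, self.index = arg, index
    def __repr__(self):
        return f"Slice({self.arg!r}, {self.index})"

def pushdown(expr):
    if isinstance(expr, Slice):
        child = expr.arg
        if isinstance(child, Elemwise):
            # slice each operand instead of the elemwise result
            return Elemwise(*[pushdown(Slice(a, expr.index)) for a in child.args])
        if isinstance(child, Transpose):
            i, j = expr.index  # 2-d transpose: swap the index axes
            return Transpose(pushdown(Slice(child.arg, (j, i))))
    return expr

x = Leaf("x")
expr = Slice(Elemwise(x, Transpose(x)), ("::10", ":"))
print(pushdown(expr))
# Elemwise(Slice(x, ('::10', ':')), Transpose(Slice(x, (':', '::10'))))
```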


mrocklin commented Dec 6, 2023

Depends on dask/dask#10676

Also, no rush on this PR. I think that this is likely to just be a working branch for a while. I would like to get #470 in though if we can before that goes stale (which is likely to happen quickly I suspect).


mrocklin commented Dec 8, 2023

OK, I went to add blockwise, but needed unify_chunks, which needed rechunk. I've added rechunk (including our first optimization!)


mrocklin commented Dec 8, 2023

OK, this does trivial things now:

In [1]: import numpy as np, dask_expr.array as da

In [2]: x = da.from_array(np.random.random((1000, 1000)))

In [3]: y = da.from_array(np.random.random((1000)))

In [4]: z = (x + y).rechunk((500, 200))

In [5]: z.pprint()
Rechunk: _chunks=(500, 200) balance=False
  Elemwise: func=<built-in function add> out_ind=(1, 0) token='add' dtype=dtype('float64') new_axes={} kwargs={}(1, 0)(0,)
    FromArray: array='<array>' chunks='auto'
    FromArray: array='<array>' chunks='auto'

In [6]: z.optimize().pprint()
Elemwise: func=<built-in function add> out_ind=(1, 0) token='add' dtype=dtype('float64') new_axes={} kwargs={}(1, 0)(0,)
  FromArray: array='<array>' chunks=(500, 200)
  FromArray: array='<array>' chunks=(200,)
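The optimize() output above folds the Rechunk into the FromArray leaves, since from_array can produce any chunking directly; the 1-d operand picks up only the trailing dimension of the chunk spec because broadcasting aligns on the right. A toy sketch of that rule (not the dask-expr code):

```python
# Toy sketch (not the dask-expr code) of the rechunk pushdown shown
# above: Rechunk(FromArray(arr), chunks) collapses to
# FromArray(arr, chunks). A lower-dimensional operand adopts only the
# trailing dims of the chunk spec (numpy-style right alignment).

def push_rechunk_into_from_array(ndim, chunks):
    """Chunk spec an ndim-dimensional FromArray leaf should adopt."""
    return chunks[len(chunks) - ndim:]

chunks = (500, 200)
print(push_rechunk_into_from_array(2, chunks))  # (500, 200)
print(push_rechunk_into_from_array(1, chunks))  # (200,)
```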

In [7]: x.T.T.pprint()
Transpose: axes=(1, 0)
  Transpose: axes=(1, 0)
    FromArray: array='<array>' chunks='auto'

In [8]: x.T.T.optimize().pprint()
FromArray: array='<array>' chunks='auto'
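The x.T.T example shows optimize() cancelling the pair of transposes. The general rewrite behind that (sketched here as a toy, not the dask-expr code) is that two stacked Transpose nodes compose into one, and a composed permutation that is the identity can be dropped entirely, leaving just the FromArray leaf:

```python
# Toy sketch (not the dask-expr code): Transpose(Transpose(x, inner),
# outer) == Transpose(x, combined), and an identity permutation means
# the node can be removed.

def compose_axes(outer, inner):
    return tuple(inner[a] for a in outer)

def is_identity(axes):
    return axes == tuple(range(len(axes)))

axes = (1, 0)
combined = compose_axes(axes, axes)
print(combined, is_identity(combined))  # (0, 1) True -> drop the node
```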

cc @dcherian @TomNicholas @jhamman

Still very broken, but hopefully enough of a prop for conversation at AGU


x = (xr.DataArray(b, dims=["x", "y"]) + 1).chunk(x=2)

assert x.data.optimize()._name == (da.from_array(a, chunks={0: 2}) + 1)._name
@mrocklin

FYI @dcherian the xarray thing we were playing with works now


Woot!

@mrocklin

OK, I've implemented rudimentary versions of blockwise, slicing, rechunking, reductions, random, and from_array. The opening comment has been updated. This should suffice for POCs. I hope to be able to nerd snipe someone to flesh out this skeleton.

@mrocklin

@fjetter @phofl we'll need to figure out how/if we should review this and eventually merge.

Priority is certainly dataframes, and I don't want to upset the momentum there. I'd also like to avoid this sitting stale for a long time. Or maybe that's best: better for things to go stale in a PR than to go stale in main.

@dcherian

Mind adding an automatic .simplify() when calling .compute()? That will help some experiments with Xarray

@mrocklin

I've added support for general numpy ufuncs (not gufuncs) and filled out the reductions and operators a little.
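The standard hook for intercepting arbitrary numpy ufuncs on a non-ndarray type is numpy's `__array_ufunc__` protocol; a minimal sketch of that mechanism (an assumption about the approach, not the dask-expr code):

```python
import numpy as np

# Minimal sketch of numpy's __array_ufunc__ protocol, which lets a
# custom array type receive any ufunc call (np.sqrt, np.add, ...).
# This is not the dask-expr code, just the dispatch mechanism.
class Tracked:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented  # no reduce/accumulate in this sketch
        args = [x.data if isinstance(x, Tracked) else x for x in inputs]
        return Tracked(ufunc(*args, **kwargs))

x = Tracked([1.0, 4.0, 9.0])
y = np.sqrt(x)       # numpy dispatches back to Tracked
print(y.data)        # [1. 2. 3.]
```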

@fjetter fjetter mentioned this pull request Jun 24, 2024

phofl commented Jun 25, 2024

We merged the rebase.

@phofl phofl closed this Jun 25, 2024

mrocklin commented Jun 25, 2024 via email

@dcherian

Exciting!

@mrocklin

@dcherian just to be clear, this implementation is broken and would produce wrong results in many cases. Please do not use today. This is still very much a work in progress.

@dcherian

👍🏾 Excited to see movement, that's all.

@mrocklin mrocklin deleted the arrays branch June 25, 2024 21:24