Replies: 3 comments 2 replies
-
@douglasdavis, agreed that it would be nice to have that simple workflow in a notebook in our repo. We can update and extend it as the project grows, and perhaps even put in workflows that we anticipate developing. One blocker is that the demo worked on large downloaded files. We should probably fix `ak.from_parquet` to work with fsspec file systems (and URLs) sooner rather than later.
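In the meantime, a minimal sketch of the kind of fsspec-backed reading we could fall back on (the URL is a placeholder, not a real dataset): open the remote Parquet file through `fsspec` and `pyarrow`, then convert to Awkward with `ak.from_arrow`.

```python
# Sketch: read a remote Parquet file without a local download.
# The URL below is a placeholder.
import fsspec
import pyarrow.parquet as pq
import awkward as ak

# fsspec handles any protocol it knows about: http(s), s3, gcs, ...
with fsspec.open("https://example.com/data/events.parquet", "rb") as f:
    table = pq.read_table(f)      # Parquet -> Arrow Table
    array = ak.from_arrow(table)  # Arrow -> Awkward Array
```

Direct fsspec/URL support in `ak.from_parquet` would collapse this into a single call.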
-
A notebook showing some usage of the existing simple collection sounds like a good plan; I'll try to think of a temporary workaround w.r.t. the large dataset for now.
-
A couple of points were mentioned in the meeting concerning the metadata or schema that we would like to maintain in the Dask collection:
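As background for that schema discussion: Awkward already exposes both a high-level Type and a low-level Form for any array, so metadata of this kind is readily available to attach to a collection. A quick look (the example array is made up):

```python
# The schema-like metadata Awkward already carries for any array:
# a high-level Type and a low-level Form (example data is made up).
import awkward as ak

array = ak.Array([[{"x": 1.1, "y": [1]}], [], [{"x": 2.2, "y": [1, 2]}]])

print(ak.type(array))      # high-level Type: 3 * var * {"x": float64, "y": var * int64}
print(array.layout.form)   # low-level Form: JSON-like description of the buffer structure
```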
-
This is the first question I was asking/thinking about in this project (from the Dask Summit, slides here).

In our first meeting, the number of possibilities was reduced from 3 to 2 (dropping "mid-level"), and the names are "user-level DAG" (Dask graph nodes are `ak.*` function calls) and "kernel-level DAG" (Dask graph nodes are kernel function calls), to avoid confusion with Dask's own "high-level graphs."

From the conversation and a prototype I implemented, it became clear that a user-level DAG would fit better into Dask's own machinery. Kernels act purely through in-place operations, but Dask (and high-level Awkward) are functional/immutable. A task graph for kernel-level operations would also get very large: a simple expression like `listarray[2][0]` had 17 nodes.

The downside of a user-level DAG is that we don't have a general way to predict what the type (high-level Type or low-level Form) of an array will be, post-computation. However, it seems that the two strategies can be combined: a kernel-level tracer can provide type information for a user-level DAG. That tracer does not need to build up a DAG for delayed evaluation; it just needs to pass through the Awkward codebase without knowing the values of the data in array buffers. Since values are used to compute some lengths/shapes, the tracer also can't know all of the shapes (though it can know some, maybe useful for some optimizations).
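To make the distinction concrete, here is a toy illustration (not dask-awkward's actual implementation) of a user-level task graph: every node is a whole `ak.*` function call, so the graph stays small and each step is functional/immutable, which is what Dask expects. The array contents are made up.

```python
# Toy "user-level DAG": each task is one ak.* call, no in-place mutation.
import awkward as ak
import dask

listarray = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

graph = {
    "input": listarray,               # a literal value in the graph
    "counts": (ak.num, "input"),      # node = one ak.* function call
    "total": (ak.sum, "counts"),      # another ak.* call
}

print(dask.get(graph, "total"))       # synchronous scheduler; prints 5
```

A kernel-level graph for even `listarray[2][0]` would instead expand into many in-place kernel calls, which is the 17-node explosion mentioned above.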
The full set of data that the tracer would need to know and provide, at each step in the Awkward codebase, for all array buffers, is `dtype` and `ndim`. The Awkward code will have to be refactored to replace expressions like … with … and to have fallbacks for other cases in which the `shape` is asked for but not accessible. I think that will be possible.

Since the kernel-level tracer and the Awkward codebase are both things we can modify, we have a lot of flexibility. I then thought that JAX's tracer might be the biggest imposition on our flexibility, but after looking into that with @stormiestsin, it appears that JAX's JIT compilation (which uses a tracer that requires shapes) is a non-starter and JAX's autodiff (which is eager; no tracer at all) is completely unconstrained. So we're back to the kernel-level tracer being the primary source of constraints, which is a good place to be.
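A minimal sketch of what such a tracer's buffer stand-in could look like (the class name and details are hypothetical, not Awkward's actual tracer): it carries only `dtype` and `ndim`, and any code path that asks for values or a concrete `shape` has to be refactored or given a fallback.

```python
# Hypothetical buffer stand-in for a kernel-level tracer: it knows dtype
# and ndim, but refuses to reveal values or concrete lengths.
import numpy as np

class TracerBuffer:
    def __init__(self, dtype, ndim):
        self.dtype = np.dtype(dtype)   # known: element type
        self.ndim = ndim               # known: dimensionality

    @property
    def shape(self):
        # Lengths may depend on data values, which the tracer never sees.
        raise TypeError("shape is not knowable without concrete data")

    def __getitem__(self, where):
        raise TypeError("values are not knowable without concrete data")

buf = TracerBuffer(np.float64, 1)
print(buf.dtype, buf.ndim)   # float64 1  -> type questions can be answered
# buf.shape or buf[0] would raise, marking code paths that need refactoring.
```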
@douglasdavis Can we link in your demo in this discussion? Is there an ipynb notebook we can include?