Replies: 3 comments 2 replies
-
@douglasdavis, agreed that it would be nice to have that simple workflow in a notebook in our repo. We can update and extend it as the project grows, and perhaps even put in workflows that we anticipate developing. One blocker is that the demo worked on large downloaded files. We should probably fix `ak.from_parquet` to work with fsspec file systems (and URLs) sooner rather than later.
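In the meantime, a minimal sketch of the kind of fsspec-backed reading we could fall back on (the URL is a placeholder, not a real dataset): open the remote Parquet file through `fsspec` and `pyarrow`, then convert to Awkward with `ak.from_arrow`.

```python
# Sketch: read a remote Parquet file without a local download.
# The URL below is a placeholder.
import fsspec
import pyarrow.parquet as pq
import awkward as ak

# fsspec handles any protocol it knows about: http(s), s3, gcs, ...
with fsspec.open("https://example.com/data/events.parquet", "rb") as f:
    table = pq.read_table(f)      # Parquet -> Arrow Table
    array = ak.from_arrow(table)  # Arrow -> Awkward Array
```

Direct fsspec/URL support in `ak.from_parquet` would collapse this into a single call.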
-
A notebook showing some usage of the existing simple collection sounds like a good plan; I'll try to think of a temporary workaround w.r.t. the large dataset for now.
-
A couple of points were mentioned in the meeting concerning the metadata or schema that we would like to maintain in the Dask collection:
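As background for that schema discussion: Awkward already exposes both a high-level Type and a low-level Form for any array, so metadata of this kind is readily available to attach to a collection. A quick look (the example array is made up):

```python
# The schema-like metadata Awkward already carries for any array:
# a high-level Type and a low-level Form (example data is made up).
import awkward as ak

array = ak.Array([[{"x": 1.1, "y": [1]}], [], [{"x": 2.2, "y": [1, 2]}]])

print(ak.type(array))      # high-level Type: 3 * var * {"x": float64, "y": var * int64}
print(array.layout.form)   # low-level Form: JSON-like description of the buffer structure
```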
-
This is the first question I was asking/thinking about in this project (from the Dask Summit, slides here).

In our first meeting, the number of possibilities was reduced from 3 to 2 (dropping "mid-level"), and the names are "user-level DAG" (Dask graph nodes are `ak.*` function calls) and "kernel-level DAG" (Dask graph nodes are kernel function calls), to avoid confusion with Dask's own "high-level graphs."

From the conversation and a prototype I implemented, it became clear that a user-level DAG would fit better into Dask's own machinery. Kernels act purely through in-place operations, but Dask (and high-level Awkward) are functional/immutable. A task graph for kernel-level operations would also get very large: a simple expression like `listarray[2][0]` had 17 nodes.

The downside of a user-level DAG is that we don't have a general way to predict what the type (high-level Type or low-level Form) of an array will be, post-computation. However, it seems that the two strategies can be combined: a kernel-level tracer can provide type information for a user-level DAG. That tracer does not need to build up a DAG for delayed evaluation; it just needs to pass through the Awkward codebase without knowing the values of the data in array buffers. Since values are used to compute some lengths/shapes, the tracer also can't know all of the shapes (though it can know some, maybe useful for some optimizations).
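To make the distinction concrete, here is a toy illustration (not dask-awkward's actual implementation) of a user-level task graph: every node is a whole `ak.*` function call, so the graph stays small and each step is functional/immutable, which is what Dask expects. The array contents are made up.

```python
# Toy "user-level DAG": each task is one ak.* call, no in-place mutation.
import awkward as ak
import dask

listarray = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

graph = {
    "input": listarray,               # a literal value in the graph
    "counts": (ak.num, "input"),      # node = one ak.* function call
    "total": (ak.sum, "counts"),      # another ak.* call
}

print(dask.get(graph, "total"))       # synchronous scheduler; prints 5
```

A kernel-level graph for even `listarray[2][0]` would instead expand into many in-place kernel calls, which is the 17-node explosion mentioned above.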
The full set of data that the tracer would need to know and provide, at each step in the Awkward codebase, for all array buffers, is `dtype` and `ndim`. The Awkward code will have to be refactored to replace expressions like … with … and to have fallbacks for other cases in which the `shape` is asked for but not accessible. I think that will be possible.

Since the kernel-level tracer and the Awkward codebase are both things we can modify, we have a lot of flexibility. I then thought that JAX's tracer might be the biggest imposition on our flexibility, but after looking into that with @stormiestsin, it appears that JAX's JIT compilation (which uses a tracer that requires shapes) is a non-starter and JAX's autodiff (which is eager; no tracer at all) is completely unconstrained. So we're back to the kernel-level tracer being the primary source of constraints, which is a good place to be.
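A minimal sketch of what such a tracer's buffer stand-in could look like (the class name and details are hypothetical, not Awkward's actual tracer): it carries only `dtype` and `ndim`, and any code path that asks for values or a concrete `shape` has to be refactored or given a fallback.

```python
# Hypothetical buffer stand-in for a kernel-level tracer: it knows dtype
# and ndim, but refuses to reveal values or concrete lengths.
import numpy as np

class TracerBuffer:
    def __init__(self, dtype, ndim):
        self.dtype = np.dtype(dtype)   # known: element type
        self.ndim = ndim               # known: dimensionality

    @property
    def shape(self):
        # Lengths may depend on data values, which the tracer never sees.
        raise TypeError("shape is not knowable without concrete data")

    def __getitem__(self, where):
        raise TypeError("values are not knowable without concrete data")

buf = TracerBuffer(np.float64, 1)
print(buf.dtype, buf.ndim)   # float64 1  -> type questions can be answered
# buf.shape or buf[0] would raise, marking code paths that need refactoring.
```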
@douglasdavis Can we link in your demo in this discussion? Is there an ipynb notebook we can include?