How ArrayBuilder works

There are essentially three parts: (1) array representations, (2) interpretations, and (3) the ArrayBuilder.

1. Array representations

The array representations are a Numpy and Awkward-Array replacement, implementing only the numerical, contiguously-sliced, N-dimensional arrays of Numpy and jagged arrays of Awkward-Array. To get Numpy's "view" capabilities (in which an array and a slice of that array share underlying data), we rely heavily on Java's Buffer class, which has subclasses ByteBuffer (used for our RawArray) and specializations for each data type (integers and floating point numbers of various widths).

To make the translation from Numpy to Java's Buffer, we consider an underlying array with elements labeled from 0 to capacity and a view underlying[position:limit] where position is the first element we're viewing and limit is one after the last element we're viewing. Java's capacity, position, and limit are element numbers, not bytes. For example, if we view a DoubleBuffer as a ByteBuffer, all three positions get multiplied by 8.
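For illustration (a standalone sketch, not Laurelin code), here is how position and limit express a Numpy-style slice on a shared buffer:

import java.nio.ByteBuffer;

public class BufferViewDemo {
    public static void main(String[] args) {
        ByteBuffer underlying = ByteBuffer.allocate(100);   // capacity: 100 elements (bytes)
        ByteBuffer view = underlying.duplicate();           // new object, same backing data
        view.position(10);                                  // first element we're viewing
        view.limit(50);                                     // one after the last element
        // view now plays the role of underlying[10:50] in Numpy slice notation
        view.put((byte) 42);                                // relative put: writes at index 10
        System.out.println(underlying.get(10));             // prints 42: the data are shared
    }
}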

Note that Java's ByteBuffer.duplicate creates a new object pointing at the buffer but does not copy the buffer itself. We therefore use ByteBuffer.duplicate to make slices in a way that we would use ndarray.__getitem__ in Numpy. Since only ByteBuffer has this method, we store all PrimitiveArray data in ByteBuffers (see this member) instead of type-specialized buffers.

The slicing operation, underlying[position:limit], is abstracted in Array.clip, a virtual method implemented by all Array subclasses. The slicing arithmetic is done on bytes to make use of ByteBuffer.duplicate.
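A minimal sketch of that byte arithmetic (a hypothetical helper with an explicit itemsize parameter, not Laurelin's actual Array.clip):

import java.nio.ByteBuffer;

public class ClipSketch {
    // Slice an item range out of a buffer by converting items to bytes,
    // then adjusting position and limit on a zero-copy duplicate.
    static ByteBuffer clip(ByteBuffer buffer, int itemstart, int itemstop, int itemsize) {
        ByteBuffer out = buffer.duplicate();    // shares the underlying data; nothing is copied
        out.position(itemstart * itemsize);     // first byte of the view
        out.limit(itemstop * itemsize);         // one after the last byte of the view
        return out;
    }
}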

To handle N-dimensional primitive arrays, we make a distinction between

  • items: individual numbers, whether 1 or N-dimensional, same usage as Numpy;
  • multiplicity: the number of items in one entry, 1 for 1-dimensional;
  • entry: one "event," same usage as ROOT.

Thus, itemstart and itemstop are measured in units of the specialized Buffer's elements, or the number of bytes divided by itemsize, whereas entrystart and entrystop are measured in the same units as ROOT's GetEntry.
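For example, a branch of 8-byte doubles with multiplicity 12 (say, shape 4×3 per entry): the entry range 100:200 corresponds to itemstart = 100 × 12 = 1200 and itemstop = 200 × 12 = 2400, which as a ByteBuffer slice spans bytes 9600 through 19200.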

From the metadata directly in the PrimitiveArray class, we can only see the distinction between the top-level number of entries and the bottom-level number of items in each entry; it would seem that the information about how (for example) a 1000000×4×3 array differs from a 1000000×12 array is lost. Actually, this "shape" (in Numpy terms) lives in the Interpretation, which for PrimitiveArrays is always an AsDtype (with a List of dims). The interpretation is a member of the Array.

For jagged arrays, we do the same thing as Awkward-Array with one simplification: the data contained by a jagged array are always contiguous, so we can fully define a jagged array with content and offsets. A jagged array represents a list of variable-length lists.

  • content: the concatenated nested lists with no demarcations between entries;
  • offsets: a 32-bit integer array of where each nested list starts and stops. This array always begins with 0 and has one more element than the number of entries. In Numpy's slice syntax, offsets[:-1] is an array of where each nested list starts (inclusive) and offsets[1:] is an array of where each nested list stops (exclusive), and there are exactly as many starts as number of entries, exactly as many stops as number of entries.
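For example (an illustration, not Laurelin code), the jagged list [[1.1, 2.2], [], [3.3, 4.4, 5.5]] would be represented as:

public class JaggedExample {
    public static void main(String[] args) {
        double[] content = {1.1, 2.2, 3.3, 4.4, 5.5};   // concatenated, no demarcations between entries
        int[] offsets = {0, 2, 2, 5};                   // always begins with 0; length = number of entries + 1
        // entry i is content[offsets[i]] (inclusive) through content[offsets[i+1]] (exclusive)
        int i = 2;
        for (int j = offsets[i]; j < offsets[i + 1]; j++) {
            System.out.println(content[j]);             // prints 3.3, 4.4, 5.5
        }
    }
}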

The JaggedArray class is dead code and should be deleted. The JaggedArrayPrep is the real thing and should be renamed "JaggedArray". The difference is that JaggedArrayPrep has a counts attribute which, in uproot, is only used to build the JaggedArray, but in Laurelin, it is a permanent member because the "finalization" stage (described below) has to happen later than it does in uproot.

  • counts: a 32-bit integer array of the length of each nested list. It has the same number of elements as the number of entries. You could say that counts is a derivative of offsets: counts = offsets[1:] - offsets[:-1].
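For the example above, counts would be {2, 0, 3}: the pairwise differences of consecutive offsets.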

The value of counts is that a counts array can be filled in parallel: to fill entries 100:200, one does not need to know the values of entries 0:100. When baskets are read in parallel, they don't have to lock each other to fill the whole-branch array. The value of offsets is that the data structure is random-access. The offsets are only computed once: the first time an array is finalized (and JaggedArrayPrep.clip is called), the offsets are filled.
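That one-time conversion is a cumulative sum; a minimal sketch (a hypothetical helper, not Laurelin's actual code):

import java.nio.IntBuffer;

public class OffsetsSketch {
    // One-time counts-to-offsets conversion: a cumulative sum with a leading 0.
    // Assumes counts is positioned at 0 with limit equal to the number of entries.
    static IntBuffer countsToOffsets(IntBuffer counts) {
        IntBuffer offsets = IntBuffer.allocate(counts.limit() + 1);
        offsets.put(0, 0);   // offsets always begins with 0
        for (int i = 0; i < counts.limit(); i++) {
            offsets.put(i + 1, offsets.get(i) + counts.get(i));
        }
        return offsets;
    }
}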

The intention is for jagged arrays to be treated like any other kind of array. It would even be possible to express nested jagged arrays, but they don't appear in ROOT data in columnar form, so they're out of scope for Laurelin.

Specific question: what is JaggedArrayPrep.subarray supposed to do?

I wasn't sure, so I looked at PrimitiveArray.subarray, which is implemented. It looks like array.subarray() in Laurelin means what array[0] would mean in Numpy/Awkward: reduce an N-dimensional array to an (N-1)-dimensional array by taking the first element in the first dimension.

I think this was a later addition because Spark's ColumnVector works in a surprising way: TTreeColumnVector calls ArrayBuilder.getArray with knowledge of which rowId (i.e. entry number) we want, so we view (i.e. clip) the Array starting at exactly the entry we want. The view gives us [what_we_want, other stuff...], and we need to pick the zeroth element of that. Thus, subarray is a method whose purpose is to select the first element of a more-than-one-dimensional array (either fixed-size 2-dimensional or jagged).

With this interpretation, it looks like JaggedArrayPrep.subarray should be

@Override
public Array subarray() {
    // Select the first nested list: content[offsets[0]:offsets[1]].
    // Assumes offsets has already been computed (see the note below).
    return this.content.clip(this.offsets.get(0), this.offsets.get(1));
}

after ensuring that offsets is not null (perhaps the code that generates it should be moved out of JaggedArrayPrep.clip into a private method that both call), as sketched below.
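For instance (a sketch of that refactor, not existing code):

// called at the top of both clip and subarray, before offsets is used
private void ensureOffsets() {
    if (this.offsets == null) {
        this.offsets = countsToOffsets(this.counts);   // cumulative sum, as sketched earlier
    }
}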

2. Interpretations

Interpretations are both metadata for an Array, describing what would in Numpy be called its dtype and shape, and also a machine for converting bytes from ROOT TBaskets into an Array. The workflow for this machine is as follows:

  1. An entry range for a TBranch is specified. We'll want an Array representing all the data from entrystart to entrystop, with no regard for which basket the data are in.
  2. A specific Interpretation is given for the TBranch. This is determined by inspecting TBranch metadata, such as the TLeaf.
  3. TBranch metadata doesn't specify the number of items (which we need to allocate the array), but it does specify the number of bytes and the number of entries, and for all the data types Laurelin supports, we can exactly infer the number of items by division: the builder calls Interpretation.numitems. (There are other types, such as strings, for which a slight overestimate can be inferred; we've been lucky with the metadata ROOT gives us—it wasn't intended for conversion to preallocated arrays.)
  4. If the entry range corresponds to no data, the Interpretation can produce an empty Array: the builder calls Interpretation.empty.
  5. Otherwise, the builder identifies which baskets the entrystart, entrystop range covers and launches a task for each TBasket. Before forking into tasks, it asks the Interpretation to create an empty output array using Interpretation.destination. The size of this array is rounded up to the nearest TBasket boundaries; it may start before entrystart and/or stop after entrystop, with the understanding that we will clip off the excess in the finalization step.
  6. Each task (which may be run in parallel in an Executor) calls Interpretation.fromroot to convert one TBasket into an array for just that TBasket. This is the one and only point where ROOT bytes are converted into an array.
  7. Still in the possibly-parallel task, Interpretation.fill is called to copy the single-TBasket array into the destination array.
  8. In uproot, we finish by clipping the excess entries that we got by rounding up to the nearest TBaskets and "finalizing," which means converting counts to offsets in jagged arrays and other things in more exotic data types like Double32. In Laurelin, these steps are postponed until somebody calls ArrayBuilder.getArray, to give control to the calling code so that it can perform other tasks between constructing the ArrayBuilder and extracting the results with getArray. This, for instance, lets TBaskets from different TBranches be read in parallel, filling up the Executor, avoiding serialization points. (Uproot does this with a more complex composition of futures.)
  9. When TTreeColumnVector does call ArrayBuilder.getArray, this calls the Interpretation's last two steps: Interpretation.clip (to remove excess entries from the rounded-up TBaskets and to remove everything below the desired rowId, since we know that in Laurelin), and Interpretation.finalize. In uproot, these are only called once, but in Laurelin, they're called for every getArray. However, they're very lightweight operations (remember that JaggedArrayPrep only calculates offsets once).
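Putting these steps together, the interface they imply looks roughly like this (method names come from the steps above; the parameter lists and the Array placeholder are assumptions for illustration, not Laurelin's actual signatures):

import java.nio.ByteBuffer;

interface Array {}   // placeholder standing in for the array representations of part 1

interface Interpretation {
    long numitems(long numbytes, long numentries);                               // step 3: infer item count by division
    Array empty();                                                               // step 4: entry range covers no data
    Array destination(long numitems, long numentries);                           // step 5: preallocate, rounded up to TBasket boundaries
    Array fromroot(ByteBuffer basketdata, long entrystart, long entrystop);      // step 6: one TBasket's bytes -> one Array
    void fill(Array source, Array destination, long itemstart, long itemstop);   // step 7: copy into the destination
    Array clip(Array destination, long entrystart, long entrystop);              // step 9: trim the rounded-up excess
    Array finalize(Array destination);                                           // step 9: type-specific finalization
}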

This may sound like a complex process, but it provides all the hooks for any type of data to be interpreted through the same interface: ArrayBuilder knows nothing about data types, and Interpretations are stateless tools wielded by the ArrayBuilder. It separates the single-TBasket interpretation process from the allocation and filling of a destination, which additionally allows the TBasket reading (and decompression!) to be parallelized. Some steps, like "finalization," are not relevant for many data types, but they exist for those that need them.

Furthermore, the process can be composed. The AsJagged interpretation contains its content's interpretation (for Laurelin, only AsDtype, but uproot shows how this can be generalized). Each of these stages, like AsJagged.fromroot, does what it needs to do and then calls its content.fromroot, passing down the chain.
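Schematically (reusing the placeholder types from the sketch above; the structure and signatures are assumptions, not Laurelin's actual code):

// Each stage does its own level's work, then calls the same stage on its content.
class AsJagged {
    private final Interpretation content;   // for Laurelin, always an AsDtype

    AsJagged(Interpretation content) {
        this.content = content;
    }

    Array fromroot(java.nio.ByteBuffer basketdata, long entrystart, long entrystop) {
        // ...this level's work (extracting per-entry counts from the basket) goes here...
        Array flattened = content.fromroot(basketdata, entrystart, entrystop);   // pass down the chain
        return flattened;   // a real implementation would wrap this in a JaggedArrayPrep
    }
}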

The Interpretation framework has just enough complexity to permit that, significantly reducing the amount of code needed to support each new data type.

3. ArrayBuilder
