How ArrayBuilder works

Jim Pivarski edited this page Aug 28, 2019 · 17 revisions

There are essentially three parts: (1) array representations, (2) interpretations, and (3) the ArrayBuilder.

Array representations

The array representations are a replacement for Numpy and Awkward-Array, implementing only Numpy's numerical, contiguously sliced, 1- and 2-dimensional arrays and Awkward-Array's jagged arrays. To get Numpy's "view" capabilities (in which an array and a slice of that array share underlying data), we rely heavily on Java's Buffer class, whose subclasses include ByteBuffer (used for our RawArray) and a specialization for each data type (integers and floating point numbers of various widths).

To translate from Numpy to Java's Buffer, we consider an underlying array with elements labeled from 0 to capacity - 1 and a view underlying[position:limit], where position is the first element we're viewing and limit is one past the last element we're viewing. Java's capacity, position, and limit count elements, not bytes. For example, if we view a DoubleBuffer as a ByteBuffer, all three values get multiplied by 8.
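As a rough illustration of this unit conversion, the sketch below builds a DoubleBuffer view equivalent to underlying[2:5] over a 10-element underlying array and reports the same three values in bytes. The names BufferUnitsSketch, exampleView, and inBytes are hypothetical and not part of this project; only the java.nio calls are real.

```java
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

// Hypothetical demo class, not part of the project.
final class BufferUnitsSketch {
    /** A view like underlying[2:5] over 10 doubles, measured in elements. */
    static DoubleBuffer exampleView() {
        DoubleBuffer view = ByteBuffer.allocate(10 * Double.BYTES).asDoubleBuffer();
        view.limit(5);     // one past the last element we are viewing
        view.position(2);  // first element we are viewing
        return view;
    }

    /** Returns {position, limit, capacity} converted from elements to bytes. */
    static int[] inBytes(DoubleBuffer view) {
        return new int[] {
            view.position() * Double.BYTES,
            view.limit() * Double.BYTES,
            view.capacity() * Double.BYTES
        };
    }

    public static void main(String[] args) {
        int[] b = inBytes(exampleView());
        // prints "position=16 limit=40 capacity=80"
        System.out.println("position=" + b[0] + " limit=" + b[1] + " capacity=" + b[2]);
    }
}
```

The limit is set before the position because a Buffer rejects a position beyond its current limit.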

Note that Java's ByteBuffer.duplicate creates a new object pointing at the same buffer but does not copy the buffer's contents. We therefore use ByteBuffer.duplicate to make slices the same way we would use ndarray.__getitem__ in Numpy. So that this one method can be used regardless of data type, we store all PrimitiveArray data in ByteBuffers (see this member) instead of type-specialized buffers.

The slicing operation, underlying[position:limit], is abstracted in Array.clip, a virtual method implemented by all Array subclasses. The slicing arithmetic is done on bytes to make use of ByteBuffer.duplicate.
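The idea can be sketched as follows. This is only a minimal stand-in for a clip-style method on a primitive array, not the project's actual Array.clip: the class name ClipSketch, the static signature, and the choice of absolute (not relative) item indices are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class, not part of the project.
final class ClipSketch {
    /**
     * Returns a view equivalent to underlying[itemstart:itemstop] that shares
     * data with the original buffer. Indices are absolute item numbers.
     */
    static ByteBuffer clip(ByteBuffer buffer, int itemsize, int itemstart, int itemstop) {
        ByteBuffer out = buffer.duplicate();  // new object, same underlying bytes
        out.limit(itemstop * itemsize);       // slicing arithmetic is done on bytes
        out.position(itemstart * itemsize);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer whole = ByteBuffer.allocate(5 * Integer.BYTES);
        for (int i = 0; i < 5; i++) {
            whole.putInt(i * Integer.BYTES, 10 * i);  // absolute writes: 0, 10, 20, 30, 40
        }
        ByteBuffer slice = clip(whole, Integer.BYTES, 1, 4);  // like whole[1:4]
        slice.putInt(slice.position(), 99);  // overwrite the first viewed item
        // The write is visible through the original buffer: prints 99
        System.out.println(whole.getInt(1 * Integer.BYTES));
    }
}
```

Because duplicate gives the new object its own position and limit, clipping leaves the original buffer's position and limit untouched.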

To handle 2-dimensional primitive arrays, we make a distinction between

  • items: individual numbers, whether 1 or 2-dimensional, same usage as Numpy;
  • multiplicity: the number of items in one entry, 1 for 1-dimensional;
  • entry: one "event," same usage as ROOT.

Thus, itemstart and itemstop are measured in units of the specialized Buffer's elements, or the number of bytes divided by itemsize, whereas entrystart and entrystop are measured in the same units as ROOT's GetEntry.
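The relationship between the three units can be written out as simple arithmetic. The sketch below assumes, following the definition of multiplicity above, that an entry index times multiplicity gives an item index, and that an item index times itemsize gives a byte offset. The class and method names are hypothetical.

```java
// Hypothetical demo class, not part of the project.
final class EntryItemSketch {
    /** Entry index to item index: each entry holds `multiplicity` items. */
    static int entryToItem(int entry, int multiplicity) {
        return entry * multiplicity;
    }

    /** Item index to byte offset: each item occupies `itemsize` bytes. */
    static int itemToByte(int item, int itemsize) {
        return item * itemsize;
    }

    public static void main(String[] args) {
        int multiplicity = 3;         // e.g. a 2-dimensional array with 3 numbers per entry
        int itemsize = Double.BYTES;  // 8 bytes per double
        int entrystart = 2, entrystop = 5;

        int itemstart = entryToItem(entrystart, multiplicity);  // 6
        int itemstop = entryToItem(entrystop, multiplicity);    // 15
        // prints "items [6, 15) bytes [48, 120)"
        System.out.println("items [" + itemstart + ", " + itemstop + ") bytes ["
            + itemToByte(itemstart, itemsize) + ", " + itemToByte(itemstop, itemsize) + ")");
    }
}
```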

For jagged arrays, we do the same thing as Awkward-Array with one simplification: the data contained by a jagged array are always contiguous, so we can fully define a jagged array with content and offsets. A jagged array represents a list of variable-length lists.

  • content: the concatenated nested lists with no demarcations between entries;
  • offsets: a 32-bit integer array of where each nested list starts and stops. This array always begins with 0 and has one more element than the number of entries. In Numpy's slice syntax, offsets[:-1] is an array of where each nested list starts (inclusive) and offsets[1:] is an array of where each nested list stops (exclusive), so there are exactly as many starts, and exactly as many stops, as there are entries.
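
To make the content and offsets layout concrete, here is a minimal sketch that builds both arrays from a list of variable-length lists and reads one entry back. It uses plain Java arrays for brevity where the project uses buffers, and JaggedSketch and all of its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the project.
final class JaggedSketch {
    final double[] content;  // concatenated nested lists, no demarcations
    final int[] offsets;     // begins with 0, length = number of entries + 1

    JaggedSketch(double[] content, int[] offsets) {
        this.content = content;
        this.offsets = offsets;
    }

    static JaggedSketch fromLists(double[][] lists) {
        int[] offsets = new int[lists.length + 1];  // offsets[0] is already 0
        for (int i = 0; i < lists.length; i++) {
            offsets[i + 1] = offsets[i] + lists[i].length;
        }
        double[] content = new double[offsets[lists.length]];
        for (int i = 0; i < lists.length; i++) {
            System.arraycopy(lists[i], 0, content, offsets[i], lists[i].length);
        }
        return new JaggedSketch(content, offsets);
    }

    int numEntries() { return offsets.length - 1; }

    /** Like offsets[:-1]: where each nested list starts (inclusive). */
    int[] starts() { return Arrays.copyOfRange(offsets, 0, offsets.length - 1); }

    /** Like offsets[1:]: where each nested list stops (exclusive). */
    int[] stops() { return Arrays.copyOfRange(offsets, 1, offsets.length); }

    /** Copies out entry i; a copy is used here for simplicity, not a view. */
    double[] entry(int i) {
        return Arrays.copyOfRange(content, offsets[i], offsets[i + 1]);
    }

    public static void main(String[] args) {
        JaggedSketch j = fromLists(new double[][] {{1.1, 2.2, 3.3}, {}, {4.4, 5.5}});
        System.out.println(Arrays.toString(j.offsets));   // prints [0, 3, 3, 5]
        System.out.println(Arrays.toString(j.entry(2)));  // prints [4.4, 5.5]
    }
}
```

An empty nested list shows up as two equal consecutive offsets, which is why no separate count array is needed.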