How ArrayBuilder works
There are essentially three parts: (1) array representations, (2) interpretations, and (3) the ArrayBuilder.
The array representations are a Numpy and Awkward-Array replacement, implementing only the numerical, contiguously-sliced, 1- and 2-dimensional arrays of Numpy and the jagged arrays of Awkward-Array. To get Numpy's "view" capabilities (in which an array and a slice of that array share underlying data), we rely heavily on Java's Buffer class, whose subclasses include ByteBuffer (used for our RawArray) and specializations for each data type (integers and floating point numbers of various widths).
To make the translation from Numpy to Java's Buffer, we consider an underlying array with elements labeled from 0 up to (but not including) capacity and a view underlying[position:limit], where position is the first element we're viewing and limit is one after the last element we're viewing. Java's capacity, position, and limit are element numbers, not bytes. For example, if we view a DoubleBuffer as a ByteBuffer, all three values get multiplied by 8.
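The factor of 8 can be checked with a minimal sketch. The class and method names (BufferUnits, makeView, byteBounds) are hypothetical and only illustrate the unit conversion; they are not part of the library.

```java
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

class BufferUnits {
    // Build a DoubleBuffer over `capacity` doubles that views underlying[position:limit].
    static DoubleBuffer makeView(int capacity, int position, int limit) {
        DoubleBuffer view = ByteBuffer.allocate(capacity * Double.BYTES).asDoubleBuffer();
        view.limit(limit);        // set limit first so that position <= limit always holds
        view.position(position);
        return view;
    }

    // Convert {position, limit, capacity} from element units (doubles) to byte units.
    static int[] byteBounds(DoubleBuffer view) {
        return new int[] {
            view.position() * Double.BYTES,
            view.limit() * Double.BYTES,
            view.capacity() * Double.BYTES
        };
    }

    public static void main(String[] args) {
        DoubleBuffer view = makeView(10, 2, 6);   // underlying[2:6] in Numpy terms
        int[] b = byteBounds(view);
        System.out.println(view.remaining() + " doubles in view");
        System.out.println("bytes: position=" + b[0] + " limit=" + b[1] + " capacity=" + b[2]);
    }
}
```

Here the view covers 4 doubles, and the same bounds expressed in bytes are 16, 48, and 80.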
Note that Java's ByteBuffer.duplicate creates a new object pointing at the buffer but does not copy the buffer itself. We therefore use ByteBuffer.duplicate to make slices the same way we would use ndarray.__getitem__ in Numpy. Since only ByteBuffer has this method, we store all PrimitiveArray data in ByteBuffers (see this member) instead of type-specialized buffers.
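The sharing behavior of ByteBuffer.duplicate can be seen in a small sketch. The class and method names (DuplicateSharing, writeThroughDuplicate) are made up for illustration; only the java.nio calls are real.

```java
import java.nio.ByteBuffer;

class DuplicateSharing {
    // Write through a duplicate, then read the value back through the original.
    static byte writeThroughDuplicate(ByteBuffer original, int index, byte value) {
        ByteBuffer dup = original.duplicate();  // new object, same underlying bytes
        dup.put(index, value);                  // absolute write into the shared storage
        return original.get(index);             // read back through the original
    }

    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(8);
        System.out.println(writeThroughDuplicate(original, 2, (byte) 42));

        // The duplicate has its own position and limit, independent of the original.
        ByteBuffer dup = original.duplicate();
        dup.limit(6);
        dup.position(2);
        System.out.println(original.position() + " " + original.limit());
    }
}
```

The write is visible through the original because no bytes were copied, while narrowing the duplicate's position and limit leaves the original's bounds untouched. That combination is what makes a duplicate behave like a Numpy view.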
The slicing operation, underlying[position:limit], is abstracted in Array.clip, a virtual method implemented by all Array subclasses. The slicing arithmetic is done on bytes to make use of ByteBuffer.duplicate.
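As a rough sketch of what such a clip could look like for an array of doubles, assuming a ByteBuffer-backed class: ClipSketch, ITEMSIZE, of, length, and get are hypothetical names invented here, and the real Array.clip may differ in signature and detail.

```java
import java.nio.ByteBuffer;

class ClipSketch {
    static final int ITEMSIZE = Double.BYTES;   // bytes per item (assumed: doubles)
    final ByteBuffer buffer;                    // the view is buffer[position:limit], in bytes

    ClipSketch(ByteBuffer buffer) { this.buffer = buffer; }

    // Hypothetical helper: fill a new array from literal values.
    static ClipSketch of(double... values) {
        ByteBuffer b = ByteBuffer.allocate(values.length * ITEMSIZE);
        for (double v : values) b.putDouble(v);
        b.flip();                               // position = 0, limit = bytes written
        return new ClipSketch(b);
    }

    int length() { return buffer.remaining() / ITEMSIZE; }

    // Read item i, counted from the start of this view.
    double get(int i) { return buffer.getDouble(buffer.position() + i * ITEMSIZE); }

    // Analogue of array[start:stop]; start and stop are item indexes relative to this view.
    ClipSketch clip(int start, int stop) {
        ByteBuffer out = buffer.duplicate();    // shares bytes, copies nothing
        int base = buffer.position();
        out.limit(base + stop * ITEMSIZE);      // item index -> byte index
        out.position(base + start * ITEMSIZE);
        return new ClipSketch(out);
    }

    public static void main(String[] args) {
        ClipSketch a = ClipSketch.of(0.0, 1.1, 2.2, 3.3, 4.4);
        ClipSketch c = a.clip(1, 4);            // like a[1:4]
        System.out.println(c.length() + " items, first = " + c.get(0));
    }
}
```

Because each clip only narrows position and limit on a duplicate, clipping a clipped array composes the same way nested Numpy slices do.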
To handle 2-dimensional primitive arrays, we make a distinction between
- items: individual numbers, whether 1 or 2-dimensional, same usage as Numpy;
- multiplicity: the number of items in one entry, 1 for 1-dimensional;
- entry: one "event," same usage as ROOT.
Thus, itemstart and itemstop are measured in units of the specialized Buffer's elements (that is, the number of bytes divided by itemsize), whereas entrystart and entrystop are measured in the same units as ROOT's GetEntry.
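The relationship between the three units can be written out as a small sketch, assuming every entry holds exactly multiplicity items. The class and method names (EntryItemUnits, entryToItem, itemToByte) are hypothetical.

```java
class EntryItemUnits {
    // Entry index -> item index, for a fixed number of items per entry.
    static int entryToItem(int entry, int multiplicity) {
        return entry * multiplicity;
    }

    // Item index -> byte index, for a fixed number of bytes per item.
    static int itemToByte(int item, int itemsize) {
        return item * itemsize;
    }

    public static void main(String[] args) {
        int multiplicity = 3;          // e.g. a 2-dimensional array with 3 numbers per entry
        int itemsize = Float.BYTES;    // 4 bytes per float
        int entrystart = 2, entrystop = 5;

        int itemstart = entryToItem(entrystart, multiplicity);
        int itemstop = entryToItem(entrystop, multiplicity);
        System.out.println("items: [" + itemstart + ", " + itemstop + ")");
        System.out.println("bytes: [" + itemToByte(itemstart, itemsize)
                           + ", " + itemToByte(itemstop, itemsize) + ")");
    }
}
```

In this example, entries [2, 5) correspond to items [6, 15) and bytes [24, 60). For a 1-dimensional array the multiplicity is 1, so entry and item indexes coincide.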
For jagged arrays, we do the same thing as Awkward-Array with one simplification: the data contained by a jagged array are always contiguous, so we can fully define a jagged array with content and offsets. A jagged array represents a list of variable-length lists.
- content: the concatenated nested lists with no demarcations between entries;
- offsets: a 32-bit integer array of where each nested list starts and stops. This array always begins with 0 and has one more element than the number of entries. In Numpy's slice syntax, offsets[:-1] is an array of where each nested list starts (inclusive) and offsets[1:] is an array of where each nested list stops (exclusive), so there are exactly as many starts as entries and exactly as many stops as entries.
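The content and offsets layout can be illustrated with plain Java arrays. JaggedSketch and its methods are hypothetical names; for brevity this sketch copies each nested list out of content rather than returning a Buffer-backed view.

```java
import java.util.Arrays;

class JaggedSketch {
    final double[] content;   // all nested lists concatenated, no demarcations
    final int[] offsets;      // begins with 0; length is number of entries + 1

    JaggedSketch(double[] content, int[] offsets) {
        this.content = content;
        this.offsets = offsets;
    }

    int numEntries() { return offsets.length - 1; }

    // offsets[:-1] in Numpy syntax: where each nested list starts (inclusive).
    int[] starts() { return Arrays.copyOfRange(offsets, 0, offsets.length - 1); }

    // offsets[1:] in Numpy syntax: where each nested list stops (exclusive).
    int[] stops() { return Arrays.copyOfRange(offsets, 1, offsets.length); }

    // The nested list for entry i, copied out of content.
    double[] entry(int i) { return Arrays.copyOfRange(content, offsets[i], offsets[i + 1]); }

    public static void main(String[] args) {
        // Represents [[1.1, 2.2, 3.3], [], [4.4, 5.5]].
        JaggedSketch j = new JaggedSketch(
            new double[] {1.1, 2.2, 3.3, 4.4, 5.5},
            new int[] {0, 3, 3, 5});
        for (int i = 0; i < j.numEntries(); i++) {
            System.out.println(Arrays.toString(j.entry(i)));
        }
    }
}
```

An empty nested list shows up as two equal consecutive offsets, which is why the second entry above has start 3 and stop 3.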