How ArrayBuilder works
There are essentially three parts: (1) array representations, (2) interpretations, and (3) the ArrayBuilder.
The array representations are a Numpy and Awkward-Array replacement, implementing only the numerical, contiguously sliced, 1- and 2-dimensional arrays of Numpy and the jagged arrays of Awkward-Array. To get Numpy's "view" capabilities (in which an array and a slice of that array share underlying data), we rely heavily on Java's Buffer class, which has the subclass ByteBuffer (used for our RawArray) and specializations for each data type (integers and floating point numbers of various widths).
To make the translation from Numpy to Java's Buffer, we consider an underlying array with elements labeled from 0 up to (but not including) capacity and a view underlying[position:limit], where position is the first element we're viewing and limit is one after the last element we're viewing. Java's capacity, position, and limit are element numbers, not bytes. For example, if we view a DoubleBuffer as a ByteBuffer, all three values get multiplied by 8.
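The element-versus-byte distinction can be checked with a small sketch. It is not Laurelin code: the class BufferUnitsSketch and the helper toByteUnits are hypothetical names, and the byte view is obtained here by sharing one ByteBuffer through asDoubleBuffer, which is an assumption made only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

final class BufferUnitsSketch {
    // Scale an element-based {position, limit, capacity} triple into byte units.
    static int[] toByteUnits(int position, int limit, int capacity, int itemsize) {
        return new int[] {position * itemsize, limit * itemsize, capacity * itemsize};
    }

    // Set a window in element units, translate it to bytes, and read through the byte window.
    static double demo() {
        ByteBuffer bytes = ByteBuffer.allocate(10 * Double.BYTES);
        DoubleBuffer doubles = bytes.asDoubleBuffer();   // shares the bytes; capacity is 10 elements
        for (int i = 0; i < doubles.capacity(); i++) {
            doubles.put(i, i * 1.5);                     // absolute put: position is untouched
        }
        // View underlying[2:5] in element units...
        doubles.position(2);
        doubles.limit(5);
        // ...and express the same window in bytes (every number is multiplied by 8).
        int[] w = toByteUnits(doubles.position(), doubles.limit(), doubles.capacity(), Double.BYTES);
        bytes.limit(w[1]);
        bytes.position(w[0]);
        return bytes.getDouble();                        // relative read at byte 16, i.e. element 2
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Element 2 of the double view and byte 16 of the byte view address the same memory, so both windows see the same value.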
Note that Java's ByteBuffer.duplicate creates a new object pointing at the buffer but does not copy the buffer itself. We therefore use ByteBuffer.duplicate to make slices in the way that we would use ndarray.__getitem__ in Numpy. Since only ByteBuffer has this method, we store all PrimitiveArray data in ByteBuffers (see this member) instead of type-specialized buffers.
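As a minimal sketch of slicing by duplication, assuming a hypothetical helper named view (not a Laurelin method), the following shows that the duplicate gets its own position and limit while sharing the underlying bytes:

```java
import java.nio.ByteBuffer;

final class DuplicateSliceSketch {
    // Hypothetical helper: a view of underlying[start:stop], in bytes, without copying.
    static ByteBuffer view(ByteBuffer underlying, int start, int stop) {
        ByteBuffer out = underlying.duplicate();   // independent position/limit, shared bytes
        out.limit(stop);                           // set limit first so position <= limit holds
        out.position(start);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer underlying = ByteBuffer.allocate(8);
        ByteBuffer slice = view(underlying, 2, 6);
        underlying.put(2, (byte) 42);                       // write through the original...
        System.out.println(slice.get(slice.position()));    // ...and read it through the view
        System.out.println(slice.remaining());              // number of bytes in the view
        System.out.println(underlying.position());          // the original's position is unchanged
    }
}
```

Note that duplicate() does not carry over a non-default byte order, so code that relies on little-endian data would need to set the order again on the duplicate.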
The slicing operation, underlying[position:limit], is abstracted in Array.clip, a virtual method implemented by all Array subclasses. The slicing arithmetic is done on bytes to make use of ByteBuffer.duplicate.
To handle 2-dimensional primitive arrays, we make a distinction between
- items: individual numbers, whether 1 or 2-dimensional, same usage as Numpy;
- multiplicity: the number of items in one entry, 1 for 1-dimensional;
- entry: one "event," same usage as ROOT.
Thus, itemstart and itemstop are measured in units of the specialized Buffer's elements, or the number of bytes divided by itemsize, whereas entrystart and entrystop are measured in the same units as ROOT's GetEntry.
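The three units (entries, items, bytes) are related by two multiplications. The sketch below uses hypothetical helper names and an assumed branch holding three 4-byte floats per entry; it only illustrates the arithmetic, not Laurelin's actual methods.

```java
final class EntryItemUnitsSketch {
    // Entries -> items: each entry holds `multiplicity` items.
    static int entryToItem(int entry, int multiplicity) {
        return entry * multiplicity;
    }

    // Items -> bytes: each item occupies `itemsize` bytes.
    static int itemToByte(int item, int itemsize) {
        return item * itemsize;
    }

    public static void main(String[] args) {
        int multiplicity = 3;             // assumed: three numbers per entry
        int itemsize = Float.BYTES;       // assumed: 4-byte floats
        int entrystart = 10, entrystop = 20;

        int itemstart = entryToItem(entrystart, multiplicity);
        int itemstop = entryToItem(entrystop, multiplicity);
        System.out.println("items " + itemstart + ":" + itemstop);
        System.out.println("bytes " + itemToByte(itemstart, itemsize) + ":" + itemToByte(itemstop, itemsize));
    }
}
```

For a 1-dimensional array the multiplicity is 1, so item numbers and entry numbers coincide.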
For jagged arrays, we do the same thing as Awkward-Array with one simplification: the data contained by a jagged array are always contiguous, so we can fully define a jagged array with content and offsets. A jagged array represents a list of variable-length lists.
- content: the concatenated nested lists with no demarcations between entries;
- offsets: a 32-bit integer array of where each nested list starts and stops. This array always begins with 0 and has one more element than the number of entries. In Numpy's slice syntax, offsets[:-1] is an array of where each nested list starts (inclusive) and offsets[1:] is an array of where each nested list stops (exclusive), so there are exactly as many starts, and exactly as many stops, as there are entries.
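A small worked example of the content and offsets layout, using plain Java arrays rather than Laurelin's Array classes. The class and method names are hypothetical, and the nested list is copied out here for simplicity, whereas the text above describes views that share data.

```java
import java.util.Arrays;

final class JaggedLayoutSketch {
    // Nested list i is content[offsets[i]:offsets[i + 1]] (copied for simplicity).
    static double[] entry(double[] content, int[] offsets, int i) {
        return Arrays.copyOfRange(content, offsets[i], offsets[i + 1]);
    }

    // The number of entries is one less than the length of offsets.
    static int numEntries(int[] offsets) {
        return offsets.length - 1;
    }

    public static void main(String[] args) {
        // Represents [[1.1, 2.2, 3.3], [], [4.4, 5.5]].
        double[] content = {1.1, 2.2, 3.3, 4.4, 5.5};
        int[] offsets = {0, 3, 3, 5};
        for (int i = 0; i < numEntries(offsets); i++) {
            System.out.println(Arrays.toString(entry(content, offsets, i)));
        }
    }
}
```

An empty nested list shows up as two equal consecutive offsets.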
The JaggedArray class is dead code and should be deleted. JaggedArrayPrep is the real thing and should be renamed "JaggedArray". The difference is that JaggedArrayPrep has a counts attribute which, in uproot, is only used to build the JaggedArray, but in Laurelin, it is a permanent member because the "finalization" stage (described below) has to happen later than it does in uproot.
- counts: a 32-bit integer array of the length of each nested list. It has the same number of elements as the number of entries. You could say that counts is a derivative of offsets: counts = offsets[1:] - offsets[:-1].
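The relation between the two arrays is each list's stop minus its start. A minimal sketch with a hypothetical helper name, using plain int arrays in place of Laurelin's buffers:

```java
import java.util.Arrays;

final class CountsFromOffsetsSketch {
    // counts[i] = offsets[i + 1] - offsets[i]: stop minus start for each nested list.
    static int[] countsFromOffsets(int[] offsets) {
        int[] counts = new int[offsets.length - 1];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = offsets[i + 1] - offsets[i];
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 3, 3, 5};   // [[a, b, c], [], [d, e]]
        System.out.println(Arrays.toString(countsFromOffsets(offsets)));
    }
}
```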
The value of counts is that a counts array can be filled in parallel: to fill entries 100:200, one does not need to know the values of entries 0:100. When baskets are read in parallel, they don't have to lock each other to fill the whole-branch array. The value of offsets is that the data structure is random-access. The offsets are only computed once: the first time an array is finalized (and JaggedArrayPrep.clip is called), the offsets are filled.
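The following sketch illustrates why counts suits parallel filling while offsets needs a sequential pass. Everything in it is an assumption for illustration: lengthOfEntry stands in for decoding a basket, the basket size is fixed, and a parallel stream stands in for whatever threading Laurelin actually uses.

```java
import java.util.stream.IntStream;

final class ParallelCountsSketch {
    // Hypothetical stand-in for decoding a basket: the list length of one entry.
    static int lengthOfEntry(int entry) {
        return entry % 4;
    }

    // Each "basket" writes only its own disjoint range of counts, so no locking is needed.
    static int[] fillCounts(int numEntries, int basketSize) {
        int[] counts = new int[numEntries];
        int numBaskets = (numEntries + basketSize - 1) / basketSize;
        IntStream.range(0, numBaskets).parallel().forEach(b -> {
            int start = b * basketSize;
            int stop = Math.min(start + basketSize, numEntries);
            for (int i = start; i < stop; i++) {
                counts[i] = lengthOfEntry(i);
            }
        });
        return counts;
    }

    // Done once, after all baskets are in: a cumulative sum, which depends on every earlier entry.
    static int[] finalizeOffsets(int[] counts) {
        int[] offsets = new int[counts.length + 1];
        for (int i = 0; i < counts.length; i++) {
            offsets[i + 1] = offsets[i] + counts[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] counts = fillCounts(10, 3);
        int[] offsets = finalizeOffsets(counts);
        System.out.println(offsets[offsets.length - 1]);   // total length of content
    }
}
```

Filling offsets directly would require each basket to know the running total of all earlier baskets, which is exactly the dependency that counts avoids.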
I wasn't sure, so I looked at PrimitiveArray.subarray, which is implemented. It looks like array.subarray() in Laurelin means what array[0] would mean in Numpy/Awkward: reduce an N-dimensional array to an (N-1)-dimensional array by taking the first element in the first dimension.
I think this was a later addition because Spark's ColumnVector works in a surprising way: TTreeColumnVector calls ArrayBuilder.getArray knowing which rowId (i.e. entry number) we want, so we view (i.e. clip) the Array starting at exactly the entry we want. So the view gives us [what_we_want, other stuff...] and we need to pick the zeroth element of that. Thus, subarray is a method whose purpose is to select the first element of a more-than-one dimensional array (either fixed-size 2-dimensional or jagged).
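For the fixed-size 2-dimensional case, the clip-then-subarray pattern can be sketched as follows. This is an illustration of the interpretation above, not Laurelin's implementation: the method names mirror the text, but the signatures are assumptions, and the arrays are copied rather than viewed.

```java
import java.util.Arrays;

final class SubarraySketch {
    // Entries [entrystart, entrystop) of a flattened 2-D array with `multiplicity` items per entry.
    static double[] clip(double[] flat, int multiplicity, int entrystart, int entrystop) {
        return Arrays.copyOfRange(flat, entrystart * multiplicity, entrystop * multiplicity);
    }

    // The first entry of an already-clipped array: what array[0] would mean in Numpy.
    static double[] subarray(double[] clipped, int multiplicity) {
        return Arrays.copyOfRange(clipped, 0, multiplicity);
    }

    public static void main(String[] args) {
        // Four entries of three items each: [[0,1,2],[3,4,5],[6,7,8],[9,10,11]].
        double[] flat = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        int rowId = 2;
        double[] view = clip(flat, 3, rowId, 4);                 // [what_we_want, other stuff...]
        System.out.println(Arrays.toString(subarray(view, 3)));  // just the entry at rowId
    }
}
```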
With this interpretation, it looks like JaggedArrayPrep.subarray should be
    @Override
    public Array subarray() {
        // The first nested list: content between the first start and the first stop.
        return this.content.clip(this.offsets.get(0), this.offsets.get(1));
    }
after ensuring that offsets is not null (perhaps the code that generates it should be moved out of JaggedArrayPrep.clip into a private method that both call).
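One way to picture that refactoring is the simplified sketch below. It is not the Laurelin class: LazyJaggedSketch and ensureOffsets are hypothetical names, plain arrays replace the Array and buffer types, data are copied instead of viewed, and no thread safety is attempted.

```java
import java.util.Arrays;

final class LazyJaggedSketch {
    private final double[] content;
    private final int[] counts;
    private int[] offsets;              // null until first needed

    LazyJaggedSketch(double[] content, int[] counts) {
        this.content = content;
        this.counts = counts;
    }

    // The shared private method: compute offsets from counts exactly once.
    private void ensureOffsets() {
        if (offsets != null) {
            return;
        }
        offsets = new int[counts.length + 1];
        for (int i = 0; i < counts.length; i++) {
            offsets[i + 1] = offsets[i] + counts[i];
        }
    }

    // Entries [entrystart, entrystop) as a new jagged array.
    LazyJaggedSketch clip(int entrystart, int entrystop) {
        ensureOffsets();
        return new LazyJaggedSketch(
            Arrays.copyOfRange(content, offsets[entrystart], offsets[entrystop]),
            Arrays.copyOfRange(counts, entrystart, entrystop));
    }

    // The first nested list; assumes the array has at least one entry.
    double[] subarray() {
        ensureOffsets();
        return Arrays.copyOfRange(content, offsets[0], offsets[1]);
    }

    public static void main(String[] args) {
        // Represents [[1, 2, 3], [], [4, 5]].
        LazyJaggedSketch jagged = new LazyJaggedSketch(new double[] {1, 2, 3, 4, 5}, new int[] {3, 0, 2});
        System.out.println(Arrays.toString(jagged.clip(2, 3).subarray()));
    }
}
```

With the offsets computation in one place, subarray no longer depends on clip having been called first.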