How ArrayBuilder works
There are essentially three parts: (1) array representations, (2) interpretations, and (3) the `ArrayBuilder`.
The array representations are a Numpy and Awkward-Array replacement, implementing only the numerical, contiguously-sliced, N-dimensional arrays of Numpy and jagged arrays of Awkward-Array. To get Numpy's "view" capabilities (in which an array and a slice of that array share underlying data), we rely heavily on Java's `Buffer` class, which has subclasses `ByteBuffer` (used for our `RawArray`) and specializations for each data type (integers and floating point numbers of various widths).
To make the translation from Numpy to Java's `Buffer`, we consider an underlying array with elements labeled from `0` to `capacity` and a view `underlying[position:limit]`, where `position` is the first element we're viewing and `limit` is one after the last element we're viewing. Java's `capacity`, `position`, and `limit` are element numbers, not bytes. For example, if we view a `DoubleBuffer` as a `ByteBuffer`, all three positions get multiplied by 8.
Note that Java's `ByteBuffer.duplicate` creates a new object pointing at the buffer but does not copy the buffer itself. We therefore use `ByteBuffer.duplicate` to make slices the way we would use `ndarray.__getitem__` in Numpy. Since only `ByteBuffer` has this method, we store all `PrimitiveArray` data in `ByteBuffer`s (see this member) instead of type-specialized buffers.
The slicing operation, `underlying[position:limit]`, is abstracted in `Array.clip`, a virtual method implemented by all `Array` subclasses. The slicing arithmetic is done on bytes to make use of `ByteBuffer.duplicate`.
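As a minimal sketch of this duplicate-based slicing (the class and helper here are hypothetical, not Laurelin's actual `Array.clip`), note that the view and the original share the same backing bytes, just like a Numpy view:

```java
import java.nio.ByteBuffer;

public class ClipDemo {
    // Slice underlying[itemstart:itemstop] without copying: duplicate() makes a
    // new ByteBuffer object over the same data, and position/limit are set in
    // bytes (item index times itemsize).
    static ByteBuffer clip(ByteBuffer underlying, int itemsize, int itemstart, int itemstop) {
        ByteBuffer view = underlying.duplicate(); // new object, same backing data
        view.position(itemstart * itemsize);
        view.limit(itemstop * itemsize);
        return view;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(5 * 8); // room for five doubles
        for (int i = 0; i < 5; i++) buf.putDouble(i * 8, (double) i);

        ByteBuffer view = clip(buf, 8, 1, 4);                // like underlying[1:4]
        System.out.println(view.getDouble(view.position())); // 1.0

        buf.putDouble(1 * 8, 100.0);                         // write through the original...
        System.out.println(view.getDouble(view.position())); // 100.0: data are shared
    }
}
```

Because `position`, `limit`, and `capacity` are element counts in the type-specialized buffers but byte counts in `ByteBuffer`, storing everything as `ByteBuffer` keeps the slice arithmetic uniformly in bytes.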
To handle N-dimensional primitive arrays, we make a distinction between
- items: individual numbers, whether 1 or N-dimensional, same usage as Numpy;
- multiplicity: the number of items in one entry, 1 for 1-dimensional;
- entry: one "event," same usage as ROOT.
Thus, `itemstart` and `itemstop` are measured in units of the specialized Buffer's elements (the number of bytes divided by `itemsize`), whereas `entrystart` and `entrystop` are measured in the same units as ROOT's `GetEntry`.
From the metadata directly in the `PrimitiveArray` class, we can only see the distinction between the top-level entry size and the bottom-level number of items in each entry; it would seem that the distinction of how (for example) a 1000000×4×3 array differs from a 1000000×12 array is lost. Actually, this "shape" (in Numpy terms) is in the `Interpretation`, which for `PrimitiveArray`s is always an `AsDtype` (with a `List` of dims). The `interpretation` is a member of the `Array`.
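To make that concrete with a toy calculation (hypothetical helper names, not Laurelin code): the per-entry multiplicity is the same for both shapes, and only the `Interpretation`'s dims can tell them apart:

```java
public class ItemArithmetic {
    // For a fixed-shape array of entries, the multiplicity is the product of
    // the per-entry dimensions: a 1000000 x 4 x 3 array and a 1000000 x 12
    // array have the same multiplicity (12), so the entry/item bookkeeping in
    // PrimitiveArray cannot distinguish them; the dims can.
    static long multiplicity(int[] dims) {
        long m = 1;
        for (int d : dims) m *= d;
        return m;
    }

    // itemstart/itemstop are entry numbers scaled by the multiplicity.
    static long entryToItem(long entry, int[] dims) {
        return entry * multiplicity(dims);
    }

    public static void main(String[] args) {
        System.out.println(multiplicity(new int[]{4, 3})); // 12
        System.out.println(multiplicity(new int[]{12}));   // 12: same bookkeeping
        System.out.println(entryToItem(100, new int[]{4, 3})); // 1200
    }
}
```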
For jagged arrays, we do the same thing as Awkward-Array with one simplification: the data contained by a jagged array are always contiguous, so we can fully define a jagged array with `content` and `offsets`. A jagged array represents a list of variable-length lists.
- content: the concatenated nested lists with no demarcations between entries;
- offsets: a 32-bit integer array of where each nested list starts and stops. This array always begins with `0` and has one more element than the number of entries. In Numpy's slice syntax, `offsets[:-1]` is an array of where each nested list starts (inclusive) and `offsets[1:]` is an array of where each nested list stops (exclusive); there are exactly as many starts as entries and exactly as many stops as entries.
The `JaggedArray` class is dead code and should be deleted. The `JaggedArrayPrep` is the real thing and should be renamed "`JaggedArray`". The difference is that `JaggedArrayPrep` has a `counts` attribute which, in uproot, is only used to build the `JaggedArray`, but in Laurelin, it is a permanent member because the "finalization" stage (described below) has to happen later than it does in uproot.
- counts: a 32-bit integer array of the length of each nested list. It has the same number of elements as the number of entries. You could say that `counts` is a derivative of `offsets`: `counts = offsets[1:] - offsets[:-1]`.
The value of `counts` is that a counts array can be filled in parallel: to fill entries `100:200`, one does not need to know the values of entries `0:100`. When baskets are read in parallel, they don't have to lock each other to fill the whole-branch array. The value of `offsets` is that the data structure is random-access. The `offsets` are only computed once: the first time an array is finalized (and `JaggedArrayPrep.clip` is called), the `offsets` are filled.
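The two representations convert into each other as follows (a sketch with hypothetical helper names; `JaggedArrayPrep` performs the `counts`-to-`offsets` cumulative sum internally, once):

```java
import java.util.Arrays;

public class JaggedOffsets {
    // counts can be filled in parallel (each basket knows its own list lengths);
    // offsets need a cumulative sum, done once at finalization.
    static int[] countsToOffsets(int[] counts) {
        int[] offsets = new int[counts.length + 1];
        offsets[0] = 0;                          // offsets always begin with 0
        for (int i = 0; i < counts.length; i++)
            offsets[i + 1] = offsets[i] + counts[i];
        return offsets;
    }

    // the inverse: counts = offsets[1:] - offsets[:-1] in Numpy terms
    static int[] offsetsToCounts(int[] offsets) {
        int[] counts = new int[offsets.length - 1];
        for (int i = 0; i < counts.length; i++)
            counts[i] = offsets[i + 1] - offsets[i];
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = {3, 0, 2, 1};
        int[] offsets = countsToOffsets(counts);
        System.out.println(Arrays.toString(offsets)); // [0, 3, 3, 5, 6]
        // random access: entry 2 spans content[offsets[2]:offsets[3]], i.e. content[3:5]
        System.out.println(Arrays.toString(offsetsToCounts(offsets))); // [3, 0, 2, 1]
    }
}
```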
The intention is for jagged arrays to be treated as any other kind of array. It would even be possible to express nested jagged arrays, but they don't appear in columnar form in ROOT data, so they're out of scope for Laurelin.
I wasn't sure, so I looked at `PrimitiveArray.subarray`, which is implemented. It looks like `array.subarray()` in Laurelin means what `array[0]` would mean in Numpy/Awkward: reduce an `N`-dimensional array to an `N-1`-dimensional array by taking the first element in the first dimension.
I think this was a later addition because Spark's `ColumnVector` works in a surprising way: `TTreeColumnVector` calls `ArrayBuilder.getArray` with knowledge of which `rowId` (i.e. entry number) we want, so we view (i.e. `clip`) the `Array` starting at exactly the entry we want. The view gives us `[what_we_want, other stuff...]`, and we need to pick the zeroth element of that. Thus, `subarray` is a method whose purpose is to select the first element of a more-than-one-dimensional array (either fixed-size 2-dimensional or jagged).
With this interpretation, it looks like `JaggedArrayPrep.subarray` should be

```java
@Override
public Array subarray() {
    return this.content.clip(this.offsets.get(0), this.offsets.get(1));
}
```

after ensuring that `offsets` is not `null` (perhaps the code that generates it should be moved out of `JaggedArrayPrep.clip` into a private method that both call).
Interpretations are both metadata for an `Array`, describing what would in Numpy be called its `dtype` and `shape`, and a machine for converting bytes from ROOT TBaskets into an `Array`. The workflow for this machine is as follows:
- An entry range for a TBranch is specified. We'll want an `Array` representing all the data from `entrystart` to `entrystop`, with no regard for which basket the data are in.
- A specific `Interpretation` is given for the TBranch. This is determined by inspecting TBranch metadata, such as the `TLeaf`.
- TBranch metadata doesn't specify the number of items (which we need to allocate the array), but it does specify the number of bytes and the number of entries, and for all the data types Laurelin supports, we can exactly infer the number of items by division: the builder calls `Interpretation.numitems`. (There are other types, such as strings, for which a slight overestimate can be inferred; we've been lucky with the metadata ROOT gives us, since it wasn't intended for conversion to preallocated arrays.)
- If the entry range corresponds to no data, the `Interpretation` can produce an empty `Array`: the builder calls `Interpretation.empty`.
- Otherwise, the builder identifies which baskets the `entrystart`, `entrystop` range covers and launches a task for each TBasket. Before forking into tasks, it asks the `Interpretation` to create an empty output array using `Interpretation.destination`. The size of this array rounds up to the nearest TBasket boundaries; it may start before `entrystart` and/or stop after `entrystop`, with the understanding that we will clip off the excess in the finalization step.
- Each task (which may be run in parallel in an Executor) calls `Interpretation.fromroot` to convert one TBasket into an array for just that TBasket. This is the one and only point where ROOT bytes are converted into an array.
- Still in the possibly-parallel task, `Interpretation.fill` is called to copy the single-TBasket array into the destination array.
- In uproot, we finish by clipping the excess entries that we got by rounding up to the nearest TBaskets and "finalizing," which means converting `counts` to `offsets` in jagged arrays and other things in more exotic data types like `Double32`. In Laurelin, these steps are postponed until somebody calls `ArrayBuilder.getArray`, to give control to the calling code so that it can perform other tasks between constructing the `ArrayBuilder` and extracting the results with `getArray`. This, for instance, lets TBaskets from different TBranches be read in parallel, filling up the Executor and avoiding serialization points. (Uproot does this with a more complex composition of futures.)
- When `TTreeColumnVector` does call `ArrayBuilder.getArray`, this calls the `Interpretation`'s last two steps: `Interpretation.clip` (to remove excess entries from the rounded-up TBaskets and to remove everything below the desired `rowId`, since we know that in Laurelin) and `Interpretation.finalize`. In uproot, these are only called once, but in Laurelin, they're called for every `getArray`. However, they're very lightweight operations (remember that `JaggedArrayPrep` only calculates `offsets` once).
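The sequence above can be sketched as a toy, in-memory workflow for a 1-dimensional numeric branch (all names and signatures here are simplified assumptions, not the real `Interpretation` API; the comments map each step to the hooks named above):

```java
import java.util.Arrays;

public class WorkflowSketch {
    // Preallocate a destination rounded up to basket boundaries, "fill" each
    // basket's array into it (possibly in parallel tasks), then "clip" to the
    // requested entries. This shows only the order of operations.
    static int[] build(int[][] baskets, int entrystart, int entrystop) {
        int total = 0;
        for (int[] basket : baskets) total += basket.length; // Interpretation.numitems

        int[] destination = new int[total];                  // Interpretation.destination
        int itemstart = 0;
        for (int[] basket : baskets) {                       // one task per TBasket
            // Interpretation.fromroot would decode raw bytes here; our toy
            // baskets are already arrays, so fromroot is the identity.
            System.arraycopy(basket, 0, destination, itemstart, basket.length); // fill
            itemstart += basket.length;
        }
        return Arrays.copyOfRange(destination, entrystart, entrystop); // clip
    }

    public static void main(String[] args) {
        int[][] baskets = {{10, 11, 12}, {13, 14}};
        System.out.println(Arrays.toString(build(baskets, 1, 4))); // [11, 12, 13]
    }
}
```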
This may sound like a complex process, but it provides all the hooks for any type of data to be interpreted with the same interface: `ArrayBuilder` knows nothing about data types, and `Interpretation`s are stateless tools wielded by the `ArrayBuilder`. It separates the single-TBasket interpretation process from the allocation and filling of a destination, which additionally allows the TBasket reading (and decompression!) to be parallelized. Some steps, like "finalization," are not relevant for many data types, but exist for those that need them.
Furthermore, the process can be composed. The `AsJagged` interpretation contains its content's interpretation (for Laurelin, only `AsDtype`, but uproot shows how this can be generalized). Each of these stages, like `AsJagged.fromroot`, does what it needs to do and then calls its content's `fromroot`, passing down the chain.
The `Interpretation` framework has just enough complexity to permit that, significantly reducing the number of special cases that would otherwise have to be handled to cover all combinations of data types (in a columnar way).
Whereas an `Interpretation` has hooks that each perform one type-aware step, the `ArrayBuilder` controls the process in a type-unaware way.
The `ArrayBuilder` probably shouldn't live in the `array` subpackage because it isn't an `Array` subclass. It's conceptually closer to the `Interpretation`s, but should probably be up in the `laurelin` package.
`ArrayBuilder` does most of its work in its constructor. By this point, we already know the `Interpretation` and the `entrystart`/`entrystop` range, so there's no reason not to start reading. (If you want to delay reading, delay constructing the `ArrayBuilder`.)
The `basketEntryOffsets` are a given; they come directly from the TBranch metadata. First, we perform some sanity checks, then use them to determine the first and last TBasket touched by `entrystart`/`entrystop`. Each of these TBaskets will launch a reading task.
We don't need very much information from the TBasket's TKey: just `fKeylen`, `fLast`, and `fObjlen`. However, getting this information means seeking to a part of the disk that hasn't been touched (loaded into the OS's VM cache) yet, which is slow enough that you don't want to do it for all TBranches and all TBaskets in the whole file. You want to do it only for those TBranches and TBaskets that are explicitly requested, and we don't know which ones those are until we've inspected the `basketEntryOffsets`, which is why `ArrayBuilder` requests them through a callback.
If we're not going to be looking at any TBaskets at all, we call `Interpretation.empty()` and leave. Otherwise, we call those callbacks so that we know all the `fKeylen`, `fLast`, and `fObjlen` values. The number of uncompressed bytes for the contents of a TBasket is `fLast - fKeylen`, which is computed here. With this and the also-known `numentries`, we can ask the `Interpretation` for the number of items.
If this is a numeric array, the `numitems` is the whole TBasket (i.e. equal to `fObjlen`). If it's a jagged array, the `numitems` corresponds to just the first part of the TBasket. The whole structure is as follows:
```
                                          4 bytes          4 bytes
┌─────────┬────────────────────────────────┬───┬────────────┬───┐
│  TKey   │            content             │ X │  offsets   │ x │
└─────────┴────────────────────────────────┴───┴────────────┴───┘
          │←       fLast - fKeylen        →│                    │
          │                                                     │
          │←                      fObjlen                      →│
```
So this `fLast - fKeylen` is also known (by me) as the "`border`", separating the `content` and `offsets` of a jagged array. One way to determine whether the TBasket represents jagged data is by comparing `border == fObjlen`, but the other way is just to see if the `Interpretation` is `AsJagged`. This could be a sanity check, but isn't currently.
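A sketch of the byte arithmetic on this diagram (the TKey field names are real; the numeric values are invented for illustration):

```java
public class BasketLayout {
    // The uncompressed basket payload is fObjlen bytes. The first
    // fLast - fKeylen bytes (the "border") are content; for a jagged branch,
    // the byte offsets sit between two 4-byte markers at the end.
    static long border(long fLast, long fKeylen) {
        return fLast - fKeylen;
    }

    public static void main(String[] args) {
        long fKeylen = 89, fLast = 489, fObjlen = 540;  // made-up example values
        long border = border(fLast, fKeylen);           // content is bytes [0, border)
        boolean jagged = border != fObjlen;             // an offsets region is present
        long offsetsStart = border + 4;                 // skip the 4-byte marker
        long offsetsStop = fObjlen - 4;                 // stop before the trailing 4 bytes
        System.out.println(border + " " + jagged + " " + offsetsStart + " " + offsetsStop);
    }
}
```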
Now we have the "item offsets" and "entry offsets" of all relevant TBaskets, which we need to know before reading, so this is a synchronization point: the `GetBasket` callbacks can't be parallelized with the basket-reading tasks (though they could be parallelized with each other before this synchronization point, but that's probably overkill). The item offsets specify the indexes where each TBasket starts and stops at the level of items (e.g. for a jagged `content` array), and the entry offsets specify where each TBasket starts and stops at the level of entries (e.g. for its `offsets`). For a 1-dimensional numerical array, the item offsets and entry offsets would be the same.
Knowing the total number of items and the total number of entries, we can allocate a destination array. All the basket-reading tasks will fill this array, but they don't need to be synchronized because they'll be filling disjoint partitions (delineated by "item offsets" and "entry offsets").
At this point, we can fork for parallelism: this loop launches the rest of the process in one task per TBasket. If the `executor` is `null`, then it runs each TBasket task in the main thread, in order. That's good for debugging.
Each task is a `CallableFill` instance with an unimportant return value (`FutureTask`) because what they do is 100% side-effect: they fill the destination array with data. We do need a list of these futures in order to ensure that they're done before somebody tries to read from this array using `getArray`.
Although `i` and `j` are dummy variables, they acquired a meaning in the development of uproot: `i` is the absolute TBasket index and `j` is the index among relevant TBaskets, only the ones being read by this `ArrayBuilder`. Thus, lists like `basket_itemoffset` and `basket_entryoffset` (the "item offsets" and "entry offsets" above) are indexed by `j`, because we didn't receive any item/entry information about TBaskets we won't be reading, but requests to the `getbasket` callback are indexed by `i`, because that's global.
Looking at this now, I see three suspicious uses of `i` rather than `j`:
- It would seem that `basket_entryoffset` should always be indexed by `j`, because it was created using only as many baskets as we're reading, and yet on these lines it's called with `i`. It's possible that I confused this `basket_entryoffset` with the global `basketEntryOffsets` that comes directly from the TBranch metadata. I'm sorry about the very similar names; they were both adhering to conventions as they developed.
- In fact, I also see a point where `basketEntryOffsets` is indexed with `j`, but it should always be indexed with `i`, because it's a global thing.
- Similarly to the first case, `basketkeys` is defined only for the TBaskets of interest, but in this block, it's indexed with `i`.
These are highly suspicious indexes; I don't know how they could have gotten this confused, considering that this was a point of focus during development. In fact, I half remember fixing these issues, but I'm only looking at the master branch right now, and maybe those fixes haven't gone in. The distinction between `i` and `j` only doesn't matter if `entrystart < len(first basket)`, such as `entrystart == 0`. Could it be that this is true of all the tests?
Getting back to the story, we have just entered `CallableFill.call` in each task. We set `local_entrystart` and `local_entrystop` as the entries of interest within this one TBasket. Unless this is the first or the last TBasket in the run, `local_entrystart == 0` and `local_entrystop == len(basket)`.
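A sketch of that clamping (hypothetical helper; Laurelin computes this inline):

```java
import java.util.Arrays;

public class LocalEntries {
    // Clamp the requested [entrystart, entrystop) into one basket's own
    // coordinates; basketBegin/basketEnd are the basket's global entry range.
    static long[] localRange(long entrystart, long entrystop, long basketBegin, long basketEnd) {
        long localStart = Math.max(0, entrystart - basketBegin);
        long localStop = Math.min(basketEnd - basketBegin, entrystop - basketBegin);
        return new long[]{localStart, localStop};
    }

    public static void main(String[] args) {
        // baskets covering entries [0,100), [100,200), [200,300); request [150,250)
        System.out.println(Arrays.toString(localRange(150, 250, 100, 200))); // [50, 100]
        System.out.println(Arrays.toString(localRange(150, 250, 200, 300))); // [0, 50]
    }
}
```

For a basket entirely inside the requested range, this yields `0` and the basket's own length, as stated above.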
Then, if the data are NOT jagged, we read all the bytes and interpret them with `fromroot`. If the data ARE jagged, we read the content part and the offsets part separately, using the diagram above to determine the locations of the slices (i.e. avoiding the 4 bytes below and above the offsets). The whole basket is read (and decompressed), but the `Interpretation` only needs to process from `local_entrystart` to `local_entrystop`, returning an array just big enough for the relevant entries.
(Note: the `byteoffsets` drawn directly from the ROOT TBasket, illustrated in the diagram above, are starting positions in bytes for each entry relative to the beginning of the TKey. To get jagged array offsets, we have to subtract off `fKeylen` and divide by `itemsize`.)
Now we need to fill the destination array with the source array, putting the subset of data actually read into the right spot. Some data types (none in Laurelin's scope, I believe) don't know the exact number of items until after interpretation, so `Interpretation.source_numitems` potentially refines the estimate from earlier (always smaller; the preallocated array is big enough).
This curious block of code corrects the first and last basket for incomplete reads. If we're in the first TBasket of the group (`j == 0`), then we might need to bump up `basket_itemoffset[0]` and `basket_entryoffset[0]` so that we put this source into the destination where it will exactly abut the region filled by `j == 1`. Similarly for the last TBasket (`j + 1 == basketstop - basketstart`). This block of code changes only the first and last `basket_itemoffset` and `basket_entryoffset` in place, none of the middle ones, and there's no danger of a race condition because different tasks change different elements of these Java arrays.
Now that we know exactly where to copy the source data into the destination array, we do it.
The next two phases, "clip" and "finalize", require the whole array to be filled, so they're not done in `CallableFill`; they're done in `ArrayBuilder.getArray`. That method blocks until all basket-tasks are done (the `Future<Boolean>` result, always `true` if the task completes at all, is ignored). Then, because `getArray` is called knowing the `rowId` and `count` of interest, we don't `clip` down to `basket_entryoffset[0]:basket_entryoffset[-1]` as uproot does, but to the precise `rowId:rowId + count` requested by the calling code. We do need to shift `rowId` by `basket_entryoffset[0]`, because rows start counting at the `entrystart` of the read.
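The clipping arithmetic can be sketched as (hypothetical names; the real work is an `Array.clip` on the finished destination):

```java
public class GetArrayClip {
    // getArray knows the rowId and count Spark wants, so we clip to exactly
    // that range; rowId counts from the entrystart of this read, so it is
    // shifted by the first entry offset before clipping.
    static long[] clipRange(long rowId, long count, long firstEntryOffset) {
        long start = firstEntryOffset + rowId;
        return new long[]{start, start + count};
    }

    public static void main(String[] args) {
        // a read whose first entry offset is 100 (made-up numbers)
        long[] r = clipRange(20, 50, 100);
        System.out.println(r[0] + ":" + r[1]); // 120:170
    }
}
```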
Strangely enough, there's no "finalize" step here at all. However, for the set of data types in scope for Laurelin, only `AsJagged` would have made use of it, converting a `counts`-based "prep" array into an `offsets`-based `JaggedArray`. That's not something we want to do more than once, and `ArrayBuilder.getArray` is called many times by `TTreeColumnVector`, so I guess that's why the "finalize" phase was removed and `JaggedArrayPrep` became the new `JaggedArray`.
After that, the arrays are given up for `TTreeColumnVector` and (if more than 1-dimensional) `ArrayColumnVector` to look at. The strict calling convention imposed by Spark modified the direct copy-over-from-uproot that I had intended, but only in the last two steps.
As much as possible, I tried to use the same variable names and code layout as in uproot, so that the two would be comparable side by side. (This fact would be good to use to diagnose the suspicious `i`-versus-`j` indexing, which really rings a bell; I think I did exactly that, and we're looking at reverted code here!)
The equivalent of `ArrayBuilder` is in `uproot/tree.py`, and the interpretations are all in `uproot/interp`. Specifically, the entire lifecycle of a TBranch read is in `TBranchMethods.array`, and the interpretations implemented in Laurelin are `uproot.interp.asdtype` (with its superclass, `_asnumeric`) and `uproot.interp.asjagged`.
After recovering bad TBaskets and normalizing user input (two things Laurelin doesn't do), uproot finds the `basketstart` and `basketstop` for this `entrystart` and `entrystop` and computes `basket_itemoffset` and `basket_entryoffset`, just like `ArrayBuilder` (with the same names). In between, it does some things Laurelin doesn't do, like requesting remote protocols to preload the basket byte ranges and interacting with a user-provided cache, but the early return of `Interpretation.empty` is something that Laurelin does.
Once all of the `basket_itemoffset` and `basket_entryoffset` boundaries are known, uproot allocates the destination array, just like Laurelin. The forking into parallel tasks is done by defining a `fill(j)` function in uproot, which is the analog of `CallableFill` in Java. Much like the `basket_itemoffset`/`basket_entryoffset` calculation, uproot calls a function to compute `local_entrystart`, `local_entrystop` instead of putting the logic inline as Laurelin does, so in uproot, you have to chase those private method definitions.
Similarly, the whole logic of reading a basket from ROOT is in one method call in uproot, which takes you to `TBranchMethods._basket`. This slices up the TBasket into its content and offsets, if applicable, though that looks a little different when written with Numpy arrays. There's also a bit of noise regarding uproot's user-supplied `basketcache` and getting a key in a thread-safe way with a user-supplied `keycache`, but the latter can be regarded as a stand-in for the `GetBasket` callback. As in Laurelin, this method calls `interpretation.fromroot` to actually turn the ROOT bytes into a source array.
Back in `TBranchMethods.array`, the next block looks exactly as it does in `ArrayBuilder`: adjusting the `basket_itemoffset` and `basket_entryoffset` for the first and last TBasket not being complete. And then we fill the source array into the destination array.
Putting tasks in an executor is similar in Python to what it is in Java, though I have to take some care to be sure that exceptions raised in the threads make their way onto the main thread (by passing them as return values).
Then the rest of the function is the "clip" and "finalize" phases inside a `wait` function, which I use as a simple futures type. (Calling this zero-argument function blocks until all tasks are done, via `for excinfo in excinfos: ...`, and then performs the "clip" and "finalize". I can chain these futures by delaying calls to the `wait` functions at a higher level.)
Similar parts of this logic can be found in `TBranchMethods._step_array`, which is optimized for use in `uproot.iterate`, but all other array-fetching functions go through `TBranchMethods.array`.
The uproot.interp.asdtype
and uproot.interp.asjagged
implementations are even more similar to the Laurelin ones, with all the same method names and behaviors.
The end!