-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate size and capacity of LGDO ArrayLike objects #109
base: main
Are you sure you want to change the base?
Conversation
… changes to len/capacity management
… instead of returning n_read
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #109 +/- ##
==========================================
+ Coverage 78.30% 78.59% +0.28%
==========================================
Files 46 46
Lines 3384 3467 +83
==========================================
+ Hits 2650 2725 +75
- Misses 734 742 +8 ☔ View full report in Codecov by Sentry. |
I had a chance to test the performance change on perlmutter. I tested by reading data from 856 files, totalling 5.3 GB read into memory. For my test, I did a single timed read followed by a %timeit read, to differentiate the first read, which is dominated by file caching. Here are my findings: In terms of memory usage, the total memory used while reading was quite similar (during copies additional memory needs to be allocated); when all is said and done, the new version was using ~8 GB while the old was using 5.3 GB. This could be remedied by trimming the arrays. So, in the end there is a performance improvement, but it is small relative to the time spent reading files. This is especially true on first read where we would probably see a ~5% improvement in speed (in most cases this will be all that matters). It should be noted that the total time spent allocating/copying memory scales as O(nlogn) with changes and O(n^2) pre-change, so if we continued to read in more data the timing discrepancy could grow enough that the time copying is comparable to the time reading. |
See: #107
Added an Abstract Base Class for LGDOContainers, which are ArrayLike objects and can be placed inside of tables. LGDOContainers are expected to implement methods for manipulating the capacity and size of the containers. Capacity controls the number of rows in memory, while size controls the length of the container; capacity may be larger than size, resulting in "padding" at the end. The advantage of this approach is that it enables changes in size without reallocating memory (always for shrinking, most of the time for growing). The cost is the extra memory used by this padding (although resizes require allocating a new array and then copying data over, requiring a spike in memory usage, so this cost is not so big). Methods and options have been added for "trimming" the capacity to match the size of the array in case the user prefers to not have this padding for some reason.
As a result of this change, the returned values for
core.read
andstore.read
have (sometimes) changed. Before, when reading an LGDO object in place, they would return the number of rows read. Now, they resize the array toobj_buf_start + n_rows_read
. The reason for the old behavior was that we did not want to shrink arrays when not enough rows were in the file to fill a Table in place, since the reallocation could cause problems. Since shrinking no longer requires reallocation, this is no longer problematic. A major benefit is that the read functions are both simpler to use since they have consistent returns. It should be noted that this change will break code in other LEGEND packages.When resizing, if the capacity is smaller than the new size, the behavior is to reserve more capacity up to the next power of 2 larger than the new size. When reading a huge number of files, this will reduce the time spent reallocating/copying arrays from O(n) to O(log(n)); it is not clear to me how important this is, but someone can try running tests of reading a large number of files to check.
Detailed changes:
reserve_capacity
,get_capacity
, andtrim_capacity
defined. In some cases, this involved implementing other methods likeappend
andinsert
.nda
is now a property, which returns a slice (of lengthlen(obj)
) of the internal data buffer (which has length equal to capacity). This will preserve the old behavior as far as the user is concerneddtype
was turned into a property in Array, AoESA and VoV.n_dims
was for VoVcore.read
resizes the returned array to sizeobj_buf_start + n_rows_read
. Note that this resize could probably be done in the_serializer.read
functions in the future (which still returnn_rows_read
)store.read
has been simplified, and now just callscore.read
aftergimme_files
. This reduces redundant codelh5.iterator
no longer returnsn_rows_read
n_rows_read
. Most of them were filling it into_
; for the ones that actually had asserts onn_rows_read
, these were replaced with asserts onlen(obj)
loc
, which was a member variable that was more or less used to enable the size to be different from capacity before; since this is now a core feature, this is unnecessary. It does not seem to play any role in our other code packagesLGDOContainer
methods for consistency