-
Notifications
You must be signed in to change notification settings - Fork 14
ISM elaborations
ISMs (and mISMs) are a little big and a little slow. Here's a page for considering elaborations on the class.
Three interrelated ideas:
- Create class extension or slot within existing classes to indicate ISMs whose entries are stored in row-major order.
- For ISMs/bISMs with entries in row-major order, instead of storing a rows slot with lots of repeats, eg.
c(1,1,2,2,2,4)
, store an integer vector of length equal to the number of rows in the ISM (stored indim
slot) that simply records how many finite entries there are in each row --c(2,3,0,1)
for the example just given. (If you need to regenerate the long formrows
vector, pass this one as the second argument, "shortrows
", inrep(1:theISM@dim[1], shortrows)
.)- To convert from repeat-version to count-version,
tabulate
will be useful, e.g.tabulate(theISM@rows, nbins = theISM@dim[1])
. (The use of thenbins
argument covers cases where the bottom row(s) may be unmatchable.)
- To convert from repeat-version to count-version,
- Add structure to the ISM in order to quickly determine whether 2 ISMs have finite entries at the same matrix positions, stored in the same order (so that adding the ISMs is just adding the entries).
@josherrickson's functions sort.BlockedInfinitySparseMatrix
and sort.InfinitySparseMatrix
, introduced in 06cfe7e, could be adapted to convert ISMs and bISMs into elaborated structures that follow row-major order and condense the representation of the rows. When we found ourselves combining 2 of this new type of structure, we'd first check to see whether they have the same finite entries at the same matrix positions. If so, then the requested arithmetic is just done on both ISMs' data
. If not, then we'd convert to long format and use existing ISM/bISM routines to combine them.
E.g., by hashing those positions. Presently we hash ISMs elsewhere in optmatch, without using the result for very much -- we might adapt and/or repurpose that work.
Some of the options:
a. Just hash the arc set, not the distances. Ditch optmatch
objects' "hashed.distances"
attributes. Associate w/ optmatch object under a new name, eg attr(mout, "hashed.arcset")
b. Hash the arc set as well as the ISM as a whole. Keep optmatchobjects'
"hashed.distances"attributes; add a
attr(mout, "hashed.arcset"). c. Separately hash arc set and associated distances, ie the "data" piece of the ISM. Store a vector of two hashes, rather than one, under
attr(mout, "hashed.distance"). d. Don't bother hashing arc sets. Instead, when in question just compare 2 arc sets to see if they're identical. Continue hashing ISMs and bISMs and storing result under
attr(mout, "hashed.distance")`, just as before.
Of these I (BH) tend toward (c) -- but before embarking on any of this, we might see how much time it takes to compare rows
and cols
slots directly.