Skip to content
Josh Errickson edited this page Dec 26, 2015 · 2 revisions

ISMs (and mISMs) are a little big and a little slow. Here's a page for considering elaborations on the class.

Jist of proposal(s)

Three interrelated ideas:

  1. Create class extension or slot within existing classes to indicate ISMs whose entries are stored in row-major order.
  2. For ISMs/bISMs with entries in row-major order, instead of storing a rows slot with lots of repeats, eg. c(1,1,2,2,2,4), store an integer vector of length equal to the number of rows in the ISM (stored in dim slot) that simply records how many finite entries there are in each row -- c(2,3,0,1) for the example just given. (If you need to regenerate the long form rows vector, pass this one as the second argument, "shortrows", in rep(1:theISM@dim[1], shortrows).)
    • To convert from repeat-version to count-version, tabulate will be useful, e.g. tabulate(theISM@rows, nbins = theISM@dim[1]). (The use of the nbins argument covers cases where the bottom row(s) may be unmatchable.)
  3. Add structure to the ISM in order to quickly determine whether 2 ISMs have finite entries at the same matrix positions, stored in the same order (so that adding the ISMs is just adding the entries).

More details

row-major ordered ISMs

@josherrickson's functions sort.BlockedInfinitySparseMatrix and sort.InfinitySparseMatrix, introduced in 06cfe7e, could be adapted to convert ISMs and bISMs into elaborated structures that follow row-major order and condense the representation of the rows. When we found ourselves combining 2 of this new type of structure, we'd first check to see whether they have the same finite entries at the same matrix positions. If so, then the requested arithmetic is just done on both ISMs' data. If not, then we'd convert to long format and use existing ISM/bISM routines to combine them.

Quick determination of whether ISMs have finite entries at same positions

E.g., by hashing those positions. Presently we hash ISMs elsewhere in optmatch, without using the result for very much -- we might adapt and/or repurpose that work.

Some of the options: a. Just hash the arc set, not the distances. Ditch optmatch objects' "hashed.distances" attributes. Associate w/ optmatch object under a new name, eg attr(mout, "hashed.arcset") b. Hash the arc set as well as the ISM as a whole. Keep optmatchobjects'"hashed.distances"attributes; add aattr(mout, "hashed.arcset"). c. Separately hash arc set and associated distances, ie the "data" piece of the ISM. Store a vector of two hashes, rather than one, under attr(mout, "hashed.distance"). d. Don't bother hashing arc sets. Instead, when in question just compare 2 arc sets to see if they're identical. Continue hashing ISMs and bISMs and storing result under attr(mout, "hashed.distance")`, just as before.

Of these I (BH) tend toward (c) -- but before embarking on any of this, we might see how much time it takes to compare rows and cols slots directly.