-
Notifications
You must be signed in to change notification settings - Fork 14
S4 Migration
This document attempts to list some of the places where we want to solidify conventions that have emerged in the optmatch codebase as formal classes or opportunities to use classes and methods to make optmatch easier for end users. See the s4
branch for work in progress.
The first step of bipartite matching is creating a representation of the distances between treatment and control units. The canonical representation in the literature is a control by treatment matrix of entires, some of which may have infinite value representing unmatchable pairs. We might replace that matrix by a sparse representation where only finite entries are stored. If we consider the problem as a network with edges weighted by distance, we might break the problem into connected components, each of which is one of the matrix representations. The question arises how to handle all of these representations with an eye towards efficiency of space and computation.
The matrix
class already exists.
For sparse matrices, we can use the SparseM and/or package, though we may need to treat infinite entires as zero for the purposes of compressing the matrix. Real zero values could be replaced with machine tolerance values. As an alternative to a sparse matrix, we could create a "canonical" representation of the matching problem that is a 3 column data.frame. Column one is the control name, column two is the treatment name, and column three is the distance. This is potentially slightly more verbose than the SparseM representation, but would scale up better than a dense matrix for large, sparse problems. It would be more of a headache to work with when doing addition (e.g. have to look up the control and treatment id). This representation is what I am using for the output value of prepareMatching
(see below).
It seems like good politics and would be beneficial to our users to support both SparseM and Matrix. We are not doing very heavy matrix algebra, so we can probably write common methods for both packages, perhaps abstracting over a few of the details. At a minimum, it does not appear that SparseM supports row/column names, while Matrix does. These names are nice to have for later joining of the matching to the original data.
The S3 style optmatch.dlist
class could be replaced by an S4 class that descends from list
or namedList
.
One interesting tool at our disposal is R's "class union" (see Chambers (2008) chapter 9 for more details). A class union would allow us to mark any data type as a "DistanceSpecification," that is one of the representations above. From what I can tell, this is mostly useful composing a DistanceSpecification into another S4 object. Any of the allowable representations would could be put in the slot. The exact code might look something like (using the old S3 class name):
setClassUnion("DistanceSpecification", c("matrix", "matrix.csr", "optmatch.dlist")
To my knowledge, R has no concept similar to Java's interfaces or Clojure's protocols: sets of methods that implementing objects are required to have. We will have to reply on testing to make sure that all methods for the various distance specifications above are implemented for each concrete type. Where possible, we may be able to write generic methods that refer to the class union.
The Fortran code that performs the network flow algorithm takes as a list of nodes and list of edges as inputs. Right now, that activity is hard coded in fmatch()
. I purpose a method prepareMatching
for each of the supported types that returns these lists. Checking these lists for correctness will still be the responsibility of the original fmatch function. The current fmatch function is most of the implementation of prepareMatching.matrix
.
These objects should be combinable within and across concrete types using cbind, rbind, etc. So far, we mostly have questions.
- To what degree can this be done with coercion from one format to another? We do not want to insatiate potentially large sparse representations.
- When do we want to treat missing entries as infinite and when do we want to throw errors?
Again, more questions than answers so far.
- Like the combining methods, how do we work across classes?
- Can we use dispatch on both arguments to make the code cleaner? E.g. if one argument is sparse (either the first or second), we want to be careful about expanding the data.
Any functions that express summary information should turned into generic functions with methods for the different types.
Can we use lazy evaluation and short circuiting to avoid computing distances that are later removed by a caliper? For example, if a caliper sets an entry to infinity based on an exact matching requirement, can we avoid spending computations on computing distances that will eventually be turned into infinity?
Matches are currently a S3 style factor class with some required attributes. We should make this more formal with an S4 definition that has slots for the old attributes.