S4 Migration
This document lists places where we want to formalize, as classes, conventions that have emerged in the optmatch codebase, along with opportunities to use classes and methods to make optmatch easier for end users. See the `s4` branch for work in progress.
The first step of bipartite matching is creating a representation of the distances between treatment and control units. The canonical representation in the literature is a control by treatment matrix of entries, some of which may have infinite value representing unmatchable pairs. We might replace that matrix by a sparse representation where only finite entries are stored. If we consider the problem as a network with edges weighted by distance, we might break the problem into connected components, each of which is one of the matrix representations. The question arises how to handle all of these representations with an eye towards efficiency of space and computation.
The `matrix` class already exists and is used throughout optmatch. Finite entries indicate distance; infinite entries indicate non-matchable pairs. For small or dense problems (i.e., most entries are finite), it is hard to improve upon the basic `matrix` class.
For sparser problems (for example, stratifying on one or more blocking factors), a sparse representation is called for. The new S4 class `InfinitySparseMatrix` (ISM) represents these problems and is the default return type from functions that create sparse descriptions (e.g. `caliper`). The class does not support the matrix algebra found in the `SparseM` or `Matrix` packages, but we suspect this will not be readily needed by people creating distance matrices. It does support methods for manipulation such as `subset`, `cbind`, `rbind`, `t`, and element-wise arithmetic. Coercion from and to matrices is supported via `as.InfinitySparseMatrix` and `as.matrix`, respectively. These matrices support row and column names. Joining operations give precedence to names, so two ISMs with the same row names in a different order will produce the appropriately `cbind`ed matrix. The class inherits from `numeric`, so scalar operations work on the finite entries, as do functions like `mean` and `sum` that process vectors.
While subject to change, the internal representation of the ISM class is composed of the following slots:
- `.Data` holds the finite entries. This is accessed by default for scalar operations, e.g. `2 * x`.
- `cols` is a vector of column ids the same length as `.Data`.
- `rows` is likewise a vector of row ids. Between `cols` and `rows`, we know the location of each item in `.Data`.
- `rownames` holds the names of the rows (optional).
- `colnames` holds the names of the columns (optional).
- `dimension` is a two element list of the number of rows and columns. Usually set automatically during creation.
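The slot layout above can be sketched in base R with a plain list. This is only an illustration of the idea, not the optmatch implementation: the list stands in for the S4 object, and the helper `to_dense` is a hypothetical name introduced here.

```r
# Illustration only: a plain list mimicking the slots described above.
# A small dense distance matrix; Inf marks unmatchable pairs.
dense <- matrix(c(1.5, Inf, Inf,
                  Inf, 0.2, 3.0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Keep only the finite entries, together with their row and column ids.
finite <- which(is.finite(dense), arr.ind = TRUE)
sparse <- list(
  .Data     = dense[finite],
  rows      = unname(finite[, "row"]),
  cols      = unname(finite[, "col"]),
  rownames  = rownames(dense),
  colnames  = colnames(dense),
  dimension = dim(dense)
)

# Hypothetical helper: rebuild the dense matrix from the sparse pieces.
to_dense <- function(s) {
  m <- matrix(Inf, nrow = s$dimension[1], ncol = s$dimension[2],
              dimnames = list(s$rownames, s$colnames))
  m[cbind(s$rows, s$cols)] <- s$.Data
  m
}

# Scalar operations only need to touch the stored finite entries.
doubled <- sparse
doubled$.Data <- 2 * sparse$.Data
```

Only three of the six entries are stored here; the infinite entries are implied by their absence, which is where the space saving comes from in problems with many unmatchable pairs.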
For sparse representations we could in principle use classes from the SparseM or Matrix packages; however, our running impression is that their orientations are different enough from ours that it would be better to create our own class or classes. (We are not doing very heavy matrix algebra. It does not appear that SparseM supports row/column names, while Matrix does. These names are nice to have for later joining of the matching to the original data.)
The S3 `optmatch.dlist` class has been deprecated.
`DistanceSpecification` is a "class union" (see Chambers (2008), chapter 9, for more details). This union formalizes the fact that either a `matrix` or an `InfinitySparseMatrix` can serve as a distance specification. The class union acts like a normal class and can serve as the indicator for method dispatch or as a slot in an S4 object. If a class is part of the union, it must also support the following operations.
- `prepareMatching(x)` turns a distance specification into a "canonical matching form." While not a formal class, the canonical form is a `data.frame` with 3 columns: `control`, `treatment`, `distance`.
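As a rough picture of the canonical matching form, the base R sketch below converts a dense distance matrix into the three-column `data.frame`. The helper name `to_canonical` is hypothetical and is not the `prepareMatching` method itself. It assumes that rows are treatment units, that columns are control units, and that infinite entries are dropped; none of these is stated above.

```r
# Hypothetical helper, base R only; not the optmatch prepareMatching method.
# Assumptions: rows = treatment units, columns = control units,
# and unmatchable (infinite) pairs are omitted from the result.
to_canonical <- function(m) {
  finite <- which(is.finite(m), arr.ind = TRUE)
  data.frame(
    control   = colnames(m)[finite[, "col"]],
    treatment = rownames(m)[finite[, "row"]],
    distance  = m[finite],
    stringsAsFactors = FALSE
  )
}

m <- matrix(c(0.5, Inf,
              2.0, 1.0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("c1", "c2")))
canonical <- to_canonical(m)
```

Each row of `canonical` is one allowed treatment-control pair with its distance, which is the edge-list view of the matching problem.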
`exactMatch` is a generic function for producing `InfinitySparseMatrix` objects representing stratified or exact matches. There are currently two methods.
- `exactMatch(B, Z)`, where `B` is a factor and `Z` is a two level treatment indicator. `B` and `Z` must be the same length. Treatment-control pairs that have the same level in `B` receive a zero in the resulting matrix; otherwise the pair gets an infinite entry.
- `exactMatch(Z ~ B, [data = a.data.frame])`. Like the previous method, except that it uses a formula specification and an optional `data.frame` that contains the vectors `Z` and `B`. Formulas of the form `Z ~ B1 + B2 + B3 ...` stratify on the interaction of all the blocking factors.
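The zero/infinity rule described for `exactMatch(B, Z)` can be sketched with a dense matrix in base R. The function `exact_match_dense` is a hypothetical stand-in, not the optmatch generic (which returns an `InfinitySparseMatrix`). It assumes `Z` is coded 0/1 with 1 meaning treatment, and that rows are treatment units and columns are control units.

```r
# Hypothetical dense stand-in for exactMatch(B, Z); base R only.
# Assumptions: Z is 0/1 with 1 = treatment; rows = treatment, columns = control.
exact_match_dense <- function(B, Z) {
  stopifnot(length(B) == length(Z))
  B <- as.character(B)
  treated <- which(Z == 1)
  control <- which(Z == 0)
  # TRUE where a treated unit and a control unit share a block level.
  same_block <- outer(B[treated], B[control], "==")
  m <- ifelse(same_block, 0, Inf)
  dimnames(m) <- list(as.character(treated), as.character(control))
  m
}

B <- factor(c("a", "a", "b", "b", "b"))
Z <- c(1, 0, 1, 0, 0)
em <- exact_match_dense(B, Z)
```

For several blocking factors, the same sketch applies to their interaction, e.g. `exact_match_dense(interaction(B1, B2), Z)`, mirroring what the formula `Z ~ B1 + B2` is described as doing.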
The results of `exactMatch` can be added to an existing distance specification or used as the `excludes` argument to `mdist` (see below for more details).
The `mdist` function formerly took an argument `structure.fmla` that allowed creating distances stratified by blocking factors. This argument has been replaced by a new format: `mdist(..., excludes = aInfinitySparseMatrix)`. This allows more flexible limits on allowed matched differences, for example `mdist(..., excludes = caliper(...))`. The old behavior of `mdist(..., data = my.data, structure.fmla = Z ~ B)` is equivalent to `mdist(..., excludes = exactMatch(Z ~ B, data = my.data))`.
The `optmatch.dlist` class has been deprecated in favor of the `InfinitySparseMatrix` class. Users should not notice a difference if they do not manipulate the objects directly. `mdist` and `caliper`, as well as the matching functions, have been updated to work with the new objects directly.
Use the `glm` and `formula` methods for `mdist` instead.
Logical statements on `InfinitySparseMatrix` objects have a slightly different interpretation than on `optmatch.dlist` objects. Since the `InfinitySparseMatrix` class descends from a numeric vector, logical operators result in a logical vector (not a list of matrices of 1s and 0s, as was the case with `optmatch.dlist` objects). Here is an example illustrating the change, from `tests/fullmatch.R`:
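Before, with `optmatch.dlist` objects:

```r
data(nuclearplants)
mhd2 <- mahal.dist(~date+cum.n, nuclearplants, pr~pt)
fullmatch(mhd2 < 1)
```

Now, with `InfinitySparseMatrix` objects:

```r
data(nuclearplants)
mhd2 <- mdist(pr ~ date + cum.n, data = nuclearplants,
              exclusions = exactMatch(pr ~ pt, data = nuclearplants))
# Record which entries are below 1 before overwriting anything; recoding with
# two successive comparisons against mhd2 would turn every entry into 0.
below <- mhd2 < 1
mhd2[below] <- 1
mhd2[!below] <- 0
fullmatch(mhd2)
```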
If you are using logical operators, you are strongly encouraged to consider the `caliper` function instead.
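To show why a caliper-style restriction is usually a better fit than recoding distances by hand, here is a base R sketch on a dense matrix. The helper `caliper_dense` is hypothetical, not the optmatch `caliper` function; it assumes the restriction is expressed as 0 for pairs within the width and `Inf` otherwise, so that adding it to the distances forbids wide pairs while leaving the remaining distances untouched.

```r
# Hypothetical dense stand-in for a caliper; base R only.
# Assumption: 0 where the distance is within `width`, Inf otherwise.
caliper_dense <- function(d, width) ifelse(d <= width, 0, Inf)

d <- matrix(c(0.4, 1.7,
              Inf, 0.9),
            nrow = 2,
            dimnames = list(c("t1", "t2"), c("c1", "c2")))

# Adding the restriction keeps the original distances for allowed pairs
# and makes every pair beyond the width unmatchable.
restricted <- d + caliper_dense(d, width = 1)
```

Unlike the 0/1 recoding, the allowed pairs keep their real distances, so the matcher can still prefer closer pairs among those that pass the restriction.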
The `subclass.indices` argument, formerly marked deprecated, has been removed.