markmfredrickson edited this page Jul 13, 2011 · 17 revisions

This document lists places where we want to formalize conventions that have emerged in the optmatch codebase as classes, along with opportunities to use classes and methods to make optmatch easier for end users. See the s4 branch for work in progress.

Distance Specifications

The first step of bipartite matching is creating a representation of the distances between treatment and control units. The canonical representation in the literature is a control by treatment matrix of entries, some of which may have infinite value representing unmatchable pairs. We might replace that matrix by a sparse representation where only finite entries are stored. If we consider the problem as a network with edges weighted by distance, we might break the problem into connected components, each of which is one of the matrix representations. The question arises how to handle all of these representations with an eye towards efficiency of space and computation.

Distance matrices

The matrix class already exists and is used throughout optmatch. Finite entries indicate distances; infinite entries indicate non-matchable pairs. For small or dense problems (i.e. most entries are finite), it is hard to improve upon the basic matrix class.

For sparser problems (for example, stratifying on one or more blocking factors), a sparser representation is called for. The new S4 class InfinitySparseMatrix (ISM) represents these problems, and is the default return type of functions that create sparse descriptions (e.g. caliper). The class does not support the matrix algebra found in the SparseM or Matrix packages, but we suspect this will rarely be needed by people creating distance matrices. It does support manipulation methods such as subset, cbind, rbind, t, and element-wise arithmetic. Coercion from and to matrices is supported via as.InfinitySparseMatrix and as.matrix, respectively. These matrices support row and column names. Joining operations give precedence to names, so that cbind on two ISMs with the same row names in different orders aligns the rows correctly. The class inherits from numeric, so scalar operations work on the finite entries, as do functions like mean and sum that process vectors.

While subject to change, the internal representation of the ISM class is composed of the following slots:

  • .Data holds the finite entries. This is accessed by default for scalar operations, e.g. 2 * x.
  • cols is a vector of column ids the same length as .Data
  • rows likewise for row ids. Between cols and rows, we know the location of each item in .Data
  • rownames holds the names of the rows (optional)
  • colnames holds the names of the columns (optional)
  • dimension is a two element list of the number of rows and columns. Usually set automatically during creation.
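The slot layout above can be illustrated with a minimal base R sketch. It is not the InfinitySparseMatrix implementation: it uses a plain list in place of the S4 class, and the helper names to_sparse and to_dense are hypothetical. It only shows how the finite entries, their row and column ids, the names, and the dimension are enough to rebuild the dense matrix.

```r
# Hypothetical dense distance matrix: rows are controls, columns are treated units.
dense <- matrix(c(1.5, Inf, Inf,
                  Inf, 0.2, 3.0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("c1", "c2"), c("t1", "t2", "t3")))

# Keep only the finite entries, plus the row and column id of each one.
# The list elements mirror the slots described above (data stands in for .Data).
to_sparse <- function(m) {
  keep <- which(is.finite(m), arr.ind = TRUE)
  list(data = m[keep],
       rows = unname(keep[, "row"]),
       cols = unname(keep[, "col"]),
       rownames = rownames(m),
       colnames = colnames(m),
       dimension = dim(m))
}

# Rebuild the dense matrix: start from all Inf and fill in the stored entries.
to_dense <- function(s) {
  m <- matrix(Inf, nrow = s$dimension[1], ncol = s$dimension[2],
              dimnames = list(s$rownames, s$colnames))
  m[cbind(s$rows, s$cols)] <- s$data
  m
}

sparse <- to_sparse(dense)
roundtrip <- to_dense(sparse)
```

Here only three of the six entries are stored. The saving grows with the share of infinite entries, which is why the representation suits heavily stratified problems.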

For sparse representations we could in principle use classes from the SparseM or Matrix packages; however, our running impression is that their orientations are different enough from ours that it would be better to create our own class or classes. (We are not doing very heavy matrix algebra. It does not appear that SparseM supports row/column names, while Matrix does. These names are nice to have for later joining of the matching to the original data.)

The S3 optmatch.dlist class has been deprecated.

DistanceSpecification

DistanceSpecification is a "class union" (see Chambers (2008) chapter 9 for more details). This union formalizes the fact that either a matrix or InfinitySparseMatrix can serve as a distance specification. This class union acts like a normal class and can serve as the indicator for dispatch for methods or as a slot in an S4 object. If a class is part of the union, it must also support the following operations.

  • prepareMatching(x) turns a distance specification into a "canonical matching form." While not a formal class, the canonical form is a data.frame with 3 columns: control, treatment, distance.
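As a rough sketch of how a class union and a generic like prepareMatching fit together, the following uses only the methods package. The names SparseDist, DistSpec, and prepare_matching are hypothetical stand-ins chosen to avoid clashing with optmatch's own definitions, and dropping the infinite entries from the canonical form is an assumption of this sketch.

```r
library(methods)

# Hypothetical stand-in for a sparse distance class (not optmatch's own class).
setClass("SparseDist",
         representation(rows = "integer", cols = "integer",
                        rownames = "character", colnames = "character"),
         contains = "numeric")

# The union lets either representation be used wherever a distance is expected.
setClassUnion("DistSpec", c("matrix", "SparseDist"))

setGeneric("prepare_matching", function(x) standardGeneric("prepare_matching"))

# Dense method: list every finite entry as a (control, treatment, distance) row.
setMethod("prepare_matching", "matrix", function(x) {
  keep <- which(is.finite(x), arr.ind = TRUE)
  data.frame(control = rownames(x)[keep[, "row"]],
             treatment = colnames(x)[keep[, "col"]],
             distance = x[keep],
             stringsAsFactors = FALSE)
})

# Sparse method: the finite entries and their positions are already stored.
setMethod("prepare_matching", "SparseDist", function(x) {
  data.frame(control = x@rownames[x@rows],
             treatment = x@colnames[x@cols],
             distance = x@.Data,
             stringsAsFactors = FALSE)
})

# The same two finite distances, once as a matrix and once in sparse form.
m <- matrix(c(1.5, Inf, Inf, 0.2), nrow = 2,
            dimnames = list(c("c1", "c2"), c("t1", "t2")))
s <- new("SparseDist", c(1.5, 0.2), rows = c(1L, 2L), cols = c(1L, 2L),
         rownames = c("c1", "c2"), colnames = c("t1", "t2"))

canonical_dense <- prepare_matching(m)
canonical_sparse <- prepare_matching(s)
```

Both calls produce the same three-column data.frame, so downstream matching code only needs to understand the canonical form and not the individual representations.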

New Functions

exactMatch

exactMatch is a generic function for producing InfinitySparseMatrix objects representing stratified or exact matches. There are currently two methods.

  • exactMatch(B, Z) where B is a factor and Z is a two-level treatment indicator. B and Z must be the same length. Treatment-control pairs that share a level of B receive a zero in the resulting matrix; all other pairs get an infinite entry.
  • exactMatch(Z ~ B, [data = a.data.frame]). Like the previous method, except that it uses a formula specification and an optional data.frame that contains the vectors Z and B. Formulas of the form Z ~ B1 + B2 + B3 ... stratify on the interaction of all the blocking factors.
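The zero/infinity rule described above can be sketched in base R on a dense matrix. The helper exact_match_dense is hypothetical and is not how exactMatch is implemented (the real function returns an InfinitySparseMatrix); the control-by-treatment orientation follows the description earlier in this document and is an assumption here.

```r
# Hypothetical helper: a dense illustration of exact matching on a factor.
exact_match_dense <- function(B, Z) {
  stopifnot(length(B) == length(Z))
  Z <- as.logical(Z)
  ids <- if (is.null(names(Z))) as.character(seq_along(Z)) else names(Z)
  # TRUE where the control (row) and treated unit (column) share a level of B.
  same <- outer(as.character(B[!Z]), as.character(B[Z]), "==")
  m <- ifelse(same, 0, Inf)
  dimnames(m) <- list(ids[!Z], ids[Z])
  m
}

B <- factor(c("a", "a", "b", "b", "b"))
Z <- c(1, 0, 1, 0, 0)
em <- exact_match_dense(B, Z)

# Stratifying on several factors at once amounts to using their interaction.
B2 <- factor(c("x", "x", "x", "y", "x"))
em2 <- exact_match_dense(interaction(B, B2), Z)
```

In em, control 2 can only be matched to treated unit 1 (both in stratum a), while controls 4 and 5 can only be matched to treated unit 3. In em2, control 4 loses its only partner because it differs from unit 3 on B2.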

The results of exactMatch can be added to an existing distance specification or used as the excludes argument to mdist (see below for more details).

Deprecated Functions and API changes

mdist: excludes replaces structure.fmla

The mdist function formerly took an argument, structure.fmla, that created distances stratified by blocking factors. This argument has been replaced by a new form, mdist(..., excludes = aInfinitySparseMatrix), which allows more flexible limits on permitted matches, for example mdist(..., excludes = caliper(...)). The old call mdist(..., data = my.data, structure.fmla = Z ~ B) is equivalent to mdist(..., excludes = exactMatch(Z ~ B, data = my.data)).
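To see why a zero/infinity object works as an exclusion, consider this dense base R sketch. It does not call mdist, exactMatch, or caliper; the matrices and the width value are made up for illustration, and the assumption is simply that exclusions combine with distances by element-wise addition.

```r
# Hypothetical distances between two controls (rows) and two treated units (columns).
dist <- matrix(c(0.4, 2.1,
                 1.3, 0.7),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("c1", "c2"), c("t1", "t2")))

# A stratification-style exclusion: 0 for allowed pairs, Inf for forbidden ones.
strata_excl <- matrix(c(0, Inf,
                        Inf, 0),
                      nrow = 2, byrow = TRUE, dimnames = dimnames(dist))

# Adding the exclusion leaves allowed distances unchanged and makes the rest Inf.
stratified <- dist + strata_excl

# A caliper-style exclusion built from the distances themselves (width is arbitrary).
width <- 1.5
caliper_excl <- ifelse(dist <= width, 0, Inf)
calipered <- dist + caliper_excl
```

Because any finite number plus Inf is Inf, several exclusions can be stacked by repeated addition, which is what makes a single excludes argument more flexible than a dedicated stratification formula.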

The optmatch.dlist class

This class has been deprecated in favor of the InfinitySparseMatrix class. Users should not notice a difference if they do not manipulate the objects directly. mdist and caliper, as well as the matching functions, have been updated to work with the new objects directly.

pscore.dist and mahal.dist functions removed

Use the glm method and formula methods for mdist instead.

Logic operators

Logical comparisons on InfinitySparseMatrix objects have a slightly different interpretation than on optmatch.dlist objects. Since the InfinitySparseMatrix class descends from a numeric vector, logical operators return a logical vector (not a list of matrices of 1s and 0s, as was the case with optmatch.dlist objects). Here is an example illustrating the change, from tests/fullmatch.R:

Old

data(nuclearplants)
mhd2 <- mahal.dist(~date+cum.n, nuclearplants, pr~pt)
fullmatch(mhd2 < 1)

New

data(nuclearplants)
mhd2 <- mdist(pr ~ date + cum.n, data = nuclearplants, 
              exclusions = exactMatch(pr ~ pt, data = nuclearplants))
near <- mhd2 < 1   # compute the mask once, before any entries are overwritten
mhd2[near] <- 1
mhd2[!near] <- 0
fullmatch(mhd2)

If you are using logical operators, you are strongly encouraged to consider the caliper function instead.