Skip to content
benthestatistician edited this page Jun 14, 2011 · 17 revisions

This document attempts to list some of the places where we want to solidify conventions that have emerged in the optmatch codebase as formal classes or opportunities to use classes and methods to make optmatch easier for end users. See the s4 branch for work in progress.

Distance Specifications

The first step of bipartite matching is creating a representation of the distances between treatment and control units. The canonical representation in the literature is a control by treatment matrix of entries, some of which may have infinite value representing unmatchable pairs. We might replace that matrix by a sparse representation where only finite entries are stored. If we consider the problem as a network with edges weighted by distance, we might break the problem into connected components, each of which is one of the matrix representations. The question arises how to handle all of these representations with an eye towards efficiency of space and computation.

Classes

The matrix class already exists. For sparse representations we could in principle use classes from the SparseM or Matrix packages; however, our running impression is that their orientations are different enough from ours that it would be better to create our own class or classes. (We are not doing very heavy matrix algebra. It does not appear that SparseM supports row/column names, while Matrix does. These names are nice to have for later joining of the matching to the original data.)

At present (14 June 2011), our plan is to mimic the basics of SparseM's csr structure to in effect represent matching prohibitions and discrepancies in a 3 column data.frame: Column one is the control name, column two is the treatment name, and column three is the distance. This is potentially slightly more verbose than the SparseM representation, but would scale up better than a dense matrix for large, sparse problems. It would be more of a headache to work with when doing addition (e.g. have to look up the control and treatment id). This representation is what I am using for the output value of prepareMatching (see below).

The S3 style optmatch.dlist class could be replaced by an S4 class that descends from list or namedList.

Class Unions

One interesting tool at our disposal is R's "class union" (see Chambers (2008) chapter 9 for more details). A class union would allow us to mark any data type as a "DistanceSpecification," that is one of the representations above. From what I can tell, this is mostly useful composing a DistanceSpecification into another S4 object. Any of the allowable representations would could be put in the slot. The exact code might look something like (using the old S3 class name):

setClassUnion("DistanceSpecification", c("matrix", "matrix.csr", "optmatch.dlist")

Methods

To my knowledge, R has no concept similar to Java's interfaces or Clojure's protocols: sets of methods that implementing objects are required to have. We will have to reply on testing to make sure that all methods for the various distance specifications above are implemented for each concrete type. Where possible, we may be able to write generic methods that refer to the class union.

Preparing for matching

The Fortran code that performs the network flow algorithm takes as a list of nodes and list of edges as inputs. Right now, that activity is hard coded in fmatch(). I purpose a method prepareMatching for each of the supported types that returns these lists. Checking these lists for correctness will still be the responsibility of the original fmatch function. The current fmatch function is most of the implementation of prepareMatching.matrix.

Combining distances

These objects should be combinable within and across concrete types using cbind, rbind, etc. So far, we mostly have questions.

  1. To what degree can this be done with coercion from one format to another? We do not want to instantiate potentially large sparse representations.
  2. When do we want to treat missing entries as infinite and when do we want to throw errors?

Both matrix and matrix.csr support cbind and rbind within the same class. Trying to bind the two classes, however, fails.

Calipers and arithmetic

Again, more questions than answers so far.

  1. Like the combining methods, how do we work across classes?
  2. Can we use dispatch on both arguments to make the code cleaner? E.g. if one argument is sparse (either the first or second), we want to be careful about expanding the data.

Adding a finite matrix and a matrix.csr is successful. However, matrices with infinite entries fail. We could overcome this issue by turning all infinite entires to zero and adding. This of course begs the next question, how to make zeroness sticky -- that is 1 + 0 = 0, for the purposes of the sparse matrix. We would need to use the original infinite entries in both the matrices to create a flag of values to remove.

For calipers, we could replace the + notation with * notation. e.g. dist1 * caliper(...). If the caliper returned a 0/1 matrix of valid entries, the operation would be successful. This would probably require changing the handling of non-matches for all representations. Plain matrices would also have to represent non-matches as zero (e.g. if dist1 above was a large but dense matrix)

Summaries and statistics

Any functions that express summary information should turned into generic functions with methods for the different types.

Lazy Evaluation

Can we use lazy evaluation and short circuiting to avoid computing distances that are later removed by a caliper? For example, if a caliper sets an entry to infinity based on an exact matching requirement, can we avoid spending computations on computing distances that will eventually be turned into infinity?

Matches

Matches are currently a S3 style factor class with some required attributes. We should make this more formal with an S4 definition that has slots for the old attributes.