Implementing non-standard matching distances

Ben edited this page Aug 2, 2017 · 2 revisions

Arithmetic transformations or combinations of built-in distances

Suppose you wanted squared Mahalanobis distance. Something like this would give you matching on the ordinary Mahalanobis distance ---

pairmatch(Z~X1+X2+X3, data = my_data_frame)

--- but there's no built-in method of computing squared Mahalanobis.

The answer is

  1. separate distance formation from matching (if you haven't done this already), and
  2. square the distance once you've formed it.

E.g.,

d1 <- match_on(Z~X1+X2+X3, data = my_data_frame, method = "mahalanobis")
pairmatch(d1^2, data=my_data_frame)

Arithmetic operations on pairs of distances are also supported. For example, with d1 as defined above, you might give added weight to X1 via

d0 <- match_on(Z~X1, data = my_data_frame)
pairmatch(d1 + .1*d0, data = my_data_frame)
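
To see what these element-wise operations do, here is a minimal base-R sketch that uses plain matrices as stand-ins for the distance objects match_on returns. The toy values, the names X, treated and controls, and the use of the overall sample covariance are assumptions made for illustration; optmatch's own Mahalanobis computation may scale the variables differently.

```r
# Toy covariates (made-up values): t1, t2 are treated; c1..c3 are controls
X <- matrix(c(1, 2, 4, 7, 5,
              10, 30, 20, 25, 15),
            ncol = 2,
            dimnames = list(c("t1", "t2", "c1", "c2", "c3"), c("X1", "X2")))
treated  <- c("t1", "t2")
controls <- c("c1", "c2", "c3")
S <- cov(X)  # assumption: overall sample covariance used for scaling

# stats::mahalanobis() returns *squared* distances, so take the square root
# to get the ordinary Mahalanobis distance (treated in rows, controls in columns)
d1 <- t(sapply(treated, function(i) {
  sqrt(mahalanobis(X[controls, , drop = FALSE], X[i, ], S))
}))

# Absolute difference on X1 alone, in the same treated-by-control layout
d0 <- abs(outer(X[treated, "X1"], X[controls, "X1"], "-"))

d_squared  <- d1^2            # squaring is applied entry by entry
d_weighted <- d1 + 0.1 * d0   # extra penalty for disagreement on X1
```

Squaring leaves the ranking of controls for any one treated unit unchanged, but it penalizes large distances more heavily when distances are summed over a whole match.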

Distances that can't be assembled from built-in distances by arithmetic operations

Or, how to write your own compute_ method.

The formula method of the match_on generic function provides users a way to specify their matching problems using a standard R formula and an optional distance calculation method. For example:

match_on(Z ~ X1 + X2 + X3, data = my.data.frame, method = "mahalanobis")

This returns a treatment-by-control matrix of Mahalanobis distances (Euclidean distance scaled by the covariance of the variables) between treated and control units on the variables X1, X2 and X3. Implicitly, this call uses the built-in function optmatch:::compute_mahalanobis, which we call a "compute_" method.

Users can specify other compute_ methods, such as the "euclidean" and "rank_mahalanobis" built-in options. To provide your own compute_ method, you should implement a function with the following signature:

f(index, data, z) -> vector of distances

Here z is the treatment indicator specified in the user's formula, and data is the model frame created from the formula, built from either the calling environment or the data supplied by the user. The index argument is a two-column matrix: the first column gives the row id or name of the treated unit in the data argument, and the second column gives the row id or name of the control unit. There is one row in index for each valid comparison (comparisons ruled out by exactMatch or caliper restrictions are omitted). For each of these pairs, the function should return the distance between the treated unit and the control unit.
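
To make the shape of these arguments concrete, here is a small base-R sketch with hand-built stand-ins for index, data and z. The values are made up, compute_manhattan_sketch is a hypothetical name, and the sketch assumes that index holds row names of data; it is not output from optmatch itself.

```r
# Toy stand-ins for the three arguments (hypothetical values)
data <- matrix(c(1, 2, 4, 7,
                 10, 30, 20, 25),
               ncol = 2,
               dimnames = list(c("a", "b", "c", "d"), c("X1", "X2")))
z <- c(a = TRUE, b = TRUE, c = FALSE, d = FALSE)

# One row per permitted treated-control comparison:
# column 1 names the treated unit, column 2 the control unit
index <- as.matrix(expand.grid(treated = names(z)[z],
                               control = names(z)[!z],
                               stringsAsFactors = FALSE))

# Hypothetical compute function: sum of absolute differences (Manhattan)
compute_manhattan_sketch <- function(index, data, z) {
  treated_rows <- data[index[, 1], , drop = FALSE]
  control_rows <- data[index[, 2], , drop = FALSE]
  rowSums(abs(treated_rows - control_rows))
}

d <- compute_manhattan_sketch(index, data, z)
# One distance per row of index, for (a,c), (b,c), (a,d), (b,d): 13 12 21 10
print(unname(d))
```

The key point of the contract is that the returned vector has exactly one entry per row of index, in the same order.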

To illustrate, here is a compute function that generates the maximum absolute difference on any variable specified in the formula:

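compute_maximum <- function(index, data, z) {
    # Each row of index is one (treated, control) pair of row ids in data
    apply(index, 1, function(tcpair) {
      # Largest absolute difference across all variables for this pair
      max(abs(data[tcpair[1], ] - data[tcpair[2], ]))
    })
}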
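
Looping over pairs one at a time can be slow when index has many rows. As a hedged alternative, the sketch below subtracts whole blocks of rows at once and checks that it agrees with the row-wise version on toy inputs. The name compute_maximum_vec and the toy data are hypothetical, the row-wise function is repeated so that the snippet runs on its own, and the sketch assumes data is a numeric matrix whose rows can be selected by the entries of index.

```r
# Row-wise version, repeated from the text so the snippet is self-contained
compute_maximum <- function(index, data, z) {
  apply(index, 1, function(tcpair) {
    max(abs(data[tcpair[1], ] - data[tcpair[2], ]))
  })
}

# Hypothetical block-wise variant: one matrix subtraction for all pairs
compute_maximum_vec <- function(index, data, z) {
  diffs <- abs(data[index[, 1], , drop = FALSE] -
               data[index[, 2], , drop = FALSE])
  unname(apply(diffs, 1, max))
}

# Toy inputs (made-up values): a, b treated; c, d control
data <- matrix(c(1, 2, 4, 7,
                 10, 30, 20, 25),
               ncol = 2,
               dimnames = list(c("a", "b", "c", "d"), c("X1", "X2")))
z <- c(a = TRUE, b = TRUE, c = FALSE, d = FALSE)
index <- cbind(treated = c("a", "b", "a", "b"),
               control = c("c", "c", "d", "d"))

row_wise   <- compute_maximum(index, data, z)
block_wise <- compute_maximum_vec(index, data, z)
# Both give 10 10 15 5 for the pairs (a,c), (b,c), (a,d), (b,d)
```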

In this example, the function does not require the use of the z argument, but it can be helpful when the function needs to compute treated and control standard deviations, for example.
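
As an illustration of a function that does use z, the following sketch scales each variable by a pooled standard deviation before taking Euclidean distances. The name compute_pooled_sd_sketch, the toy data and the particular pooling rule (the average of the treated and control variances) are assumptions for illustration; the sketch also assumes that z is in the same order as the rows of data and that data is a numeric matrix.

```r
# Hypothetical compute function: Euclidean distance on variables scaled
# by a pooled SD computed from the treated and control groups
compute_pooled_sd_sketch <- function(index, data, z) {
  z <- as.logical(z)
  # Pooled SD per variable: sqrt of the mean of the two group variances
  pooled_sd <- sqrt((apply(data[z, , drop = FALSE], 2, var) +
                     apply(data[!z, , drop = FALSE], 2, var)) / 2)
  scaled <- sweep(data, 2, pooled_sd, "/")
  diffs <- scaled[index[, 1], , drop = FALSE] -
           scaled[index[, 2], , drop = FALSE]
  unname(sqrt(rowSums(diffs^2)))
}

# Toy inputs (made-up values): a, b treated; c, d control
data <- matrix(c(1, 2, 4, 7,
                 10, 30, 20, 25),
               ncol = 2,
               dimnames = list(c("a", "b", "c", "d"), c("X1", "X2")))
z <- c(a = TRUE, b = TRUE, c = FALSE, d = FALSE)
index <- cbind(treated = c("a", "b", "a", "b"),
               control = c("c", "c", "d", "d"))

d_scaled <- compute_pooled_sd_sketch(index, data, z)
```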

To use your own distance function when computing distances, call match_on, passing your function f as the method argument:

mydist <- match_on(z ~ X1 + X2, data = mydata, method = compute_maximum)

Because the ... arguments to pairmatch() and fullmatch() are passed down to match_on, you can also write, e.g.,

mypairs <- pairmatch(z ~ X1 + X2, data = mydata, method = compute_maximum)

(skipping the intermediate step of explicitly creating and storing mydist).