Implementing non standard matching distances
Suppose you want to match on squared Mahalanobis distance. The following call matches on the ordinary Mahalanobis distance:
pairmatch(Z~X1+X2+X3, data = my_data_frame)
However, there is no built-in method for computing squared Mahalanobis distance.
The answer is
- separate distance formation from matching (if you haven't done this already), and
- square the distance once you've formed it.
E.g.,
d1 <- match_on(Z~X1+X2+X3, data = my_data_frame, method = "mahalanobis")
pairmatch(d1^2, data=my_data_frame)
Arithmetic operations on pairs of distances are also supported. For example, to give added weight to X1:
d0 <- match_on(Z~X1, data = my_data_frame)
pairmatch(d1 + .1*d0, data = my_data_frame)
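As background on the squared distance above: assuming the usual definition, the Mahalanobis distance between two covariate vectors is the square root of a quadratic form in their difference, so squaring d1 recovers that quadratic form. The base R sketch below checks this on simulated data. The plain sample covariance used here is an assumption for illustration and need not equal the covariance estimate optmatch uses internally.

```r
# Simulated covariates (hypothetical values, for illustration only).
set.seed(1)
X <- cbind(X1 = rnorm(20), X2 = rnorm(20), X3 = rnorm(20))

# Assumption: scale by the plain sample covariance of the covariates.
S <- cov(X)

# Treat row 1 as a treated unit and row 2 as a control unit.
diff_tc <- X[1, ] - X[2, ]

# Mahalanobis distance: square root of the quadratic form diff' S^{-1} diff.
d_maha <- sqrt(drop(t(diff_tc) %*% solve(S) %*% diff_tc))

# stats::mahalanobis() returns the *squared* distance, so take the root
# before comparing it with d_maha.
d_check <- sqrt(mahalanobis(X[1, ], center = X[2, ], cov = S))
```

Here d_maha and d_check agree up to floating-point error, and d_maha^2 is the quantity that mahalanobis() reports directly.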
Or, how to write your own compute_ method.
The formula method of the match_on generic function lets users specify their matching problems using a standard R formula and an optional distance calculation method. For example:
match_on(Z ~ X1 + X2 + X3, data = my.data.frame, method = "mahalanobis")
This produces a treatment-by-control matrix of Mahalanobis distances (Euclidean distance scaled by the covariances of the dimensions) between treated and control units on the variables X1, X2 and X3. Implicitly, this makes use of the built-in function optmatch:::compute_mahalanobis, which we call a "compute_" method.
Users can specify other compute_ methods, such as the "euclidean" and "rank_mahalanobis" built-in options. To provide your own compute_ method, implement a function with the following signature:
f(index, data, z) -> vector of distances
Here z is the treatment indicator specified in the user's formula, and data is the model frame created from the formula, with variables taken either from the environment or from data supplied by the user. The index argument is a two-column matrix: the first column gives the row id or name of the treated unit in data, and the second column gives the row id or name of the control unit. There is one row in index for each valid comparison (comparisons prohibited by, e.g., exactMatch problems or calipers are left out). For each of these pairs, the function should return the distance between the treated unit and the control unit.
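To make the shape of index concrete, here is a minimal base R sketch that builds such a matrix by hand for a toy problem. The unit names, the toy values, and the use of expand.grid are assumptions for illustration; the row ordering that match_on actually produces may differ.

```r
# Toy data: 2 treated and 3 control units, with row names as unit ids
# (hypothetical values).
data <- cbind(X1 = c(1, 4, 2, 5, 3), X2 = c(0, 1, 1, 3, 2))
rownames(data) <- c("t1", "t2", "c1", "c2", "c3")
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# One row per treated-control pair:
# column 1 = treated unit name, column 2 = control unit name.
index <- as.matrix(expand.grid(treated = rownames(data)[z],
                               control = rownames(data)[!z],
                               stringsAsFactors = FALSE))

# A prohibited comparison (say t1 with c3) simply has no row in index.
prohibited <- index[, "treated"] == "t1" & index[, "control"] == "c3"
index_restricted <- index[!prohibited, , drop = FALSE]
```

With all comparisons allowed, index has 2 x 3 = 6 rows; after dropping the prohibited pair, index_restricted has 5, and a compute_ function would return a vector of that length.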
To illustrate, here is a compute function that generates the maximum absolute difference on any variable specified in the formula:
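compute_maximum <- function(index, data, z) {
  # For each treated-control pair (one row of index) ...
  apply(index, 1, function(tcpair) {
    # ... take the largest absolute difference across all variables.
    max(abs(data[tcpair[1], ] - data[tcpair[2], ]))
  })
}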
In this example, the function does not require the use of the z argument, but it can be helpful when the function needs to compute treated and control standard deviations, for example.
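As a sketch of a function that does use z, the variant below divides each difference by a pooled standard deviation computed from the treated and control groups before taking the maximum. The name compute_scaled_max, the pooling formula, and the toy data are all assumptions for illustration, not part of optmatch.

```r
# Hypothetical compute_-style function: maximum absolute difference after
# scaling each variable by a pooled treated/control standard deviation.
compute_scaled_max <- function(index, data, z) {
  data <- as.matrix(data)
  z <- as.logical(z)
  sd_t <- apply(data[z, , drop = FALSE], 2, sd)    # treated-group SDs
  sd_c <- apply(data[!z, , drop = FALSE], 2, sd)   # control-group SDs
  pooled <- sqrt((sd_t^2 + sd_c^2) / 2)
  apply(index, 1, function(tcpair) {
    max(abs(data[tcpair[1], ] - data[tcpair[2], ]) / pooled)
  })
}

# Toy inputs built by hand (hypothetical values) to exercise the function.
data <- cbind(X1 = c(1, 4, 2, 5, 3), X2 = c(0, 1, 1, 3, 2))
rownames(data) <- c("t1", "t2", "c1", "c2", "c3")
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
index <- as.matrix(expand.grid(treated = rownames(data)[z],
                               control = rownames(data)[!z],
                               stringsAsFactors = FALSE))

d <- compute_scaled_max(index, data, z)
```

The function returns one distance per row of index, in the same order, which is the contract described for compute_ methods above.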
To include your own distance function when computing distances, call match_on, passing your function f as the method argument:
mydist <- match_on(z ~ X1 + X2, data = mydata, method = compute_maximum)
Because the ... arguments to pairmatch() and fullmatch() are passed down to match_on, you can also do, e.g.,
mypairs <- pairmatch(z ~ X1 + X2, data = mydata, method = compute_maximum)
(skipping the intermediate step of explicitly creating and storing mydist).