-
Notifications
You must be signed in to change notification settings - Fork 12
/
32-Spatially-Continuous-Data-II.Rmd
421 lines (323 loc) · 21.4 KB
/
32-Spatially-Continuous-Data-II.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
---
title: "Spatially Continuous Data II"
output: html_notebook
---
# Spatially Continuous Data II
*NOTE*: You can download the source files for this book from [here](https://github.com/paezha/Spatial-Statistics-Course). The source files are in the format of R Notebooks. Notebooks are pretty neat, because the allow you execute code within the notebook, so that you can work interactively with the notes.
Previously, you learned about the analysis of area data. Starting with this practice, you will be introduced to another type of spatial data: continuous data, also called fields.
If you wish to work interactively with this chapter you will need the following:
* An R markdown notebook version of this document (the source file).
* A package called `geog4ga3`.
## Learning objectives
In the previous practice you were introduced to the concept of fields/spatially continuous data. Three different approaches were discussed that can be used to convert a set of observations of a field at discrete locations into a surface, namely tile-based approaches, inverse distance weighting (IDW), and $k$-point means. In this practice, you will learn:
1. About intervals of confidence for predictions.
2. Using trend surface analysis as an interpolation tool.
3. The difference between accuracy and precision in interpolation.
## Suggested readings
- Bailey TC and Gatrell AC [-@Bailey1995] Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
- Bivand RS, Pebesma E, and Gomez-Rubio V [-@Bivand2008] Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
- Brunsdon C and Comber L [-@Brunsdon2015R] An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
- Isaaks EH and Srivastava RM [-@Isaaks1989applied] An Introduction to Applied Geostatistics, **CHAPTERS**. Oxford University Press: Oxford.
- O'Sullivan D and Unwin D [-@Osullivan2010] Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.
## Preliminaries
As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in `R` to clear the workspace is `rm` (for "remove"), followed by a list of items to be removed. To clear the workspace from _all_ objects, do the following:
```{r}
rm(list = ls())
```
Note that `ls()` lists all objects currently on the workspace.
Load the libraries you will use in this activity:
```{r message = FALSE, warning=FALSE}
library(tidyverse)
library(spdep)
library(plotly)
#library(deldir)
library(spatstat)
library(geog4ga3)
```
Begin by loading the data file that we will use in this chapter:
```{r}
data("Walker_Lake")
```
You can verify the contents of the dataframe:
```{r}
summary(Walker_Lake)
```
We have already met this data set before: it contains a set of coordinates `X` and `Y` (units are meters; the origin is false), as well as two quantitative variables `V` and `U` (notice that there are missing observations in `U`), and a factor `T`.
## Uncertainty in the predictions
A common task in the analysis of spatially continuous data is to estimate the value of a variable at a location where it was not measured - or in other words, to spatially interpolate the variable. In Chapter \@ref(spatially-continuous-data-i), we introduced three methods for spatial interpolation based on a sample of observations.
The three algorithms that was saw before (i.e., Voronoi polygons, IDW, and $k$-point means) accomplish the task of providing spatial estimates. The values that we obtain with these methods are called _point estimates_. What is a point estimate? Recall the definition of a field that is the outcome of a purely spatial process:
$$
z_i = f(u_i, v_i) + \epsilon_i
$$
Accordingly, the prediction of the field at a new location is defined as a function of the estimated process and some random residual as follows:
$$
\hat{z}_p = \hat{f}(u_p, v_p) + \hat{\epsilon}_p
$$
The first part of the prediction ($\hat{f}(u_p, v_p)$) is the point estimate of the prediction, whereas the second part ($\hat{\epsilon}_p$) is the random part of the estimate.
The methods we saw in Chapter \@ref(spatially-continuous-data-i) can be used to estimate point estimates of the process. Unfortunately, they do not provide an estimate for the random element, so it is not possible to assess the uncertainty of the estimated values directly, since this depends on the distribution of the random term.
There are different ways in which at least some crude assessment of uncertainty can be attached to point estimates obtained from Voronoi polygons, IDW, or $k$-point means. For example, a very simple approach could be to use the sample variance to calculate intervals of confidence. This could be done as follows.
We know that the sample variance describes the inherent variability in the distribution of values of a variable in a sample. Consider for instance the distribution of the variable in the Walker Lake dataset:
```{r}
ggplot(data = Walker_Lake, aes(V)) +
geom_histogram(binwidth = 60)
```
Clearly, there are no values of the variable less than zero, and values in excess of 1,000 are rare.
The standard deviation of the sample is:
```{r}
sd(Walker_Lake$V)
```
The standard deviation is the average deviation from the mean. We could use this value to say that typical deviations from our point estimates are a function of this standard deviation (to what extent, it depends on the underlying distribution).
A problem with using this approach is that the distribution of the variable is not normal, and the distribution of $\hat{\epsilon}_p$ is unknown; the standard deviation is centered on the mean (meaning that it is a poor estimate for observations away from the mean); and in any case the standard deviation of the sample is too large for local point estimates if there is spatial pattern (since we know that the local mean will vary systematically).
There are other approaches to deal with non-normal variables, for instance Wilcox's test, but some of the other limitations remain.
```{r}
wilcox.test(Walker_Lake$V, conf.int = TRUE, conf.level = 0.95)
```
As an alternative, the _local_ standard deviation could be used.
Consider the case of $k$-point means. The point estimate is based on the values of the $k$-nearest neighbors:
$$
\hat{z}_p = \frac{\sum_{i=1}^n{w_{pi}z_i}}{\sum_{i=1}^n{w_{pi}}}
$$
With:
$$
w_{pi} = \bigg\{\begin{array}{ll}
1 & \text{if } i \text{ is one of } kth \text{ nearest neighbors of } p \text{ for a given }k \\
0 & otherwise \\
\end{array}
$$
The standard deviation could be calculated also based in the values of the $k$-nearest neighbors, meaning that it would be based on the local mean. Here, we will interpolate the field using the Walker Lake data. First create a target grid for interpolation, and extract the coordinates of observations:
```{r}
target_xy = expand.grid(x = seq(0.5, 259.5, 2.2), y = seq(0.5, 299.5, 2.2))
source_xy = cbind(x = Walker_Lake$X, y = Walker_Lake$Y)
```
Interpolation using $k=5$ neighbors:
```{r cache=TRUE}
kpoint.5 <- kpointmean(source_xy = source_xy, z = Walker_Lake$V, target_xy = target_xy, k = 5)
```
We can plot the interpolated field now. These are the interpolated values:
```{r}
ggplot(data = kpoint.5, aes(x = x, y = y, fill = z)) +
geom_tile() +
scale_fill_distiller(palette = "OrRd", trans = "reverse") +
coord_equal()
```
In addition, we can plot the _local_ standard deviation:
```{r}
ggplot(data = kpoint.5, aes(x = x, y = y, fill = sd)) +
geom_tile() +
scale_fill_distiller(palette = "OrRd", trans = "reverse") +
coord_equal()
```
The local standard deviation indicates the typical deviation from the local mean. As expected, the local values of the standard deviation are usually lower than the standard deviation of the sample, and it tends to be larger for the tails, that is the locations where the values are rare - we have less information, hence greater uncertainty.
The local standard deviation is a crude estimator of the uncertainty because we do not know the underlying distribution. Other approaches based on bootstrapping (randomly sampling from the observed values of the variable) could be implemented, but they are beyond the scope of the present discussion.
The issue of assessing the level of uncertainty in the predictions with Voronoi polygons, IDW, and $k$-point means reflects the fact that these methods were not designed to deal explicitly with the random nature of predicting fields. Other methods deal with this issue more naturally. We will revisit two estimation methods that we covered before, and see how they can be applied to spatial interpolation.
## Trend surface analysis
Trend surface analysis is a form of multivariate regression that uses the coordinates of the observations to fit a surface to the data.
We can illustrate this technique by means of a simulated example. We will begin by simulating a set of observations, beginning with the coordinates in the square unit region:
```{r}
# `n` is the number of observations to simulate
n <- 180
# Here we create a dataframe with these values: `u` and `v` will be the coordinates of our process
df <- data.frame(u = runif(n = n, min = 0, max = 1),
v = runif(n = n, min = 0, max = 1))
```
Once we have simulated the coordinates for the example, we can plot their locations:
```{r}
ggplot(data = df, aes(x = u, y = v)) +
geom_point() +
coord_equal()
```
We can now proceed to simulate a spatial process as follows:
```{r}
# Use `mutate()` to create a new stochastic variable `z` that is a function of the coordinates and a random normal variable that we create with `rnorm()`; this random variable has a mean of zero and a standard deviation of 0.1.
df <- mutate(df, z = 0.5 + 0.3 * u + 0.7 * v + rnorm(n = n, mean = 0, sd = 0.1))
```
A 3D scatterplot can be useful to explore the data:
```{r}
# Create a 3D scatterplot with the function `plot_ly()`. Notice that the way this function works is similar to `ggplot2`: the arguments are a dataframe, what should be plotted on the x-axis, the y-axis, the z-axis, and other aesthetics (aspects) of the plot. Here the color will be proportional to the values of `z`. The function `add_markers()` is similar to the family of `geom_` functions in `ggplot2`, but more general, since it will try to guess what you are trying to plot based on the inputs (in this case points). The function `layout()` is used to control other parts of the plot: here the `aspectratio` is selected so that the scale is identical for all three axes.
plot_ly(data = df, x = ~u, y = ~v, z = ~z, color = ~z) %>%
add_markers() %>%
layout(scene = list(
aspectmode = "manual", aspectratio = list(x=1, y=1, z=1)))
```
We can fit a trend surface to the data as follows. This is a regression model that uses the coordinates of the observations as covariates. In this case, the trend is linear:
```{r}
trend.l <- lm(formula = z ~ u + v, data = df)
summary(trend.l)
```
Given a trend surface model, we can estimate the value of the variable $z$ at locations where it was not measured. Typically this is done by interpolating on a fine grid that can be used for visualization or further analysis, as shown next.
We will begin by creating a grid for interpolation. We will call the coordinates `x.p` and `y.p`. We generate these by creating a sequence of values in the domain of the data, for instance in the [0,1] interval:
```{r}
u.p <- seq(from = 0.0, to = 1.0, by = 0.05)
v.p <- seq(from = 0.0, to = 1.0, by = 0.05)
```
For prediction, we want all combinations of `x.p` and `y.p`, so we expand these two vectors into a grid, by means of the function `expand.grid()`:
```{r}
# The function `expand.grid()` creates a grid with all the combination of values of the inputs.
df.p <- expand.grid(u = u.p, v = v.p)
```
Notice that while `u.p` and `v.p` are vectors of size 21, the dataframe `df.p` contains `{r}21 * 21` observations, that is, all the combinations of `u.p` and `v.p`.
Once we have the coordinates for interpolation, the `predict()` function can be used in conjunction with the results of the estimation. When invoking the function, we indicate that we wish to obtain as well the standard errors of the fitted values (`se.fit = TRUE`), as well as the interval of the predictions at a 95% level of confidence:
```{r}
preds <- predict(trend.l, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```
The interval of confidence _of the predictions_ at the 95% level of confidence is given in the form of the lower (`lwr`) and upper (`upr`) bounds:
```{r}
summary(preds$fit)
```
These values indicate that the predictions of $z_p$ are, with 95% of confidence, in the following interval:
$$
CI_{z_p} = [z_{p_{lwr}}, z_{p_{upr}}].
$$
A convenient way to visualize the results of the analysis above, that is, to inspect the trend surface and the interval of confidence of the predictions, is by means of a 3D plot as follows.
First create matrices with the point estimates of the trend surface (`z.p`), and the lower and upper bounds (`z.p_l`, `z.p_u`):
```{r}
z.p <- matrix(data = preds$fit[,1], nrow = 21, ncol = 21, byrow = TRUE)
z.p_l <- matrix(data = preds$fit[,2], nrow = 21, ncol = 21, byrow = TRUE)
z.p_u <- matrix(data = preds$fit[,3], nrow = 21, ncol = 21, byrow = TRUE)
```
The plot is created using the coordinates used for interpolation (`x.p` and `y.p`) and the matrices with the point estimates `z.p` and the upper and lower bounds. The type of plot in the package `plotly` is a _surface_:
```{r}
trend.plot <- plot_ly(x = ~u.p, y = ~v.p, z = ~z.p,
type = "surface", colors = "YlOrRd") %>%
add_surface(x = ~u.p, y = ~v.p, z = ~z.p_l,
opacity = 0.5, showscale = FALSE) %>%
add_surface(x = ~u.p, y = ~v.p, z = ~z.p_u,
opacity = 0.5, showscale = FALSE) %>%
layout(scene = list(
aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
trend.plot #%>% add_markers(data = df, x = ~x, y = ~y, z = ~z, color = ~residual)
```
In this way, we have not only an estimate of the underlying field, but also a measure of uncertainty for our predictions, since our estimated values are bound, with 95% confidence, between the lower and upper surfaces.
It is important to note that, although the confidence interval provides a measure of uncertainty, it does not provide an estimate of the prediction error $\hat{\epsilon}_p$. This quantity cannot be calculated directly, because we _do not know the true value of the field at location $p$_. We will revisit this point later.
For the time being, we will apply trend surface analysis to the Walker Lake dataset.
We will first calculate the polynomial terms of the coordinates, for instance to the 3rd degree (this can be done to any arbitrary degree, however keeping in mind the caveats discussed previously with respect to trend surface analysis):
```{r}
Walker_Lake <- mutate(Walker_Lake,
X3 = X^3, X2Y = X^2 * Y, X2 = X^2,
XY = X * Y,
Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
```
We can proceed to estimate the following models.
Linear trend surface model:
```{r}
WL.trend1 <- lm(formula = V ~ X + Y, data = Walker_Lake)
summary(WL.trend1)
```
Quadratic trend surface model:
```{r}
WL.trend2 <- lm(formula = V ~ X2 + X + XY + Y + Y2, data = Walker_Lake)
summary(WL.trend2)
```
Cubic trend surface model:
```{r}
WL.trend3 <- lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3,
data = Walker_Lake)
summary(WL.trend3)
```
Inspection of the results of the three models above suggests that the cubic trend surface provides the best fit, with the highest adjusted coefficient of determination, even if the value is relatively low at approximately 0.16. Also, the cubic trend yields the smallest standard error, which implies that the intervals of confidence are tighter, and hence the degree of uncertainty is smaller.
We will compare two of these models to see how well they fit the data.
First, we create an interpolation grid. Summarize the information to ascertain the domain of the data:
```{r}
summary(Walker_Lake[,2:3])
```
We can see that the spatial domain is in the range of [8,251] in X, and [8,291] in Y. Based on this information, we will generate the following sequence that then is expanded into a grid for prediction:
```{r}
X.p <- seq(from = 0.0, to = 255.0, by = 2.5)
Y.p <- seq(from = 0.0, to = 295.0, by = 2.5)
df.p <- expand.grid(X = X.p, Y = Y.p)
```
To this dataframe we add the polynomial terms:
```{r}
df.p <- mutate(df.p, X3 = X^3, X2Y = X^2 * Y, X2 = X^2,
XY = X * Y,
Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
```
The interpolated quadratic surface is then obtained as:
```{r}
WL.preds2 <- predict(WL.trend2, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```
Whereas the interpolated cubic surface is obtained as:
```{r}
WL.preds3 <- predict(WL.trend3, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)
```
The predictions are transformed into matrices for plotting.
Quadratic trend surface and lower and upper bounds of the predictions:
```{r}
z.p2 <- matrix(data = WL.preds2$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_l <- matrix(data = WL.preds2$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_u <- matrix(data = WL.preds2$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)
```
Cubic trend surface and lower and upper bounds of the predictions:
```{r}
z.p3 <- matrix(data = WL.preds3$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_l <- matrix(data = WL.preds3$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_u <- matrix(data = WL.preds3$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)
```
This is the quadratic trend surface with its confidence interval of predictions:
```{r}
WL.plot2 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p2,
type = "surface", colors = "YlOrRd") %>%
add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_l,
opacity = 0.5, showscale = FALSE) %>%
add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_u,
opacity = 0.5, showscale = FALSE) %>%
layout(scene = list(
aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot2
```
And, this is the cubic trend surface with its confidence interval of predictions:
```{r}
WL.plot3 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p3,
type = "surface", colors = "YlOrRd") %>%
add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_l,
opacity = 0.5, showscale = FALSE) %>%
add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_u,
opacity = 0.5, showscale = FALSE) %>%
layout(scene = list(
aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot3
```
Alas, these models are not very reliable estimates of the underlying field. As can be seen from the plots, the confidence intervals are extremely wide, and in both cases include negative numbers in the lower bound. The uncertainty associated with these predictions is quite substantial.
Another question, however, is whether the point estimates are accurate. To get a sense of whether this is the case we can add the observations to the plot:
```{r}
WL.plot3 %>%
add_markers(data = Walker_Lake, x = ~X, y = ~Y, z = ~V,
color = ~V, opacity = 0.7, showlegend = FALSE)
```
Alas, the trend surface does a mediocre job with the point estimates as well.
A possible reason for this is that the model failed to capture all or even most of the systematic spatial variability of this field. To explore this, we will plot the residuals of the model, after labeling them as "positive" or "negative":
```{r}
Walker_Lake$residual3 <- ifelse(WL.trend3$residuals > 0, "Positive", "Negative")
```
Plot the residuals:
```{r}
ggplot(data = Walker_Lake,
aes(x = X, y = Y, color = residual3)) +
geom_point() +
coord_equal()
```
Visual inspection of the distribution of the residuals strongly suggests that they are not random. We can check this by means of Moran's $I$ coefficient, if we create a list of spatial weights as follows:
```{r}
# Create a set of spatial weights with the 5 nearest neighbors.
WL.listw <- Walker_Lake[,2:3] %>%
as.matrix() %>%
knearneigh(k = 5) %>%
knn2nb() %>%
nb2listw()
```
The results of the autocorrelation analysis of the residuals are:
```{r}
moran.test(x = WL.trend3$residuals, listw = WL.listw)
```
Given the low $p$-value, we fail to reject the null hypothesis, and conclude, with a high level of confidence, that the residuals are not independent. This has important implications for spatial interpolation, as we will discuss in the following chapter.
## Accuracy and precision
Before concluding this chapter, it is worthwhile to make the following distinction between accuracy and precision of the estimates.
Accuracy refers to how close the predicted values $\hat{z}_p$ are to the true values of the field. Precision refers to how much uncertainty is associated with such predictions. Narrow intervals of confidence imply greater precision, whereas the opposite is true when the intervals of confidence are wide.
An example of these two properties is as shown in Figure 1.
![Figure 1. Accuracy and precision](Spatially Continuous Data II - Figure 1.jpg)
Panel a) in the figure represents a set of accurate points, since they are on average close to the mark. However, they are imprecise, given their variability. This is akin to a good point estimate that has wide confidence intervals.
Panel b) is a set of inaccurate and imprecise points.
Panel c) is a set of precise but inaccurate points.
Finally, Panel d) is a set of accurate and precise points.
Accuracy and precision are important criteria when assessing the quality of a predictive model.
This concludes the chapter.