# The GIS contract {#gis-contract}
## Breaking the GIS contract
GIS provides a table-based front-end where there is a one-to-one relationship between a geometric object and a row in a table that contains attribute metadata about that object. I call this the "GIS contract", and you can see it in the linked selections (brushing) in QGIS, Manifold and other systems.
There is some skepticism about the idea of using tables to store geometry, and the main concerns I've seen are:
- you have to copy the metadata onto every coordinate row
- you cannot properly store polygons with holes
But:
1. the first is clearly not the case; it just means "we need two tables"
2. you *can* store polygons with holes, but ggplot2 will not draw them properly
These are readily fixable problems. In fact, sp already shows how a single-table front-end can hide the messy recursive-list details of the geometry, so why not go the next step and write the front-end to store just tables?
a) we can nest tables
b) we can class lists of tables, so that all methods dispatch on the metadata table
- do this with S3 or S4, and show the templating available by cascading an inner join for coordinate extraction, or a cascading semi join to apply a subsetting down the structure (see the sketch below)
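A minimal sketch of the "two tables" answer, with hypothetical table and column names: a metadata table keyed to a coordinates table, with joins giving both extraction and subsetting:

```{r}
library(tibble)
library(dplyr)
## metadata: one row per object (names are illustrative only)
meta <- tibble(object_ = c(1, 2), name = c("island", "lake"))
## geometry: one row per coordinate, keyed back to its object
coords <- tibble(object_ = c(1, 1, 1, 2, 2, 2),
                 x = c(0, 1, 0.5, 2, 3, 2.5),
                 y = c(0, 0, 1, 2, 2, 3))
## cascading inner join extracts the coordinates for chosen objects ...
meta %>% filter(name == "island") %>% inner_join(coords, by = "object_")
## ... and a semi join applies the same subsetting to the coordinate table
semi_join(coords, filter(meta, name == "island"), by = "object_")
```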
I would like to see a system that accepts a standard input format, say classified vectors of coordinates, for the polypath and pathGrob functions that sp, ggplot2 and friends all leverage.
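As a reference point, base R's `polypath` already accepts exactly this kind of input: NA-separated (classified) coordinate vectors plus a fill rule. A quick illustration of a polygon with a hole:

```{r}
## set up a bare plotting canvas
plot.new()
plot.window(xlim = c(0, 4), ylim = c(0, 4))
## outer ring and hole ring, separated by NA in the coordinate vectors
outer_x <- c(0, 4, 4, 0); outer_y <- c(0, 0, 4, 4)
hole_x  <- c(1, 1, 3, 3); hole_y  <- c(1, 3, 3, 1)
polypath(c(outer_x, NA, hole_x), c(outer_y, NA, hole_y),
         col = "grey", rule = "evenodd")
```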
I find it more compelling that the real problems are:
1. Copying metadata onto coordinates brings the danger of ids and values getting out of sync, though at times we also want our groupings to override the "special id" anyway. It is more dangerous that the order attribute and the groupings get out of sync, or get subsetted in ways that aren't appropriate. But as ever, our front-end can protect users from making those mistakes, without also making it difficult to *really mean to make those mistakes*.
2. The ring model for a polygon is susceptible to star-shaped and other self-intersection problems, and the winding/even-odd fill rule does not apply to point-in-polygon tests.
## Spatial in R
R is really powerful for doing spatial work: sp, raster, rgeos, ncdf4, spatstat. Amazing.
Spatial tooling generally is not good enough: GIS is stuck in the 1980s with shapefiles and coverages, and when it pokes out it's very bespoke and not very flexible. Flexibility is the domain of modellers and scientific programmers, but too much flexibility that ignores the key advantages of GIS is painful.
## Key advantages in GIS
Vector shapes are complex by default and built to live in a database; it's a very natural fit.
The affine transform is the default, which works in a huge set of cases.
## Fuzzy areas
Continuous vs. discrete representation: what is a strength in GIS is not necessarily one in modelling, and vice versa.
## The **sp** package
Spatial classes, formal inheritance, hierarchical objects.
The Spatial classes provided by the sp package are very widely used because they provide the formal guarantee of the GIS contract, and use this for a systematic coupling with other tools:
* the huge number of formats provided by GDAL input/output
* powerful high-level methods for visualization, manipulation, analysis and modelling
The DataFrame is the basic unit in _sp_, but it's not *actually* a data frame: *every data frame-like behaviour is provided indirectly* by methods. This works very well, in that the object behaves like a data frame with `[` and has `dim`, `nrow`, and `$` ... but the disconnect between this and the underlying geometric details eventually becomes a limitation. Users routinely reach under the hood to get at the details with `@` ...
The attribute metadata are discrete but the geometry is continuous; this is an important point that relates to the space-is-general discussion. Spatial topology ties the geometry together, and database topology ties the system together.
The `sp` classes already know what to do with these complex objects, and there's a one-to-one relationship between the rows of the "DataFrame" part and the objects in the Spatial layer.
```{r}
library(maptools)
## read the North Carolina SIDS shapefile shipped with maptools
sp1 <- rgdal::readOGR(system.file("shapes", package = "maptools"), "sids")
sp1$id <- row.names(sp1)  ## ensure we have a unique attribute
sp1$id_number <- as.numeric(sp1$id)
spplot(sp1["id_number"])
```
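To see the point above concretely: the object answers data-frame-style queries via methods, while `@` exposes the slots underneath. A quick demonstration with the `sp1` object just created:

```{r}
## data-frame-like behaviour, all provided by methods
dim(sp1)
sp1$NAME[1:3]
sp1[1:2, c("NAME", "id")]
## reaching under the hood at the slots
class(sp1@data)
class(sp1@polygons[[1]])
```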
## The **ggplot2** package
Data frames are the fundamental unit for **ggplot2**, but for spatial data the first step is to decompose the hierarchical complexity of a Spatial object to a single data frame.
This breaks the "GIS contract", since the individual objects are now spread over multiple rows of the table of coordinates.
There's a disconnect between the `sp` package and `ggplot2`, with the main issue being that `ggplot2` needs everything in one table while `sp` uses a single object with a metadata data frame interface to the underlying nested-list geometry.
With ggplot2 we can get the same details, but first we need to `tidy()` the `Spatial` object (formerly `fortify()`). This process creates a single table with every coordinate, plus classifying columns that record which object and which part of that object each coordinate came from, as well as the coordinate order and the hole status.
```{r}
library(ggplot2)
library(broom)
library(tibble)
geomtab <- tidy(sp1, region = "id")  ## or ggplot2::fortify()
## note: fortify/tidy may return factor columns for piece and group
head(geomtab)
metatab <- as_tibble(as.data.frame(sp1))
library(dplyr)
ggplot(geomtab) +
  aes(x = long, y = lat, group = group, fill = id) +
  geom_polygon() +
  scale_fill_manual(values = bpy.colors(length(unique(geomtab$id))))
```
The interesting aspect here is that we've begun to juggle two R objects, where previously we had only one. It is possible to store the entire data set in one table, but for that we need to spread all the object metadata across every row of the geometry table.
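For completeness, here is what that single-table option looks like: a plain inner join that repeats the metadata on every coordinate row (`big_table` is a hypothetical name):

```{r}
big_table <- inner_join(geomtab, metatab, by = "id")
dim(big_table)  ## every attribute value repeated for each coordinate
```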
But there are other options. One is to `nest` the tables into the metadata data frame.
## Nested data frames
R data frames can now be easily nested using the `tidyr` package.
```{r}
library(tidyr)
## one row per object: nest the coordinates by id, then join on the metadata
single_level <- nest(geomtab, -id, .key = geometry_) %>% inner_join(metatab, by = "id")
```
In this approach we have split the `tidy`/`fortify` geometry table by object and placed each piece in the right row of the metadata data frame.
This is a nice compromise since we can now do the standard data frame operations and recreate the full structure needed for `ggplot2`.
```{r}
(asub <- single_level %>% filter(NAME %in% c("Hertford", "Camden", "Gates")))
## note, it's important that we select just the geometry_ and gg-vars, to avoid
## copying all attribute values to every coordinate
ggplot(unnest(asub %>% select(NAME, geometry_))) +
  aes(x = long, y = lat, group = group, fill = NAME) +
  geom_polygon()
```
Each component geometry table has all the coordinates classified appropriately as before. Notice that there are only a few objects with more than one part. We choose one and look at its "geometry".
`Dare` is composed of three parts, and we have a copy of that identifier and its `hole` status on every coordinate.
```{r}
## which objects have more than one part?
geomtab %>% distinct(group, id) %>% group_by(id) %>% summarize(count = n()) %>% filter(count > 1)
## Dare is object "55": unnest its coordinates, with id copied onto every row
single_level %>% filter(id == "55") %>% select(NAME, geometry_) %>% unnest()
## one row per piece shows the three parts and their hole status
single_level %>% filter(id == "55") %>% select(NAME, geometry_) %>% unnest() %>% distinct(piece, .keep_all = TRUE)
```
It's really not necessary to put those data on the coordinates, so let's nest again.
```{r}
double_level <- geomtab %>%
  group_by(group, hole, id) %>% nest(-piece, .key = group_) %>%  ## nest coordinates within parts
  group_by(id) %>% nest(.key = id_) %>%                          ## nest parts within objects
  inner_join(metatab, by = "id")
```
What do we have?
A tidy single data frame with one row per `Spatial` object, and recursively nested parts with their coordinates:
```{r}
dim(double_level)
str(filter(double_level, NAME == "Currituck")$id_)
```
We can make `ggplot` type selections and plot as before.
```{r}
bsub <- double_level %>%
  filter(NAME %in% c("Currituck", "Camden", "Dare")) %>%
  select(NAME, id_) %>%
  unnest() %>% unnest()
ggplot(bsub) +
  aes(x = long, y = lat, group = group, fill = NAME, col = group) +
  geom_polygon()
```
We can also restore this object to its `Spatial` form.
```{r}
bsub_spatial <- spbabel::sp(
  transmute(bsub, object_ = as.character(NAME), branch_ = group, island_ = !hole,
            order_ = order, x_ = long, y_ = lat),
  attr_tab = distinct(bsub, NAME, .keep_all = TRUE))
plot(bsub_spatial, col = rainbow(nrow(bsub_spatial), alpha = 0.4))
```
## Connections
ggplot2 lets you reach under the hood in an abstract way: you specify aesthetics and topology generally, assigning the x, the y, the lower-level grouping ID (for parts within objects) and the grouping ID (for objects) that define what the objects are and which attributes are used. This is extremely powerful, but not widely used.
Some ggplot2 examples to explore:

- a perspective plot in ggplot2 using a transformation, or another third geometry attribute
- the ability to drop holes (though there are limitations with multi-hole polygons anyway)

This flexibility is what a lot of users want, but sp *really* makes it hard; a sketch of hole-dropping follows.
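As a small sketch of that flexibility, dropping holes is just a filter on the `hole` classifier in the fortified table built above:

```{r}
## plot the counties with any hole rings filtered out
ggplot(geomtab %>% filter(!hole)) +
  aes(x = long, y = lat, group = group) +
  geom_polygon(fill = "grey80", colour = "black")
```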
## Pros and cons
The user's ability to choose attributes from the data gives independence of the analysis from the visualization.
## Relations and the difference between sp and ggplot2 forms
Non-experts find it difficult to deal with relational data; the level of abstraction in the process causes confusion. We see this in many fields, where a single table is the basic unit of analysis and the lessons of database normalization are nowhere to be seen. A common example is animal tracking data, which at a minimum stores a trip (or burst or group) ID, x, y, and date-time, and may include an individual or tag ID (the object). Technically, these data should have a metadata table with observations about the tag deployment (date, location, recovery, animal departure/return date), an individual table with observations on the animal species, and a table of coordinates with the actual tag space-time measurements.
This is a well-recognized problem, especially in collaborative studies where the entire data set is stored in a single CSV, but the relational-table example is a good one.
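A minimal sketch of the normalized layout (all names and values are hypothetical):

```{r}
library(tibble)
library(dplyr)
## deployment metadata: one row per tag deployment
deployments <- tibble(tag_id = c("t1", "t2"),
                      animal_id = c("a1", "a2"),
                      deployed = as.Date(c("2016-01-01", "2016-03-01")))
## individual table: one row per animal
animals <- tibble(animal_id = c("a1", "a2"), species = "shy albatross")
## measurement table: one row per space-time fix
locations <- tibble(tag_id = rep(c("t1", "t2"), each = 2),
                    x = runif(4), y = runif(4),
                    time = Sys.time() + 1:4)
## the single-CSV habit is just the denormalized join of all three
locations %>%
  inner_join(deployments, by = "tag_id") %>%
  inner_join(animals, by = "animal_id")
```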
A key definition here is the idea of a structural index (the row or column number in a table or array) versus a relational index, where the value of a key is used to match records. The relational index can be transferred from one data set to another by subsetting and appending, and generally survives re-sorting; the structural index cannot, and must either be maintained in its position or be updated when the overall data set is subsetted or changed.
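A tiny illustration of the difference (hypothetical data): the structural index changes meaning under a sort, while the relational key survives it:

```{r}
df <- tibble(key = c("a", "b", "c"), value = 1:3)
df[2, ]                                       ## structural: "row 2"
arrange(df, desc(value))[2, ]                 ## "row 2" now points at different data
filter(arrange(df, desc(value)), key == "b")  ## relational: the key still matches
```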
Recursive objects like lists in R stand in place of both structural and relational indexes: the structure of the list is an implicit marker of the index, though it might also store a particular label.