\section{Results} % 500 words in total
\label{sec:results}
\subsection*{Global accuracy}
% Overview of numbers
% global accuracy
Global accuracy is shown in Table \ref{tab:global_accuracy}. At first glance the
results may seem underwhelming, but a closer look shows this holds only for some models,
typically those using smaller chip sizes and simpler architectures. Out of the 60 tested models,
14 reach a global accuracy above 0.5, 6 above 0.6, and one above 0.7. For comparison with
established LULC models, \citet{venter2022global} report that ESRI's
Land Cover reaches a global accuracy of 0.75, Google's Dynamic World 0.71, and ESA's World
Cover 0.65, putting our highest-performing models on par with these. However, they
perform considerably worse than established LCZ models, which report a global accuracy of
0.87 \citep{taubenbock2020}. This difference is expected, as LCZ classes are designed with remote sensing
in mind, while spatial signatures aim to reflect form and function independently of whether
the distinction between two signature types is visible from satellite imagery. Nevertheless,
global accuracy is far from providing the full picture.
\begin{table}
\centering
\begin{tabular}{llrrr}
\toprule
 & Chip Size & B.I.C. & S.I.C. & M.O.R. \\
\midrule
\multirow[t]{4}{*}{maxprob} & 8 & 0.30 & 0.32 & 0.29 \\
& 16 & 0.27 & 0.35 & 0.28 \\
& 32 & 0.42 & 0.34 & 0.35 \\
& 64 & \textit{0.50} & 0.46 & \textit{0.58} \\
\cline{1-2}
\multirow[t]{4}{*}{logite} & 8 & 0.32 & 0.35 & 0.31 \\
& 16 & 0.36 & 0.36 & 0.31 \\
& 32 & 0.46 & 0.36 & 0.36 \\
& 64 & \textbf{0.60} & 0.47 & \textit{0.58} \\
\cline{1-2}
\multirow[t]{4}{*}{logite-wx} & 8 & 0.34 & 0.41 & 0.33 \\
& 16 & 0.42 & 0.46 & 0.33 \\
& 32 & \textit{0.52} & 0.48 & 0.39 \\
& 64 & \textbf{0.67} & \textit{0.55} & \textit{0.59} \\
\cline{1-2}
\multirow[t]{4}{*}{HistGradientBoostingClassifier} & 8 & 0.32 & 0.36 & 0.32 \\
& 16 & 0.37 & 0.37 & 0.33 \\
& 32 & 0.48 & 0.40 & 0.49 \\
& 64 & \textbf{0.62} & 0.48 & \textit{\textbf{0.71}} \\
\cline{1-2}
\multirow[t]{4}{*}{HistGradientBoostingClassifier-wx} & 8 & 0.35 & 0.44 & 0.35 \\
& 16 & 0.44 & 0.47 & 0.34 \\
& 32 & \textit{0.54} & \textit{0.50} & 0.40 \\
& 64 & \textbf{0.68} & \textit{0.56} & \textbf{0.63} \\
\bottomrule
\end{tabular}
\caption{\label{tab:global_accuracy}\footnotesize Global accuracy of all the models
tested in this study. Values higher than 0.5 are highlighted in italics, values higher than
0.6 in bold, and the value over 0.7 in bold italics. Similar tables for other global
performance metrics (Cohen's kappa, Macro F1 score, Weighted F1 score) can be found in Appendix \ref{sec:appendix_perf}.}
\end{table}
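For readers reproducing the evaluation, the two headline metrics reported above are straightforward to compute. The following sketch (illustrative Python with invented labels, not our actual predictions) shows global accuracy and Cohen's kappa, the chance-corrected variant reported alongside it in Appendix \ref{sec:appendix_perf}:

```python
from collections import Counter

def global_accuracy(y_true, y_pred):
    """Share of chips whose predicted signature type matches the observed one."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = global_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if predictions were drawn independently
    # from the two marginal class distributions.
    p_e = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)

y_true = ["urbanity", "urbanity", "open sprawl", "urban buffer"]
y_pred = ["urbanity", "open sprawl", "open sprawl", "urban buffer"]
print(global_accuracy(y_true, y_pred))  # 0.75
```

Cohen's kappa penalises models that score well merely by predicting frequent classes, which is why we report it next to global accuracy.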
\subsection*{Within-class accuracy}
% figure 1 - within class accuracy per model
Within-class accuracy by model can be seen in Figure \ref{fig:wc_accuracy_x_model} (a
sister figure where scores are grouped by signature type rather than by model can be found in
Appendix \ref{sec:appendixB}). Some consistent patterns are already apparent. The
baseline image classification (\texttt{bic}) tends to underperform the other architectures,
especially on the more urban signature types. On the other hand, multi-output regression (\texttt{mor}), and
the larger chip sizes (32 or 64), tend to show the highest values across signature
types and models. Looking at the accuracy of individual signature types, both extremes
(urbanity on one side and the two countryside classes on the other)
tend to be the easiest to predict. Regarding the models, no immediate conclusion can be
drawn apart from a clear indication that the maximum probability (\texttt{maxprob}) approach is generally
worse than any of the modelling approaches, suggesting that there is value in the modelling step.
The within-class accuracy can be further explored using the confusion matrices available in
Appendix \ref{sec:appendixC}.
\begin{figure}
\centering
\includegraphics[width=1.0\linewidth]{fig/wc_accuracy_x_model.png}
\caption{\footnotesize Within-class accuracy scores grouped by model. Each panel
represents results from one of the five models compared, namely:
histogram-based boosted classifier (\texttt{HGBC}) with features
pertaining only to a given chip (\texttt{baseline}) or also including features
from neighbouring ones (\texttt{baseline-wx}); logit ensemble
(\texttt{logite}) with the same two variations; and a simpler maximum
probability approach (\texttt{maxprob}). Each row in the heatmap
corresponds to a pair of chip size (8, 16, 32, and 64 pixels)
and architecture (baseline image classification, or \texttt{bic}; sliding
image classification, or \texttt{sic}; and multi-output
regression, or \texttt{mor}) used in the neural network stage of the
pipeline. Colouring is standardised across panels and values range from
0 (dark purple) to 1 (bright yellow).}
\label{fig:wc_accuracy_x_model}
\end{figure}
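Within-class accuracy, as shown in the figure above, is the row-normalised diagonal of the confusion matrix, i.e. the recall of each signature type. A minimal sketch (illustrative Python; the labels are invented):

```python
def within_class_accuracy(y_true, y_pred):
    """For each class c, the share of chips observed as c that were also
    predicted as c (per-class recall)."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + int(t == p)
    return {c: hits[c] / totals[c] for c in totals}

y_true = ["urbanity", "urbanity", "open sprawl", "open sprawl", "wild countryside"]
y_pred = ["urbanity", "open sprawl", "open sprawl", "urbanity", "wild countryside"]
print(within_class_accuracy(y_true, y_pred))
# {'urbanity': 0.5, 'open sprawl': 0.5, 'wild countryside': 1.0}
```

Unlike global accuracy, this score is unaffected by how common a class is, which is what makes the per-signature comparison in the figure meaningful.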
\subsection*{Regression outputs for global performance metrics}
% global non-spatial performance regression
Whilst plotting the accuracy is a way to build an intuition about the performance of
individual options, it does not quantify their effects. The linear regressions shown in
Tables \ref{tab:non_sp_reg} and \ref{tab:non_sp_reg_wc} provide better insight. The
first regression explains global performance scores (Cohen's kappa, global accuracy, weighted
macro F1, and average macro F1). A few conclusions can be drawn. First, chip size
has a positive effect on the results and is consistently
significant across all metrics. Except for the average macro F1 score, the inclusion of the
spatial lag in the modelling step (W) also has a positive effect. Regarding the
CNN step, significance is limited, but there are indications that the sliding
image classification and multi-output regression approaches outperform the baseline image
classification. Comparing the probability modelling steps, there is an indication that
maximum probability is the least performant of the options, again suggesting the value
of post-CNN modelling.
% table 1 non spatial, one col for regression
\begin{table}
\centering
\begin{tabular}{lcccc}
\toprule
{} & $\kappa$ & Global Accuracy & Macro F1 w. & Macro F1 avg. \\
\midrule
Intercept & 0.2185*** & 0.3236*** & 0.2790*** & 0.1798*** \\
& (0.0209) & (0.0175) & (0.0174) & (0.0375) \\
(M) Logit E. & -0.0245 & -0.0256* & -0.0324** & -0.0325 \\
& (0.0168) & (0.0141) & (0.0141) & (0.0302) \\
(M) Max. Prob. & -0.0559** & -0.0606*** & -0.0421** & -0.0296 \\
& (0.0222) & (0.0187) & (0.0186) & (0.0399) \\
(A) M.O.R. & 0.0227 & -0.0357** & -0.0278* & 0.1787*** \\
& (0.0184) & (0.0155) & (0.0154) & (0.0331) \\
(A) S.I.C. & 0.0232 & -0.0247 & -0.0171 & 0.1101*** \\
& (0.0184) & (0.0155) & (0.0154) & (0.0331) \\
Chip Size & 0.0036*** & 0.0043*** & 0.0048*** & 0.0014** \\
& (0.0004) & (0.0003) & (0.0003) & (0.0006) \\
W & 0.0572*** & 0.0468*** & 0.0531*** & 0.0392 \\
& (0.0168) & (0.0141) & (0.0141) & (0.0302) \\
\midrule
$R^2$ & 0.7214 & 0.8281 & 0.8514 & 0.4191 \\
$R^2$ Adj. & 0.6899 & 0.8086 & 0.8346 & 0.3533 \\
N. & 60 & 60 & 60 & 60 \\
\bottomrule
\end{tabular}
\caption{\label{tab:non_sp_reg}\footnotesize Regression outputs explaining
global non-spatial
performance scores. Explanatory variables with a preceding (M) and (A)
correspond to binary variables for the type of model (with histogram-based
boosted classifier, or \texttt{HGBC}, as the
baseline) and architecture (with baseline image classification, or
\texttt{BIC}, as the baseline),
respectively. Standard errors in parentheses. Coefficients significant at
the 1\%, 5\%, 10\% level are noted with ***, **, and *, respectively.}
\end{table}
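The regression above is an ordinary least squares fit on dummy-coded experimental factors. The idea can be sketched on synthetic data as follows (illustrative Python; the factor names mirror the table's layout, but all coefficients and data here are made up, not the estimates reported above):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 60  # one observation per tested model configuration

# Dummy-coded factors: a binary architecture indicator (baseline = BIC),
# chip size in pixels, and inclusion of the spatial lag (W).
is_mor = rng.integers(0, 2, size=n).astype(float)
chip_size = rng.choice([8.0, 16.0, 32.0, 64.0], size=n)
has_w = rng.integers(0, 2, size=n).astype(float)

# Synthetic performance score generated from known coefficients plus noise.
score = 0.32 + 0.004 * chip_size + 0.05 * has_w + rng.normal(0.0, 0.02, n)

# OLS via least squares; columns: intercept, M.O.R. dummy, chip size, W.
X = np.column_stack([np.ones(n), is_mor, chip_size, has_w])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
# beta recovers approximately [0.32, 0.0, 0.004, 0.05]
```

Each coefficient is then read as the expected change in the score relative to the omitted baseline category, which is how the table entries are interpreted in the text.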
\subsection*{Regression outputs for within-class accuracy}
% within class accuracy regression
Table \ref{tab:non_sp_reg_wc} then returns to the within-class accuracy, quantifying
what we have seen in Figure \ref{fig:wc_accuracy_x_model}.
Multi-output regression consistently outperforms both baseline image classification and
sliding image classification (which itself shows inconsistent results). Chip size has,
again, a positive effect on performance, while the inclusion of the spatial
lag in the modelling step also consistently shows a positive impact. As suggested above,
signature types at both extremes of the urban--wild range tend to be
easier to predict than the classes in between, which are, conceptually, the most challenging
due to the higher amount of \textit{transition land} between class core areas.
\begin{table}
\begin{tabular}{lccc}
\toprule
{Within-Class Accuracy} & Baseline & Absolute imb. & Relative imb. \\
\midrule
Intercept & 0.1866*** & -0.0237 & 0.0595** \\
& (0.0308) & (0.0311) & (0.0303) \\
(M) Logit E. & -0.0125 & -0.0125 & -0.0125 \\
& (0.0159) & (0.0141) & (0.0146) \\
(M) Max. Prob. & -0.0188 & -0.0188 & -0.0188 \\
& (0.0211) & (0.0186) & (0.0193) \\
(A) M.O.R. & 0.1753*** & 0.2512*** & 0.1753*** \\
& (0.0175) & (0.0163) & (0.0160) \\
(A) S.I.C. & 0.1202*** & -0.0783*** & 0.1202*** \\
& (0.0175) & (0.0209) & (0.0160) \\
Chip Size & 0.0014*** & 0.0041*** & 0.0014*** \\
& (0.0003) & (0.0003) & (0.0003) \\
1k Obs. & & 0.0514*** & \\
& & (0.0036) & \\
\% Obs. & & & 0.0156*** \\
& & & (0.0013) \\
W & 0.0365** & 0.0365*** & 0.0365** \\
& (0.0159) & (0.0141) & (0.0146) \\
(S)Urbanity & 0.2358*** & 0.2022*** & 0.2574*** \\
& (0.0349) & (0.0309) & (0.0320) \\
(S)Dense urban neighbourhoods & -0.1420*** & -0.1075*** & -0.0998*** \\
& (0.0349) & (0.0309) & (0.0322) \\
(S)Dense residential neighbourhoods & -0.1414*** & -0.0836*** & -0.0983*** \\
& (0.0349) & (0.0311) & (0.0322) \\
(S)Connected residential neighbourhoods & -0.1306*** & -0.0726** & -0.0754** \\
& (0.0349) & (0.0311) & (0.0323) \\
(S)Gridded residential quarters & -0.0785** & -0.0127 & -0.0049 \\
& (0.0349) & (0.0312) & (0.0326) \\
(S)Disconnected suburbia & -0.0601* & -0.0103 & -0.0019 \\
& (0.0349) & (0.0311) & (0.0324) \\
(S)Open sprawl & -0.0845** & -0.0995*** & -0.1143*** \\
& (0.0349) & (0.0309) & (0.0321) \\
(S)Warehouse park land & -0.0857** & -0.0788** & -0.0817** \\
& (0.0349) & (0.0309) & (0.0320) \\
(S)Urban buffer & -0.0828** & -0.1382*** & -0.1753*** \\
& (0.0349) & (0.0311) & (0.0330) \\
(S)Countryside agriculture & 0.2236*** & 0.1593*** & 0.1118*** \\
& (0.0349) & (0.0312) & (0.0334) \\
(S)Wild countryside & 0.3876*** & 0.3283*** & 0.2925*** \\
& (0.0349) & (0.0311) & (0.0330) \\
\midrule
$R^2$ & 0.4979 & 0.6087 & 0.5794 \\
$R^2$ Adj. & 0.4857 & 0.5987 & 0.5686 \\
N. & 720 & 720 & 720 \\
\bottomrule
\end{tabular}
\caption{\label{tab:non_sp_reg_wc}\footnotesize Regression outputs explaining
within-class accuracy. Explanatory variables with a preceding (M),
(A) and (S)
correspond to binary variables for the type of model (with histogram-based
boosted classifier, or \texttt{HGBC}, as the
baseline), architecture (with baseline image classification, or
\texttt{BIC}, as the baseline) and spatial signature (with Accessible
suburbia as the baseline),
respectively. Standard errors in parentheses. Coefficients significant at
the 1\%, 5\%, 10\% level are noted with ***, **, and *, respectively.}
\end{table}
% figure 2 - map for a single class target/prediction/
% \begin{figure}
% \centering
% \includegraphics[width=1.0\linewidth]{fig/wc_accuracy_x_model.png}
% \caption{\footnotesize TBC}
% \label{fig:prediction_comparison_maps}
% \end{figure}
% DAB: for space constraints, we have decided to drop it for now
\subsection*{Regression outputs for spatial performance metrics}
% spatial performance regression
The regression outputs explaining differences in the spatial pattern between observed
and predicted values, measured by the Join Counts statistic, offer another, spatially
explicit, perspective on the performance of the tested model configurations. As such, they
also indicate slightly different results, presented in Table \ref{tab:sp_reg_wc}.
Neither option of the probability modelling step seems to have a significant effect on
the Join Counts results, unlike in the previous performance metrics. However, the
architecture of the neural network step shows a significant effect, as multi-output
regression, and in two out of four cases also sliding image classification, outperforms the
baseline image classification. While the effect of chip size is inconsistent across the
options, the inclusion of the spatial lag in the modelling step has a significant effect (at
either the 10\%, 5\%, or 1\% significance level). The effect of a signature type depends on
its nature. More compact urban types like \textit{Urbanity} and \textit{Dense
urban neighbourhoods} show significance when using distance-threshold spatial weights,
while sparser signature types like \textit{Open sprawl} and \textit{Urban buffer} show
significance when using a union of weights.
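For one signature type, the raw statistic behind this comparison counts how many pairs of spatially adjacent units both carry that type (a ``black--black'' join count). A minimal pure-Python sketch on an invented $2\times2$ rook-contiguity lattice (in practice a library such as PySAL's \texttt{esda} is typically used, and the counts for observed and predicted maps are then compared):

```python
def join_counts_bb(y, neighbours):
    """Count 'black-black' joins: undirected pairs of adjacent units where
    both units carry the class of interest (y == 1)."""
    counted = set()
    bb = 0
    for i, nbrs in neighbours.items():
        for j in nbrs:
            pair = (min(i, j), max(i, j))
            if pair not in counted:  # count each undirected pair once
                counted.add(pair)
                bb += y[i] * y[j]
    return bb

# 2x2 rook-contiguity lattice, units 0-3 laid out as: 0 1 / 2 3
neighbours = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
observed = {0: 1, 1: 1, 2: 0, 3: 1}
predicted = {0: 1, 1: 0, 2: 0, 3: 1}
print(join_counts_bb(observed, neighbours))   # 2
print(join_counts_bb(predicted, neighbours))  # 0
```

A large gap between the two counts, as in this toy case, signals that the prediction breaks up (or artificially consolidates) the spatial clustering of the observed map, even if many individual chips are labelled correctly.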
% table 2 spatial
\begin{table}
\begin{tabular}{lcccc}
\toprule
{} & $JC$ & $\log(JC)$ & $JC$ & $\log(JC)$ \\
{} & $W_{thr}$ & $W_{thr}$ & $W_{union}$ & $W_{union}$ \\
\midrule
Intercept & 4.3454*** & 1.4617*** & 4.7103*** & 1.6311*** \\
& (0.9507) & (0.1344) & (0.5763) & (0.1080) \\
(M) Logit E. & -0.1406 & -0.0431 & 0.1851 & 0.0481 \\
& (0.4951) & (0.0700) & (0.2995) & (0.0561) \\
(M) Max. Prob. & 0.1128 & -0.1223 & 0.2819 & 0.0223 \\
& (0.6442) & (0.0911) & (0.3887) & (0.0728) \\
(A) M.O.R. & -3.1630*** & -0.5744*** & -2.7875*** & -0.4647*** \\
& (0.5494) & (0.0777) & (0.3301) & (0.0619) \\
(A) S.I.C. & 0.0119 & -0.2390*** & -0.6666** & -0.0481 \\
& (0.5532) & (0.0782) & (0.3329) & (0.0624) \\
Chip Size & 0.0297*** & -0.0005 & -0.0061 & -0.0080*** \\
& (0.0108) & (0.0015) & (0.0065) & (0.0012) \\
W & -0.9325* & -0.1376** & -0.9556*** & -0.1785*** \\
& (0.4945) & (0.0699) & (0.2991) & (0.0560) \\
(S)Urbanity & 4.6650*** & 0.6574*** & 0.1156 & -0.1258 \\
& (1.0696) & (0.1512) & (0.6460) & (0.1211) \\
(S)Dense urban neighbourhoods & 1.7796* & 0.5094*** & 0.7480 & 0.1609 \\
& (1.0695) & (0.1512) & (0.6487) & (0.1216) \\
(S)Dense residential neighbourhoods & -0.8545 & 0.0672 & -0.4636 & -0.0920 \\
& (1.0958) & (0.1550) & (0.6647) & (0.1246) \\
(S)Connected residential neighbourhoods & -0.3656 & 0.1543 & -0.4388 & -0.1447 \\
& (1.1018) & (0.1558) & (0.6647) & (0.1246) \\
(S)Gridded residential quarters & -0.2000 & 0.1009 & -0.6203 & -0.2111* \\
& (1.0744) & (0.1519) & (0.6517) & (0.1221) \\
(S)Disconnected suburbia & -0.9752 & -0.1719 & -1.0303 & -0.3358*** \\
& (1.1213) & (0.1586) & (0.6684) & (0.1252) \\
(S)Open sprawl & 1.8342* & 0.1734 & 2.1575*** & 0.3576*** \\
& (1.0604) & (0.1499) & (0.6432) & (0.1205) \\
(S)Warehouse park land & 0.5496 & 0.2123 & 1.2245* & 0.3054** \\
& (1.0694) & (0.1512) & (0.6487) & (0.1216) \\
(S)Urban buffer & -0.0558 & -0.0931 & 2.7027*** & 0.5164*** \\
& (1.0521) & (0.1488) & (0.6382) & (0.1196) \\
(S)Countryside agriculture & -1.3759 & -0.2511* & 0.6623 & 0.0670 \\
& (1.0521) & (0.1488) & (0.6382) & (0.1196) \\
(S)Wild countryside & -2.0183* & -0.5065*** & -0.5918 & -0.1635 \\
& (1.0521) & (0.1488) & (0.6382) & (0.1196) \\
\midrule
$R^2$ & 0.1589 & 0.1954 & 0.2118 & 0.2660 \\
$R^2$ Adj. & 0.1368 & 0.1743 & 0.1913 & 0.2468 \\
N. & 665 & 665 & 670 & 670 \\
\bottomrule
\end{tabular}
\caption{\label{tab:sp_reg_wc}\footnotesize Regression outputs explaining
(the log of) differences in the spatial pattern between observed and predicted values,
as measured by the Join Counts statistic. The Join Counts for each signature were computed
using two types of spatial weights: one based on a distance threshold of 1~km ($W_{thr}$),
and another built as the union of nearest-neighbour and queen contiguity matrices ($W_{union}$).
Explanatory variables with a preceding (M), (A) and (S)
correspond to binary variables for the type of model (with histogram-based
boosted classifier, or \texttt{HGBC}, as the
baseline), architecture (with baseline image classification, or
\texttt{BIC}, as the baseline) and spatial signature (with Accessible
suburbia as the baseline),
respectively. Standard errors in parentheses. Coefficients significant at
the 1\%, 5\%, 10\% level are noted with ***, **, and *, respectively.}
\end{table}