---
jupyter:
  jupytext:
    formats: ipynb,Rmd
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.4.1
  kernelspec:
    display_name: R
    language: R
    name: ir
---
# Performance tests
We have analyzed the algorithm from a theoretical point of view and tried it on a simple database to understand how it works. Now we want to test it further: we will use different datasets obtained from the bnlearn repository (https://www.bnlearn.com/bnrepository/). This resource also provides the true network structures, so we can measure the performance of our algorithm against a known ground truth.
This notebook requires the packages "bnstruct", "bnlearn" and "igraph" <br>
`install.packages('bnstruct')`
`install.packages('bnlearn')`
`install.packages('igraph')`
```{r}
library('bnlearn')
library('bnstruct')
library('igraph')   # provides graph_from_adjacency_matrix, used for plotting below
options(repr.plot.width=16, repr.plot.height=8)
```
```{r}
log.fact <- function(m, r){
    # Return log(0!), log(1!), ..., log((m+r-1)!) as a vector of length m+r
    # m: int, number of cases in the dataset
    # r: int, number of possible values a variable can assume
    fact <- log( c(1, 1:(m+r-1)) )
    for (i in 4:length(fact)) { # the first 3 entries (log 0!, log 1!, log 2!) are already correct
        fact[i] <- fact[i] + fact[i-1]
    }
    return(fact)
}
```
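As a quick sanity check, entry $k+1$ of the returned vector should equal $\log k!$, which base R exposes as `lgamma(k+1)`:
```{r}
lf <- log.fact(m = 3, r = 2)   # log(0!), log(1!), ..., log(4!)
all.equal(lf[5], lgamma(5))    # TRUE, since lgamma(k+1) == log(k!)
```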
```{r}
f <- function(index, parents, database, r, factorials) {
    # Log K2 score of node `index` given the parent set `parents`
    p <- length(parents)
    m <- nrow(database)
    todec <- r ** (0:max(0, p-1))
    index.k <- database[index] + 1   # value of node i: (0, 1, ...) -> (1, 2, ...)
    if (p > 0)
        index.j <- as.matrix(database[parents]) %*% todec + 1   # encode the parent configuration as a base-r number
    else
        index.j <- rep(1, m)
    # Compute N_{ijk}: count the occurrences of each (parent configuration, value) pair
    indexes <- cbind(index.j, index.k)
    indexes <- indexes[ order(indexes[,1], indexes[,2]), ]
    d  <- abs( diff( as.matrix(indexes) ) )
    d1 <- d[,1] + d[,2]   # > 0 where consecutive sorted rows differ
    idx.unique <- indexes[c(which(d1 > 0), length(d1)+1), ]
    N_ijk <- diff( c(0, which(d1 > 0), length(d1)+1) )
    N <- data.frame('Row'=idx.unique[,1], 'Col'=idx.unique[,2], 'Value'=N_ijk)
    # Compute N_{ij} by summing over k
    Nj <- aggregate(.~Row, data=N, sum)[, c(3)]
    # Compute sum_k log(N_{ijk}!) - already summed over the rows
    N$Value <- factorials[N$Value + 1]
    logN <- aggregate(.~Row, data=N, sum)[, c(3)]
    factor.first  <- factorials[r]        # log((r_i - 1)!)
    factor.second <- factorials[Nj + r]   # log((N_ij + r_i - 1)!)  (+1 because of the 1-based indexing)
    factor.third  <- logN
    result <- sum(factor.first - factor.second + factor.third) # factor.first could be brought outside the sum (times q_i) to gain precision
    return(result)
}
```
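For reference, `f` computes the logarithm of the Cooper-Herskovits K2 score of node $i$ with parent set $\pi_i$:
$$
\log f(i, \pi_i) = \sum_{j=1}^{q_i} \left[ \log\left((r_i-1)!\right) - \log\left((N_{ij}+r_i-1)!\right) + \sum_{k=1}^{r_i} \log\left(N_{ijk}!\right) \right]
$$
where $q_i$ is the number of distinct parent configurations observed in the data, $N_{ijk}$ is the number of cases in which variable $i$ takes its $k$-th value while its parents are in their $j$-th configuration, and $N_{ij} = \sum_k N_{ijk}$. The three terms correspond to `factor.first`, `factor.second` and `factor.third` in the code.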
```{r}
# K2 algorithm implementation
# Parameters:
# - database : data.frame of size (m, n). #? Since it is always numeric, we could also use a matrix here!
#       Dataset containing m observations of n variables, with no missing values.
#       All values must be integers between 0 and r-1.
# - u : integer
#       Maximum number of parents for each node.
# - ordering : vector of size n, or list of n1 vectors of total size n
#       If a vector, it must contain a valid permutation of the integers [1, 2 ... n], representing the "prior" ordering of variables,
#       such that the variable with index ordering[i] can have as possible parents only variables with index ordering[j] with j < i
#       (i.e. variables that "appear before it" in the provided ordering).
#       If a list, it must contain the layers of the network, representing the "prior" ordering of variables,
#       such that a variable in ordering[[i]] can have as possible parents only variables in the previous layers, i.e.
#       in ordering[[j]] with j < i. Notice that n1 <= n, and if n1 == n we are back to the previous case.
# Output:
# - adj : matrix of size (n, n)
#       Adjacency matrix of the most probable Directed Acyclic Graph found given the evidence in the database.
#       adj[i,j] is 1 if there is a connection from node j to node i (i.e. if j is a parent of i) #? We could take the transpose,
#       since the convention i -> j is more common
K2 <- function(database, u, r, ordering) {
    n <- ncol(database)
    m <- nrow(database)
    adj <- matrix(data = 0, nrow = n, ncol = n)
    factorials <- log.fact(m, r)
    for (i in 1:n) {
        i.parents <- c()
        log.p.old <- f(i, i.parents, database, r, factorials) # log-probability of a structure with i disconnected
        OKToProceed <- TRUE
        # The selection of the candidate parents could be factored out into a helper
        # function prec(i, ordering), to keep this code lighter
        if (is.list(ordering)) { # Layer ordering, each element of the list is a layer
            for (h in 1:length(ordering)) {
                if (i %in% ordering[[h]]) {
                    i.pos <- h
                    break
                }
            }
            if (i.pos == 1) {
                next
            } else {
                i.parents.candidates <- unlist( ordering[1:(i.pos - 1)] )
            }
        } else { # Normal ordering, the input is a vector
            # Compute the vector of candidate parents for i
            i.pos <- which(ordering == i) # Position of i in the ordering. Only variables before this position can be parents of i
            if (i.pos == 1) { # If i is at the start of the ordering, there are no parents to check
                next
            } else {
                i.parents.candidates <- ordering[1:(i.pos - 1)]
            }
        }
        while (OKToProceed & length(i.parents) < u) {
            log.p.new <- -Inf
            newparent.index <- NA
            # Select the node from the candidate list that maximizes the structure's probability (greedy algorithm)
            for (candidate in i.parents.candidates) {
                log.p.candidate <- f(i, c(i.parents, candidate), database, r, factorials)
                if (log.p.candidate > log.p.new) {
                    log.p.new <- log.p.candidate
                    newparent.index <- candidate
                }
            }
            # Accept the new parent candidate only if it increases the total structure's probability
            if (log.p.new > log.p.old) {
                log.p.old <- log.p.new
                i.parents <- c(i.parents, newparent.index)
                adj[i, newparent.index] = 1
                # Remove the added parent from the candidates list (by value, not by position)
                i.parents.candidates <- i.parents.candidates[i.parents.candidates != newparent.index]
            } else { # If the probability does not increase, stop adding parents to i
                OKToProceed <- FALSE
            }
        }
    }
    return(adj)
}
```
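As a minimal smoke test on synthetic data, K2 should recover the arc $x_1 \rightarrow x_2$ when $x_2$ is a noisy copy of $x_1$ (the data and parameters here are purely illustrative):
```{r}
set.seed(42)
x1 <- rbinom(1000, 1, 0.5)                    # binary source variable
x2 <- ifelse(runif(1000) < 0.9, x1, 1 - x1)   # x2 copies x1 90% of the time
toy.struct <- K2(data.frame(x1, x2), u = 1, r = 2, ordering = c(1, 2))
toy.struct   # expected: toy.struct[2, 1] == 1, i.e. x1 is a parent of x2
```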
## Examples
### Learning test
This is a simple dataset used by the bnlearn author to test the package's algorithms. It is nonetheless quite interesting in our case: it has $6$ variables, where the first five have $3$ levels and the last one only $2$. We will check whether our implementation of K2 can handle non-binary variables and datasets whose variables have different numbers of levels.
```{r}
from_str_to_num <- function(data){
    # Parameters:
    # - data: data.frame of string factors
    # Output:
    # - data.num: data.frame of numeric factors
    data.num <- data.frame(sapply(data, as.numeric) - 1)
    # -1 is needed for our convention of levels starting from 0
    return(data.num)
}
```
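For instance, a column of string factors with levels "a" and "b" is mapped to the integers $0$ and $1$:
```{r}
from_str_to_num(data.frame(x = factor(c("a", "b", "a"))))   # yields 0, 1, 0
```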
```{r}
lt.data <- from_str_to_num(learning.test)
lt.names <- colnames(lt.data)
lt.struct <- K2(lt.data, u=2, r=3, c(1,3,6,2,4,5))
head(lt.data)
```
```{r}
# Notice that our convention for the adjacency matrix is the transpose of the usual one.
# This means that before the plotting we must apply t(adj)
lt.net <- graph_from_adjacency_matrix(t(lt.struct), mode="directed")
plot(lt.net, vertex.label = lt.names, main='Learning test network')
```
Notice that the true structure is:
$$
[A][C][F][B|A][D|A,C][E|B,F]
$$
where we denote with $[\cdot]$ a node with no parents and with $[i|j]$ that node $i$ has node $j$ as parent. The structure recovered by our algorithm is exactly the true one.
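We can also check this programmatically against bnlearn, comparing our adjacency matrix with that of the true network (the modelstring below uses bnlearn's `:` separator for multiple parents):
```{r}
lt.true <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
sum(abs( t(lt.struct) - amat(lt.true)[lt.names, lt.names] ))   # 0 if no arc differs
```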
### Lizards
Lizards is a real dataset containing the species, the perch height and the perch diameter of different lizards. It has only three two-level variables, but it is an important test nonetheless, since it is our first attempt to use K2 on real-world data.
Sources: <br>
Schoener TW (1968). "The Anolis Lizards of Bimini: Resource Partitioning in a Complex Fauna". Ecology, 49(4):704–726.
```{r}
lz.data <- from_str_to_num(lizards)
lz.names <- colnames(lz.data)
lz.struct <- K2(lz.data, 2, 2, c(1,2,3))
head(lz.data)
```
```{r}
lz.net <- graph_from_adjacency_matrix( t(lz.struct), mode="directed")
plot(lz.net, vertex.label = lz.names, main='Lizards network')
```
The result is meaningful: height and diameter depend on the species. This is indeed the correct result; in fact the true structure is:
$$
[Species][Diameter|Species][Height|Species]
$$
### Alarm
Alarm is a famous dataset used as a benchmark for structure-learning algorithms: it has $37$ variables with at most four levels each. In particular it has a layered structure, which lets us explore the difference between providing K2 with a layered and a non-layered prior ordering.
```{r}
al.data <- from_str_to_num(alarm)
al.names <- colnames(al.data)
al.order <- c(18, 20, 3, 9, 15, 23, 36, 21, 19, 17, 16, 30, 11, 35, 22, 29,
              2, 24, 0, 1, 27, 25, 12, 34, 14, 33, 31, 10, 32, 13, 26, 28, 6, 5, 7, 8, 4) + 1
al.layered <- list( c(16, 24, 22, 21, 23, 12, 19, 20, 18, 17, 30, 28),
                    c(37, 10, 31, 4, 3, 25, 26), c(36, 2, 1), c(13, 35), c(34, 15),
                    c(33, 32), c(11, 14), c(27), c(29), c(7, 6, 9, 8), c(5))
al.struct <- K2(al.data, 4, 4, al.order)
al.lay.struct <- K2(al.data, 4, 4, al.layered)
head(al.data)
```
```{r}
al.diff <- sum(abs(al.struct-al.lay.struct))
cat('Number of different arcs between the two structures:', al.diff, '\n')
```
```{r}
# True graph
modelstring = paste0("[HIST|LVF][CVP|LVV][PCWP|LVV][HYP][LVV|HYP:LVF][LVF]",
"[STKV|HYP:LVF][ERLO][HRBP|ERLO:HR][HREK|ERCA:HR][ERCA][HRSA|ERCA:HR][ANES]",
"[APL][TPR|APL][ECO2|ACO2:VLNG][KINK][MINV|INT:VLNG][FIO2][PVS|FIO2:VALV]",
"[SAO2|PVS:SHNT][PAP|PMB][PMB][SHNT|INT:PMB][INT][PRSS|INT:KINK:VTUB][DISC]",
"[MVS][VMCH|MVS][VTUB|DISC:VMCH][VLNG|INT:KINK:VTUB][VALV|INT:VLNG]",
"[ACO2|VALV][CCHL|ACO2:ANES:SAO2:TPR][HR|CCHL][CO|HR:STKV][BP|CO:TPR]")
dag = model2network(modelstring)
adj <- amat(dag)
adj <- adj[al.names, al.names] # Reordering the adjacency matrix with our column order
```
```{r}
# Checking for errors: notice that we have to transpose our matrix
al.norm.diff <- sum(abs(t(al.struct)-adj))
al.lay.diff <- sum(abs(t(al.lay.struct)-adj))
cat('Number of different arcs between the normal K2 and the true network:', al.norm.diff, '\n')
cat('Number of different arcs between the layered K2 and the true network:', al.lay.diff, '\n')
```
```{r}
al.net <- graph_from_adjacency_matrix( t(al.struct), mode="directed")
al.lay.net <- graph_from_adjacency_matrix( t(al.lay.struct), mode="directed")
par(mfrow = c(1,2))
plot(al.net, vertex.label = al.names, main='Alarm normal network')
plot(al.lay.net, vertex.label = al.names, main='Alarm layered network')
```
Analyzing the results, we see that our algorithm performs well even on a complex network such as ALARM. We also see that, for a network with a layered structure like ALARM, exploiting the layering improves performance, bringing us to the same number of errors reported in the original K2 paper.
## Scaling
We now want to analyze how the computational time of the algorithm scales with its parameters:
- the number of samples $m$;
- the number of variables $n$.
### Number of samples
We use ALARM for this analysis.
```{r}
m.max <- nrow(al.data)
M <- seq(1000, m.max, by=1000)
m.times <- rep(0, length(M))
for (i in 1:length(M)) {
    t.start <- Sys.time()
    al.struct <- K2(al.data[1:M[i], ], 4, 4, al.order)
    t.stop <- Sys.time()
    m.times[i] <- as.numeric(t.stop - t.start, units = "secs") # force seconds, so all measurements share the same unit
}
```
```
```{r}
plot(M, m.times, pch=20, main = 'Time scaling wrt the number of samples',
xlab='Number of samples m', ylab='Computational time [s]', col='navy', cex=2)
```
We observe a linear scaling with $m$, which is what we expected.
### Number of variables
We fix $m$ to the minimum among the three datasets studied, that is $m_{min}=m_{liz}=409$.
```{r}
N <- c(3, 6, 37)
n.times <- rep(0, length(N))
m.min <- nrow(lz.data)
t.start <- Sys.time()
lz.struct <- K2(lz.data, 2, 2, c(1,2,3))
t.stop <- Sys.time()
n.times[1] <- as.numeric(t.stop - t.start, units = "secs")
t.start <- Sys.time()
lt.struct <- K2(lt.data[1:m.min, ], 2, 3, c(1,3,6,2,4,5))
t.stop <- Sys.time()
n.times[2] <- as.numeric(t.stop - t.start, units = "secs")
t.start <- Sys.time()
al.struct <- K2(al.data[1:m.min, ], 4, 4, al.order)
t.stop <- Sys.time()
n.times[3] <- as.numeric(t.stop - t.start, units = "secs")
```
```{r}
plot(N, n.times, pch=20, main = 'Time scaling wrt the number of variables',
xlab='Number of variables n', ylab='Computational time [s]', col='navy', cex=2)
```
We have too few points to make a reliable analysis. Nevertheless we can guess that the relation is not linear, which is in agreement with the original paper, where a quartic relation was found.
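With only three points any fit is merely indicative; still, assuming a power law $t \propto n^{\alpha}$, the exponent can be roughly estimated by regressing $\log t$ on $\log n$:
```{r}
alpha <- coef(lm(log(n.times) ~ log(N)))[2]   # slope of the log-log fit = estimated scaling exponent
cat('Estimated scaling exponent:', alpha, '\n')
```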
# Bnstruct
Bnstruct is an R package developed by Francesco Sambo and Alberto Franzin that provides objects and methods for learning the structure and parameters of a Bayesian network in various situations, such as in the presence of missing data.
One of the most important features of bnstruct is the possibility of performing *imputation*, i.e. inferring the missing data with a k-Nearest-Neighbor algorithm.
It provides $5$ different algorithms for structure learning (a usage sketch follows the list):
- Silander-Myllymaki (sm), an exact, and therefore very slow, method;
- Max-Min Parents-and-Children (mmpc), a constraint-based heuristic approach that discovers only the presence of edges, not their directionality;
- Hill Climbing (hc), another heuristic method;
- Max-Min Hill-Climbing (mmhc, the default), a combination of the previous two, which first discovers the skeleton of the network and then performs a greedy search to orient the edges;
- Structural Expectation-Maximization (sem), for learning from a dataset with missing values.
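For reference, bnstruct's own algorithms can be run through its `learn.network` function with the algorithm codes above (a minimal sketch; in what follows we only use our K2 wrapper):
```{r}
asia.bn <- asia()                                   # complete BNDataset shipped with bnstruct
net.mmhc <- learn.network(asia.bn, algo = "mmhc")   # default algorithm; other codes: "sm", "mmpc", "hc", "sem"
```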
Our goal is to code a wrapper for our K2 algorithm, so that it can be used within bnstruct. In particular we want it to:
- take as input a BNDataset;
- give as output a BN object with the computed adjacency matrix.
```{r}
bnstruct.K2 <- function(data.bn, max.parents, ordering){
    # Parameters
    # - data.bn: BNDataset
    #       Dataset in the format of the bnstruct package. It contains m samples of n variables. It can contain missing data.
    # - max.parents: numeric
    #       Maximum number of parents for a given node
    # - ordering: vector of size n, or list of n1 vectors of total size n
    #       Same semantics as in K2 above: either a permutation of [1, 2 ... n] or a list of layers.
    #
    # Output
    # - net: BN
    #       BN object containing the input dataset and the adjacency matrix computed with K2
    if( has.imputed.data(data.bn) ) {     # Pick the imputed data if available
        data <- data.frame( imputed.data(data.bn) - 1 )
    } else {
        data <- complete(data.bn)         # Otherwise keep only the samples without missing data
        data <- data.frame( data@raw.data - 1 )
    }
    r <- max(data) + 1                    # number of levels (our K2 assumes a common r for all variables)
    adj.matrix <- K2(data, max.parents, r, ordering) # Perform the K2 algorithm
    net <- BN(data.bn)                    # Create a BN object from the input data
    net@dag <- t(adj.matrix)              # Define the BN adjacency matrix (notice the t())
    return(net)
}
```
We now show some examples of use with the two datasets provided by bnstruct.
We start with Asia, a small dataset about lung diseases and visits to Asia.
```{r}
asia.data <- asia()
asia.names <- asia.data@variables
asia.struct <- bnstruct.K2(asia.data, 2, list(c(1, 3), c(2, 4), c(6, 5), c(7, 8)))
```
```{r}
asia.net <- graph_from_adjacency_matrix(asia.struct@dag, mode="directed")
plot(asia.net, vertex.label = asia.names, main='Asia network')
```
The results of this analysis are not the best ones; in fact there are some errors. However, the bnlearn documentation states that standard learning algorithms are not able to recover the true structure of the Asia network because of the presence of a node (E) with conditional probabilities equal to both 0 and 1. The correct structure would be:
$$
[A][S][T|A][L|S][B|S][D|B,E][E|T,L][X|E]
$$
Nevertheless our test is a success, since we are able to apply our algorithm directly to a dataset in the format used by bnstruct.
We now check whether it can also be used with datasets with imputed/missing data, using the Child dataset.
```{r}
ch.data <- child()
ch.data <- impute(ch.data)
ch.names <- ch.data@variables
ch.order <- list( c(8), c(2), c(5, 6, 7, 9, 1, 4), c(3, 10:15), c(16:20))
ch.struct <- bnstruct.K2(ch.data, 2, ch.order )
```
```{r}
ch.net <- graph_from_adjacency_matrix(ch.struct@dag, mode="directed")
plot(ch.net, vertex.label = ch.names, main='Child network')
```
The only links missing with respect to the ones presented in the paper are:
- CardiacMixing -> HypDistrib
- LungParench -> HypoxiaInO2
- Disease -> BirthAsphyxia

We therefore commit only 3 errors, which is a good result for a greedy algorithm.
We can conclude that we are able to proceed also in the case of an imputed dataset.