---
jupyter:
  jupytext:
    formats: ipynb,Rmd
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.4.1
  kernelspec:
    display_name: R
    language: R
    name: ir
---
# Performance tests
We have analyzed the algorithm from a theoretical point of view and tried it on a simple database to understand how it works. Now we want to test it further: we will use different datasets obtained from the bnlearn repository (https://www.bnlearn.com/bnrepository/). This resource also provides the true network structures, so we can measure the performance of our algorithm against a known ground truth.
This notebook requires the packages "bnstruct", "bnlearn" and "igraph" <br>
`install.packages('bnstruct')`
`install.packages('bnlearn')`
`install.packages('igraph')`
```{r}
library('bnlearn')
library('bnstruct')
library('igraph')   # provides graph_from_adjacency_matrix, used for plotting below
options(repr.plot.width=16, repr.plot.height=8)
```
```{r}
log.fact <- function(m, r){
    # Return log(0!), log(1!), ..., log((m+r-1)!) as a vector of length m+r
    # m: int, number of cases in the dataset
    # r: int, number of possible values a variable can assume
    fact <- log( c(1, 1:(m+r-1)) )
    for (i in 4:length(fact)) { # the first 3 entries (log 0!, log 1!, log 2!) are already correct
        fact[i] <- fact[i] + fact[i-1]
    }
    return(fact)
}
```
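As a quick sanity check, entry $k+1$ of the returned vector should equal $\log k!$, which base R exposes as `lgamma(k+1)`:
```{r}
lf <- log.fact(m = 3, r = 2)   # log(0!), log(1!), ..., log(4!)
all.equal(lf[5], lgamma(5))    # TRUE, since lgamma(k+1) == log(k!)
```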
```{r}
f <- function(index, parents, database, r, factorials) {
    # Log K2 score of node `index` given the parent set `parents`
    p <- length(parents)
    m <- nrow(database)
    todec <- r ** (0:max(0, p-1))
    index.k <- database[index] + 1   # value of node i: (0, 1, ...) -> (1, 2, ...)
    if (p > 0)
        index.j <- as.matrix(database[parents]) %*% todec + 1   # encode the parent configuration as a base-r number
    else
        index.j <- rep(1, m)
    # Compute N_{ijk}: count the occurrences of each (parent configuration, value) pair
    indexes <- cbind(index.j, index.k)
    indexes <- indexes[ order(indexes[,1], indexes[,2]), ]
    d  <- abs( diff( as.matrix(indexes) ) )
    d1 <- d[,1] + d[,2]   # > 0 where consecutive sorted rows differ
    idx.unique <- indexes[c(which(d1 > 0), length(d1)+1), ]
    N_ijk <- diff( c(0, which(d1 > 0), length(d1)+1) )
    N <- data.frame('Row'=idx.unique[,1], 'Col'=idx.unique[,2], 'Value'=N_ijk)
    # Compute N_{ij} by summing over k
    Nj <- aggregate(.~Row, data=N, sum)[, c(3)]
    # Compute sum_k log(N_{ijk}!) - already summed over the rows
    N$Value <- factorials[N$Value + 1]
    logN <- aggregate(.~Row, data=N, sum)[, c(3)]
    factor.first  <- factorials[r]        # log((r_i - 1)!)
    factor.second <- factorials[Nj + r]   # log((N_ij + r_i - 1)!)  (+1 because of the 1-based indexing)
    factor.third  <- logN
    result <- sum(factor.first - factor.second + factor.third) # factor.first could be brought outside the sum (times q_i) to gain precision
    return(result)
}
```
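For reference, `f` computes the logarithm of the Cooper-Herskovits K2 score of node $i$ with parent set $\pi_i$:
$$
\log f(i, \pi_i) = \sum_{j=1}^{q_i} \left[ \log\left((r_i-1)!\right) - \log\left((N_{ij}+r_i-1)!\right) + \sum_{k=1}^{r_i} \log\left(N_{ijk}!\right) \right]
$$
where $q_i$ is the number of distinct parent configurations observed in the data, $N_{ijk}$ is the number of cases in which variable $i$ takes its $k$-th value while its parents are in their $j$-th configuration, and $N_{ij} = \sum_k N_{ijk}$. The three terms correspond to `factor.first`, `factor.second` and `factor.third` in the code.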
```{r}
# K2 algorithm implementation
# Parameters:
# - database : data.frame of size (m, n). #? Since it is always numeric, we could also use a matrix here!
#       Dataset containing m observations of n variables, with no missing values.
#       All values must be integers between 0 and r-1.
# - u : integer
#       Maximum number of parents for each node.
# - ordering : vector of size n, or list of n1 vectors of total size n
#       If a vector, it must contain a valid permutation of the integers [1, 2 ... n], representing the "prior" ordering of variables,
#       such that the variable with index ordering[i] can have as possible parents only variables with index ordering[j] with j < i
#       (i.e. variables that "appear before it" in the provided ordering).
#       If a list, it must contain the layers of the network, representing the "prior" ordering of variables,
#       such that a variable in ordering[[i]] can have as possible parents only variables in the previous layers, i.e.
#       in ordering[[j]] with j < i. Notice that n1 <= n, and if n1 == n we are back to the previous case.
# Output:
# - adj : matrix of size (n, n)
#       Adjacency matrix of the most probable Directed Acyclic Graph found given the evidence in the database.
#       adj[i,j] is 1 if there is a connection from node j to node i (i.e. if j is a parent of i) #? We could take the transpose,
#       since the convention i -> j is more common
K2 <- function(database, u, r, ordering) {
    n <- ncol(database)
    m <- nrow(database)
    adj <- matrix(data = 0, nrow = n, ncol = n)
    factorials <- log.fact(m, r)
    for (i in 1:n) {
        i.parents <- c()
        log.p.old <- f(i, i.parents, database, r, factorials) # log-probability of a structure with i disconnected
        OKToProceed <- TRUE
        # The selection of the candidate parents could be factored out into a helper
        # function prec(i, ordering), to keep this code lighter
        if (is.list(ordering)) { # Layer ordering, each element of the list is a layer
            for (h in 1:length(ordering)) {
                if (i %in% ordering[[h]]) {
                    i.pos <- h
                    break
                }
            }
            if (i.pos == 1) {
                next
            } else {
                i.parents.candidates <- unlist( ordering[1:(i.pos - 1)] )
            }
        } else { # Normal ordering, the input is a vector
            # Compute the vector of candidate parents for i
            i.pos <- which(ordering == i) # Position of i in the ordering. Only variables before this position can be parents of i
            if (i.pos == 1) { # If i is at the start of the ordering, there are no parents to check
                next
            } else {
                i.parents.candidates <- ordering[1:(i.pos - 1)]
            }
        }
        while (OKToProceed & length(i.parents) < u) {
            log.p.new <- -Inf
            newparent.index <- NA
            # Select the node from the candidate list that maximizes the structure's probability (greedy algorithm)
            for (candidate in i.parents.candidates) {
                log.p.candidate <- f(i, c(i.parents, candidate), database, r, factorials)
                if (log.p.candidate > log.p.new) {
                    log.p.new <- log.p.candidate
                    newparent.index <- candidate
                }
            }
            # Accept the new parent candidate only if it increases the total structure's probability
            if (log.p.new > log.p.old) {
                log.p.old <- log.p.new
                i.parents <- c(i.parents, newparent.index)
                adj[i, newparent.index] = 1
                # Remove the added parent from the candidates list (by value, not by position)
                i.parents.candidates <- i.parents.candidates[i.parents.candidates != newparent.index]
            } else { # If the probability does not increase, stop adding parents to i
                OKToProceed <- FALSE
            }
        }
    }
    return(adj)
}
```
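As a minimal smoke test on synthetic data, K2 should recover the arc $x_1 \rightarrow x_2$ when $x_2$ is a noisy copy of $x_1$ (the data and parameters here are purely illustrative):
```{r}
set.seed(42)
x1 <- rbinom(1000, 1, 0.5)                    # binary source variable
x2 <- ifelse(runif(1000) < 0.9, x1, 1 - x1)   # x2 copies x1 90% of the time
toy.struct <- K2(data.frame(x1, x2), u = 1, r = 2, ordering = c(1, 2))
toy.struct   # expected: toy.struct[2, 1] == 1, i.e. x1 is a parent of x2
```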
## Examples
### Learning test
This is a simple dataset used by the bnlearn author to test the package's algorithms. It is nonetheless quite interesting in our case: it has $6$ variables, where the first five have $3$ levels and the last one only $2$. We will check whether our implementation of K2 can handle non-binary variables and datasets whose variables have different numbers of levels.
```{r}
from_str_to_num <- function(data){
    # Parameters:
    # - data: data.frame of string factors
    # Output:
    # - data.num: data.frame of numeric factors
    data.num <- data.frame(sapply(data, as.numeric) - 1)
    # -1 is needed for our convention of levels starting from 0
    return(data.num)
}
```
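For instance, a column of string factors with levels "a" and "b" is mapped to the integers $0$ and $1$:
```{r}
from_str_to_num(data.frame(x = factor(c("a", "b", "a"))))   # yields 0, 1, 0
```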
```{r}
lt.data <- from_str_to_num(learning.test)
lt.names <- colnames(lt.data)
lt.struct <- K2(lt.data, u=2, r=3, c(1,3,6,2,4,5))
head(lt.data)
```
```{r}
# Notice that our convention for the adjacency matrix is the transpose of the usual one.
# This means that before the plotting we must apply t(adj)
lt.net <- graph_from_adjacency_matrix(t(lt.struct), mode="directed")
plot(lt.net, vertex.label = lt.names, main='Learning test network')
```
Notice that the true structure is:
$$
[A][C][F][B|A][D|A,C][E|B,F]
$$
where we denote with $[\cdot]$ a node with no parents and with $[i|j]$ that node $i$ has node $j$ as parent. The structure recovered by our algorithm is exactly the true one.
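We can also check this programmatically against bnlearn, comparing our adjacency matrix with that of the true network (the modelstring below uses bnlearn's `:` separator for multiple parents):
```{r}
lt.true <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
sum(abs( t(lt.struct) - amat(lt.true)[lt.names, lt.names] ))   # 0 if no arc differs
```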
### Lizards
Lizards is a real dataset containing the species, the perch height and the perch diameter of different lizards. It has only three two-level variables, but it is an important test nonetheless, since it is our first attempt to use K2 on real-world data.
Sources: <br>
Schoener TW (1968). "The Anolis Lizards of Bimini: Resource Partitioning in a Complex Fauna". Ecology, 49(4):704–726.
```{r}
lz.data <- from_str_to_num(lizards)
lz.names <- colnames(lz.data)
lz.struct <- K2(lz.data, 2, 2, c(1,2,3))
head(lz.data)
```
```{r}
lz.net <- graph_from_adjacency_matrix( t(lz.struct), mode="directed")
plot(lz.net, vertex.label = lz.names, main='Lizards network')
```
The result is meaningful: height and diameter depend on the species. This is indeed the correct result; in fact the true structure is:
$$
[Species][Diameter|Species][Height|Species]
$$
### Alarm
Alarm is a famous dataset used as a benchmark for structure-learning algorithms: it has $37$ variables with at most four levels each. In particular it has a layered structure, which lets us explore the difference between providing K2 with a layered and a non-layered prior ordering.
```{r}
al.data <- from_str_to_num(alarm)
al.names <- colnames(al.data)
al.order <- c(18, 20, 3, 9, 15, 23, 36, 21, 19, 17, 16, 30, 11, 35, 22, 29,
              2, 24, 0, 1, 27, 25, 12, 34, 14, 33, 31, 10, 32, 13, 26, 28, 6, 5, 7, 8, 4) + 1
al.layered <- list( c(16, 24, 22, 21, 23, 12, 19, 20, 18, 17, 30, 28),
                    c(37, 10, 31, 4, 3, 25, 26), c(36, 2, 1), c(13, 35), c(34, 15),
                    c(33, 32), c(11, 14), c(27), c(29), c(7, 6, 9, 8), c(5))
al.struct <- K2(al.data, 4, 4, al.order)
al.lay.struct <- K2(al.data, 4, 4, al.layered)
head(al.data)
```
```{r}
al.diff <- sum(abs(al.struct-al.lay.struct))
cat('Number of different arcs between the two structures:', al.diff, '\n')
```
```{r}
# True graph
modelstring = paste0("[HIST|LVF][CVP|LVV][PCWP|LVV][HYP][LVV|HYP:LVF][LVF]",
"[STKV|HYP:LVF][ERLO][HRBP|ERLO:HR][HREK|ERCA:HR][ERCA][HRSA|ERCA:HR][ANES]",
"[APL][TPR|APL][ECO2|ACO2:VLNG][KINK][MINV|INT:VLNG][FIO2][PVS|FIO2:VALV]",
"[SAO2|PVS:SHNT][PAP|PMB][PMB][SHNT|INT:PMB][INT][PRSS|INT:KINK:VTUB][DISC]",
"[MVS][VMCH|MVS][VTUB|DISC:VMCH][VLNG|INT:KINK:VTUB][VALV|INT:VLNG]",
"[ACO2|VALV][CCHL|ACO2:ANES:SAO2:TPR][HR|CCHL][CO|HR:STKV][BP|CO:TPR]")
dag = model2network(modelstring)
adj <- amat(dag)
adj <- adj[al.names, al.names] # Reordering the adjacency matrix with our column order
```
```{r}
# Checking for errors: notice that we have to transpose our matrix
al.norm.diff <- sum(abs(t(al.struct)-adj))
al.lay.diff <- sum(abs(t(al.lay.struct)-adj))
cat('Number of different arcs between the normal K2 and the true network:', al.norm.diff, '\n')
cat('Number of different arcs between the layered K2 and the true network:', al.lay.diff, '\n')
```
```{r}
al.net <- graph_from_adjacency_matrix( t(al.struct), mode="directed")
al.lay.net <- graph_from_adjacency_matrix( t(al.lay.struct), mode="directed")
par(mfrow = c(1,2))
plot(al.net, vertex.label = al.names, main='Alarm normal network')
plot(al.lay.net, vertex.label = al.names, main='Alarm layered network')
```
Analyzing the results, we see that our algorithm performs well even on a complex network such as ALARM. We also see that, for a network with a layered structure like ALARM, exploiting the layering improves performance, bringing us to the same number of errors reported in the original K2 paper.
## Scaling
We now want to analyze how the computational time of the algorithm scales with its parameters:
- the number of samples $m$;
- the number of variables $n$.
### Number of samples
We use ALARM for this analysis.
```{r}
m.max <- nrow(al.data)
M <- seq(1000, m.max, by=1000)
m.times <- rep(0, length(M))
for (i in 1:length(M)) {
    t.start <- Sys.time()
    al.struct <- K2(al.data[1:M[i], ], 4, 4, al.order)
    t.stop <- Sys.time()
    m.times[i] <- as.numeric(t.stop - t.start, units = "secs") # force seconds, so all measurements share the same unit
}
```
```
```{r}
plot(M, m.times, pch=20, main = 'Time scaling wrt the number of samples',
xlab='Number of samples m', ylab='Computational time [s]', col='navy', cex=2)
```
We observe a linear scaling with $m$, which is what we expected.
### Number of variables
We fix $m$ to the minimum among the three datasets studied, that is $m_{min}=m_{liz}=409$.
```{r}
N <- c(3, 6, 37)
n.times <- rep(0, length(N))
m.min <- nrow(lz.data)
t.start <- Sys.time()
lz.struct <- K2(lz.data, 2, 2, c(1,2,3))
t.stop <- Sys.time()
n.times[1] <- as.numeric(t.stop - t.start, units = "secs")
t.start <- Sys.time()
lt.struct <- K2(lt.data[1:m.min, ], 2, 3, c(1,3,6,2,4,5))
t.stop <- Sys.time()
n.times[2] <- as.numeric(t.stop - t.start, units = "secs")
t.start <- Sys.time()
al.struct <- K2(al.data[1:m.min, ], 4, 4, al.order)
t.stop <- Sys.time()
n.times[3] <- as.numeric(t.stop - t.start, units = "secs")
```
```{r}
plot(N, n.times, pch=20, main = 'Time scaling wrt the number of variables',
xlab='Number of variables n', ylab='Computational time [s]', col='navy', cex=2)
```
We have too few points to make a reliable analysis. Nevertheless we can guess that the relation is not linear, which is in agreement with the original paper, where a quartic relation was found.
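With only three points any fit is merely indicative; still, assuming a power law $t \propto n^{\alpha}$, the exponent can be roughly estimated by regressing $\log t$ on $\log n$:
```{r}
alpha <- coef(lm(log(n.times) ~ log(N)))[2]   # slope of the log-log fit = estimated scaling exponent
cat('Estimated scaling exponent:', alpha, '\n')
```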
# Bnstruct
Bnstruct is an R package developed by Francesco Sambo and Alberto Franzin that provides objects and methods for learning the structure and parameters of a Bayesian network in various situations, such as in the presence of missing data.
One of the most important features of bnstruct is the possibility of performing *imputation*, i.e. inferring the missing data with a k-Nearest-Neighbor algorithm.
It provides $5$ different algorithms for structure learning (a usage sketch follows the list):
- Silander-Myllymaki (sm), an exact, and therefore very slow, method;
- Max-Min Parents-and-Children (mmpc), a constraint-based heuristic approach that discovers only the presence of edges, not their directionality;
- Hill Climbing (hc), another heuristic method;
- Max-Min Hill-Climbing (mmhc, the default), a combination of the previous two, which first discovers the skeleton of the network and then performs a greedy search to orient the edges;
- Structural Expectation-Maximization (sem), for learning from a dataset with missing values.
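For reference, bnstruct's own algorithms can be run through its `learn.network` function with the algorithm codes above (a minimal sketch; in what follows we only use our K2 wrapper):
```{r}
asia.bn <- asia()                                   # complete BNDataset shipped with bnstruct
net.mmhc <- learn.network(asia.bn, algo = "mmhc")   # default algorithm; other codes: "sm", "mmpc", "hc", "sem"
```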
Our goal is to code a wrapper for our K2 algorithm, so that it can be used within bnstruct. In particular we want it to:
- take as input a BNDataset;
- give as output a BN object with the computed adjacency matrix.
```{r}
bnstruct.K2 <- function(data.bn, max.parents, ordering){
    # Parameters
    # - data.bn: BNDataset
    #       Dataset in the format of the bnstruct package. It contains m samples of n variables. It can contain missing data.
    # - max.parents: numeric
    #       Maximum number of parents for a given node
    # - ordering: vector of size n, or list of n1 vectors of total size n
    #       Same semantics as in K2 above: either a permutation of [1, 2 ... n] or a list of layers.
    #
    # Output
    # - net: BN
    #       BN object containing the input dataset and the adjacency matrix computed with K2
    if( has.imputed.data(data.bn) ) {     # Pick the imputed data if available
        data <- data.frame( imputed.data(data.bn) - 1 )
    } else {
        data <- complete(data.bn)         # Otherwise keep only the samples without missing data
        data <- data.frame( data@raw.data - 1 )
    }
    r <- max(data) + 1                    # number of levels (our K2 assumes a common r for all variables)
    adj.matrix <- K2(data, max.parents, r, ordering) # Perform the K2 algorithm
    net <- BN(data.bn)                    # Create a BN object from the input data
    net@dag <- t(adj.matrix)              # Define the BN adjacency matrix (notice the t())
    return(net)
}
```
We now show some examples of use with the two datasets provided by bnstruct.
We start with Asia, a small dataset about lung diseases and visits to Asia.
```{r}
asia.data <- asia()
asia.names <- asia.data@variables
asia.struct <- bnstruct.K2(asia.data, 2, list(c(1, 3), c(2, 4), c(6, 5), c(7, 8)))
```
```{r}
asia.net <- graph_from_adjacency_matrix(asia.struct@dag, mode="directed")
plot(asia.net, vertex.label = asia.names, main='Asia network')
```
The results of this analysis are not the best ones; in fact there are some errors. However, the bnlearn documentation states that standard learning algorithms are not able to recover the true structure of the Asia network because of the presence of a node (E) with conditional probabilities equal to both 0 and 1. The correct structure would be:
$$
[A][S][T|A][L|S][B|S][D|B,E][E|T,L][X|E]
$$
Nevertheless our test is a success, since we are able to apply our algorithm directly to a dataset in the format used by bnstruct.
We now check whether it can also be used with datasets with imputed/missing data, using the Child dataset.
```{r}
ch.data <- child()
ch.data <- impute(ch.data)
ch.names <- ch.data@variables
ch.order <- list( c(8), c(2), c(5, 6, 7, 9, 1, 4), c(3, 10:15), c(16:20))
ch.struct <- bnstruct.K2(ch.data, 2, ch.order )
```
```{r}
ch.net <- graph_from_adjacency_matrix(ch.struct@dag, mode="directed")
plot(ch.net, vertex.label = ch.names, main='Child network')
```
The only links missing with respect to the ones presented in the paper are:
- CardiacMixing -> HypDistrib
- LungParench -> HypoxiaInO2
- Disease -> BirthAsphyxia

We therefore commit only 3 errors, which is a good result for a greedy algorithm.
We can conclude that we are able to proceed also in the case of an imputed dataset.