---
title: "Barren Land Analysis R Notebook"
output:
  html_notebook: default
---

## Setup

Install the tidyverse if it isn't already installed, then load ggplot2:

```{r}
#install.packages("tidyverse")
library(ggplot2)
```

## Visualizing the Problem Space

Run a test plot with a uniform distribution:

```{r}
df <- expand.grid(x = 0:399, y = 0:599)
df$z <- runif(nrow(df))
ggplot(df, aes(x, y, fill = z)) + geom_raster()
```

Set the land to fertile (1) by default, then mark an area of barren land (0):

```{r}
df <- expand.grid(x = 0:399, y = 0:599)
df$z <- 1
# using test input {"0 292 399 307"}
df$z[df$x >= 0 & df$x <= 399 & df$y >= 292 & df$y <= 307] <- 0
ggplot(df, aes(x, y, fill = z)) + geom_raster()
```

I'm assuming the barren-land coordinates are inclusive, but that may not be the case. Let's compute the areas manually and compare them with the sample output.

```{r}
#area of top
(599-307) * 400
#area of bottom
(292-0) * 400
```
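The same two areas can also be counted directly on the grid as a cross-check on the arithmetic. This is only a small sketch; `gridCheck`, `areaTop`, and `areaBottom` are throwaway names chosen here so that `df` is left alone.

```{r}
# rebuild the 400 x 600 grid and mark the inclusive barren strip y = 292..307
gridCheck <- expand.grid(x = 0:399, y = 0:599)
gridCheck$z <- 1
gridCheck$z[gridCheck$y >= 292 & gridCheck$y <= 307] <- 0

# count the fertile cells above and below the strip
areaTop    <- sum(gridCheck$z == 1 & gridCheck$y > 307)
areaBottom <- sum(gridCheck$z == 1 & gridCheck$y < 292)
c(top = areaTop, bottom = areaBottom)  # both 116800, same as the arithmetic
```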

Inclusive it is! Moving on. 

Let's make a function that takes in a set of coordinate strings, and sets the barrenness of the land. 

```{r}
setBarren <- function(df, barren_lands, val=0){
  for (bString in barren_lands){
    b <- unlist(strsplit(bString,"[ ]")) #split on spaces
    b <- as.numeric(b)                   #convert string to numbers
    df$z[df$x >= b[1] & df$x <= b[3] & df$y >= b[2] & df$y <= b[4]] <- val
  }
  return(df)
}

# reset our grid
df <- expand.grid(x = 0:399, y = 0:599)
df$z <- 1
df <- setBarren(df, c("48 192 351 207", "48 392 351 407", "120 52 135 547", "260 52 275 547"))
ggplot(df, aes(x, y, fill = z)) + geom_raster()
```

## Manual Verification

At this point, I'm a little confused. There are 240,000 square meters of land, and the second sample output lists a largest area of 192,608, or ~80% of the land.

The areas I'd expect an algorithm to find are:

The following two horizontal areas across the top and bottom each have 20,800 square meters:

```{r}
df2 <- df
df2 <- setBarren(df2, c("0 0 399 51", "0 548 399 599"), .75)
ggplot(df2, aes(x, y, fill = z)) + geom_raster()
```

The following two vertical areas on the sides each have 28,800 square meters:

```{r}
df2 <- df
df2 <- setBarren(df2, c("0 0 47 599", "352 0 399 599"), .75)
ggplot(df2, aes(x, y, fill = z)) + geom_raster()
```
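Because the coordinates are inclusive, a rectangle's area is `(x2 - x1 + 1) * (y2 - y1 + 1)`. A minimal sketch of that check follows; `rectArea` is a hypothetical helper introduced only for this illustration.

```{r}
# hypothetical helper: inclusive area of one "x1 y1 x2 y2" coordinate string
rectArea <- function(rect) {
  b <- as.numeric(strsplit(rect, " ", fixed = TRUE)[[1]])
  (b[3] - b[1] + 1) * (b[4] - b[2] + 1)
}

rectArea("0 0 399 51")     # horizontal strip: 400 * 52 = 20800
rectArea("352 0 399 599")  # vertical strip:    48 * 600 = 28800
```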

And if you segment off each of the 9 areas A through I, these are the square meters I get for each:

A = 23,040
B = 23,808
C = 23,808
D = 22,080
E = 22,816
F = ...
G = ...
H = ...
I = ...

Areas B and C highlighted here: 

```{r}
df2 <- df
df2 <- setBarren(df2, c("136 408 259 599", "276 408 399 599"), .75)
ggplot(df2, aes(x, y, fill = z)) + geom_raster()
```

Wait a second here ... 

I was reading the instructions as largest _rectangle_ of fertile land. This is an incorrect reading of the initial instructions. We are simply looking for areas of _contiguous_ space.

If I exclude the center patch (the obvious 2nd biggest area), then the largest contiguous area would look like this:

```{r}
df2 <- df
df2 <- setBarren(df2, c("136 207 259 392"))
ggplot(df2, aes(x, y, fill = z)) + geom_raster()
```

And its area would be:

```{r}
nrow(df2[df2$z==1,])
```
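That count can be sanity-checked by hand with inclusion-exclusion: subtract the four barren bars from the full grid, add back the four 16 x 16 squares where the bars cross (they would otherwise be subtracted twice), then subtract the enclosed center patch. The sketch below only restates the test input's coordinates as arithmetic; the variable names are made up for this check.

```{r}
totalLand   <- 400 * 600                             # 240000
horizontal  <- 2 * (351 - 48 + 1) * (207 - 192 + 1)  # two 304 x 16 bars
vertical    <- 2 * (135 - 120 + 1) * (547 - 52 + 1)  # two 16 x 496 bars
crossings   <- 4 * 16 * 16                           # bar intersections, counted twice above
barren      <- horizontal + vertical - crossings
centerPatch <- (259 - 136 + 1) * (391 - 208 + 1)     # 124 x 184 enclosed patch
totalLand - barren - centerPatch                     # 192608
```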

## Algorithm 

Next up is the algorithm. There are two strategies I'd like to investigate. The first is to see whether any R-specific tools can do this; if so, I'd like to go down that route. If I don't find anything in a reasonable amount of time, I'll likely build a solution similar to [Flood fill](https://en.wikipedia.org/wiki/Flood_fill).

### R-specific tools

Well, this seems promising: [Flood fill a region of an active device in R](https://www.r-bloggers.com/flood-fill-a-region-of-an-active-device-in-r/). And [floodFill](https://www.rdocumentation.org/packages/EBImage/versions/4.14.2/topics/floodFill) seems even more promising, as it appears more straightforward. It is based on an Image object, though, which I might have trouble generalizing to the data frame I'm currently using.

```{r}
#source("https://bioconductor.org/biocLite.R")
#biocLite("EBImage")
#library("EBImage")
#floodFill(df2, c(0,0), .5)
#df3 <- floodFill(data.matrix(df2), c(1,1), .5)
#ggplot(as.data.frame(df3), aes(x, y, fill = z)) + geom_raster()
```

... commenting all of this out. I think there is likely a solution here, but I don't know quite enough to get all of the data into the form they are looking for. Might be a good pair-programming effort. 

### Flood fill

```{r}
# z=0 is barren land
# z=1 is fertile land
# z=.75 is marked fertile land that we are currently counting
dfFlood <- df
floodFill <- function(xCord, yCord) {
  zVal <- dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z
  # stop at barren (0), already-marked (.75), or missing cells
  if (is.na(zVal) || zVal == .75 || zVal == 0) { return(invisible(NULL)) }
  if (zVal == 1) {
    dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z <<- .75
    # recurse into the four neighbors, staying inside the 400 x 600 grid
    if (xCord > 0) { floodFill(xCord-1, yCord) }   #left
    if (xCord < 399) { floodFill(xCord+1, yCord) } #right
    if (yCord < 599) { floodFill(xCord, yCord+1) } #up
    if (yCord > 0) { floodFill(xCord, yCord-1) }   #down
  }
}

floodFill(0,0)

```

Out of memory! It looks like I ran out of memory on this computer after filling just 1% of the problem space. The thin band of darker blue along the bottom of the following image shows which coordinates were filled, and the calculation below it gives the percentage complete:

```{r}
ggplot(dfFlood, aes(x, y, fill = z)) + geom_raster()
100 * nrow(dfFlood[dfFlood$z==.75,]) / 240000
```
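One likely contributor is recursion depth: each filled cell adds another nested call, and R limits how deeply calls can nest. This is an assumption, since the exact error message isn't recorded here. A tiny illustration, using a hypothetical `nest` function that has nothing to do with the grid:

```{r}
# hypothetical demo: one nested call per level, like one call per filled cell
nest <- function(n) if (n == 0) 0 else 1 + nest(n - 1)

nest(100)  # shallow recursion is fine: 100

# a depth in the range of the fill path is expected to hit R's nesting or stack limit,
# so the error is caught and its message is kept instead of a number
deep <- tryCatch(nest(200000), error = function(e) conditionMessage(e))
deep
```

An explicit queue avoids this because the pending cells live in ordinary data rather than on the call stack.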

Let's add some complexity and approach this as a queue:

```{r}
dfFlood <- df
Q <- data.frame()

# z=0 is barren land
# z=1 is fertile land
# z=.75 is marked fertile land that we are currently counting
floodFill <- function(xCord, yCord) {
  zVal <- dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z
  if (is.na(zVal) | zVal == .75 | zVal == 0) { 
    return()
  } else { 
    dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z <<- .75
    if (xCord > 0) { Q <<- rbind(Q,c(x=xCord-1, y=yCord)) }   #left
    if (xCord < 399) { Q <<- rbind(Q,c(x=xCord+1, y=yCord)) } #right
    if (yCord < 599) { Q <<- rbind(Q,c(x=xCord, y=yCord+1)) } #up
    if (yCord > 0) { Q <<- rbind(Q,c(x=xCord, y=yCord-1)) }   #down
  }
}
```

And then to kick this version off:

```{r}
# add the origin to the queue to get us started
# technically this _sets_ the origin, and adds the two adjacent (up and right) to the queue
floodFill(0,0) 

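# process the queue
# (Q started as an empty data.frame(), so rbind() named its columns X1 and X0
#  after the first row's values; X1 holds x and X0 holds y)
while ( nrow(Q) > 0 ) {
  floodFill(Q[1,]$X1, Q[1,]$X0) #set and add neighbors to queue
  Q <- Q[-1,]                   #remove row 1
}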
```
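The `X1` and `X0` column names come from how `rbind()` builds a data frame when it starts from an empty `data.frame()`: with no existing columns to match, the names are derived from the first row's values rather than from the vector's `x` and `y` names. A small sketch of the difference, using throwaway `qDemo` and `qTyped` objects:

```{r}
# starting from an empty data frame, the first rbind() invents the column names
qDemo <- rbind(data.frame(), c(x = 1, y = 0))
names(qDemo)   # "X1" "X0" in the run above, i.e. named after the values

# declaring the columns up front keeps stable names
qTyped <- data.frame(x = integer(), y = integer())
qTyped[nrow(qTyped) + 1, ] <- c(1, 0)
names(qTyped)  # "x" "y"
```

With stable names, the queue can be read as `Q[1,]$x` and `Q[1,]$y` no matter which cell was pushed first.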

Hmmm ... this ran all night and didn't return. But looking at the plot, it appears to be on the right path with the dark blue section:

```{r}
ggplot(dfFlood, aes(x, y, fill = z)) + geom_raster()
```

On this Mid 2011 iMac, this took close to 2 hours to complete when I first ran it. Not sure why this didn't complete overnight on this run. The results were correct on the first run: 

```{r}
area <- nrow(dfFlood[dfFlood$z==.75,])
area
```

This area is a match to the area shown in the Sample Output. 
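Much of that runtime likely goes to the lookup itself: `dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]` compares all 240,000 rows every time a single cell is read or written. Since `expand.grid()` varies `x` fastest, a cell's row number can be computed directly instead. A sketch, with `gridDemo` and `cellIndex` as hypothetical names:

```{r}
gridDemo <- expand.grid(x = 0:399, y = 0:599)

# hypothetical helper: row number of cell (x, y) in an expand.grid(x = 0:399, y = 0:599) frame
cellIndex <- function(x, y) y * 400 + x + 1

i <- cellIndex(135, 208)
gridDemo[i, ]  # x = 135, y = 208, found without scanning the whole frame
```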

### Save the result and find the next area

```{r}
results <- c()

# save the last found area
results <- c(results, area)
results

# clear out the found area
dfFloodLoop <- dfFlood
dfFloodLoop[dfFloodLoop$z==.75,]$z <- 0

# find next fertile (1) if it exists
nextFertile <- dfFloodLoop[dfFloodLoop$z==1,][1,]

# reset the queue and populate with the nextFertile coordinates
Q <- data.frame()
floodFill(nextFertile$x,nextFertile$y) 
```

```{r}
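# process the queue
# (Q was rebuilt from an empty data.frame(), so its columns are now named
#  X135 and X208 after the first row pushed; X135 holds x and X208 holds y)
while ( nrow(Q) > 0 ) {
  floodFill(Q[1,]$X135, Q[1,]$X208) #set and add neighbors to queue
  Q <- Q[-1,]                       #remove row 1
}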
```

```{r}
ggplot(dfFloodLoop, aes(x, y, fill = z)) + geom_raster()
```

```{r}
area <- nrow(dfFlood[dfFlood$z==.75,])
results <- c(results, area)
results
```

Whoops. I think I processed that area with the wrong data frame: `floodFill` writes to `dfFlood`, but I cleared the found area in `dfFloodLoop`. Well, I needed to put it all into a loop anyway, so here goes:

### Reset and loop

```{r}
# z=0 is barren land
# z=1 is fertile land
# z=.75 is marked fertile land that we are currently counting
floodFill <- function(xCord, yCord) {
  zVal <- dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z
  if (is.na(zVal) | zVal == .75 | zVal == 0) { 
    return()
  } else { 
    dfFlood[dfFlood$x==xCord & dfFlood$y==yCord,]$z <<- .75
    if (xCord > 0) { Q[nrow(Q)+1,] <<- c(xCord-1, yCord) }   #left
    if (xCord < 399) { Q[nrow(Q)+1,] <<- c(xCord+1, yCord) } #right
    if (yCord < 599) { Q[nrow(Q)+1,] <<- c(xCord, yCord+1) } #up
    if (yCord > 0) { Q[nrow(Q)+1,] <<- c(xCord, yCord-1) }   #down
  }
}
```

```{r}
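prepQ <- function() {
  fertile <- dfFlood[dfFlood$z==1,]       # all remaining fertile cells
  if (nrow(fertile) == 0) { return(invisible(NULL)) } # nothing left, so Q stays empty
  floodFill(fertile$x[1], fertile$y[1])   # prime the Q from the next area
}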
```

```{r}
# reset our grid
df <- expand.grid(x = 0:399, y = 0:599)
df$z <- 1
df <- setBarren(df, c("48 192 351 207", "48 392 351 407", "120 52 135 547", "260 52 275 547"))
```

```{r}
# reset results and prep the Q
results <- c()
Q <- data.frame(x=integer(), y=integer())
dfFlood <- df
```

```{r}
prepQ()
```

```{r}
while ( nrow(Q) > 0 ) {

  # process the queue and set all of the area to .75
  while ( nrow(Q) > 0 ) {
    floodFill(Q[1,]$x, Q[1,]$y) #set and add neighbors to queue
    Q <- Q[-1,]                 #remove row 1
  }
  area <- nrow(dfFlood[dfFlood$z==.75,]) # count the area
  results <- c(results, area)            # save the area into the results
  dfFlood[dfFlood$z==.75,]$z <- 0        # mark the area as done

  prepQ()                                # find the next area if there is one
}
```

```{r}
results
```

Sweet. This looks like the right data. Let's put it in smallest to largest order:

```{r}
sort(results)
```
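For comparison, the same queue-based flood fill should finish in seconds when the grid is a plain matrix and the queue is preallocated, because neither the cell lookup nor the queue pop has to scan or copy a data frame. This is a swapped-in variant, not the code used above: `fertileAreas` and everything inside it are hypothetical names, and the sketch assumes inclusive `"x1 y1 x2 y2"` strings on a 400 x 600 grid.

```{r}
# Sketch only: hypothetical helper that finds every fertile region in one pass.
# rects is a character vector of inclusive "x1 y1 x2 y2" barren rectangles.
fertileAreas <- function(rects, width = 400L, height = 600L) {
  fertile <- matrix(TRUE, nrow = width, ncol = height)  # fertile[x + 1, y + 1]
  for (r in rects) {
    b <- as.integer(strsplit(r, " ", fixed = TRUE)[[1]])
    fertile[(b[1]:b[3]) + 1, (b[2]:b[4]) + 1] <- FALSE
  }

  n  <- width * height
  qx <- integer(n); qy <- integer(n)  # preallocated queue: each cell enters at most once
  dx <- c(-1L, 1L, 0L, 0L)            # left, right, up, down
  dy <- c(0L, 0L, 1L, -1L)
  areas <- integer(0)

  for (sy in seq_len(height)) {
    for (sx in seq_len(width)) {
      if (!fertile[sx, sy]) next      # barren, or already counted
      head <- 1L; tail <- 1L
      qx[1] <- sx; qy[1] <- sy
      fertile[sx, sy] <- FALSE        # mark a cell as counted when it is queued
      while (head <= tail) {
        cx <- qx[head]; cy <- qy[head]
        head <- head + 1L
        for (k in 1:4) {
          px <- cx + dx[k]; py <- cy + dy[k]
          if (px >= 1L && px <= width && py >= 1L && py <= height && fertile[px, py]) {
            fertile[px, py] <- FALSE
            tail <- tail + 1L
            qx[tail] <- px; qy[tail] <- py
          }
        }
      }
      areas <- c(areas, tail)         # tail = number of cells queued = region area
    }
  }
  sort(areas)
}

fertileAreas(c("48 192 351 207", "48 392 351 407", "120 52 135 547", "260 52 275 547"))
# 22816 192608
```

Marking a cell when it is queued, rather than when it is popped, keeps any cell from entering the queue twice, which is why a queue of `width * height` slots is always large enough.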

### Next Steps

Please continue in the [README](README.md) at the section, "Making this run-able on the command line".

Thank you,
Peter Edstrom
[peter@edstrom.net](mailto:peter@edstrom.net)
January 2018