Redo with correct pAUC from maxnetic package
btupper committed Oct 13, 2023
1 parent ba32afa commit b552644
Showing 25 changed files with 365 additions and 310 deletions.
295 changes: 165 additions & 130 deletions docs/modeling-01.html

Large diffs are not rendered by default.

Binary file modified docs/modeling-01_files/figure-html/unnamed-chunk-8-1.png
190 changes: 97 additions & 93 deletions docs/modeling-02.html

Large diffs are not rendered by default.

13 changes: 10 additions & 3 deletions docs/search.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion functions/stars.R
@@ -46,5 +46,5 @@ read_predictors = function(
} else {
x = do.call(c, append(xx, list(along = NA_integer_)))
}
x
st_to_180(x)
}
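The change above routes the returned raster through `st_to_180()`, the project's helper for shifting longitudes from a [0, 360) convention to [-180, 180). As a rough illustration only (this arithmetic is an assumption about the helper's intent, not taken from its source), the shift on bare coordinates looks like:

```r
# Hypothetical sketch of the longitude shift st_to_180() is used for:
# map longitudes given on [0, 360) into [-180, 180).
to_180 = function(lon) ((lon + 180) %% 360) - 180

to_180(c(0, 90, 180, 270, 359))
```

The real helper operates on the stars object's coordinate system rather than on a bare numeric vector.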
116 changes: 61 additions & 55 deletions modeling-01.qmd
@@ -1,6 +1,6 @@
---
title: "Basic modeling"
cache: true
cache: false
---

So at this point we have point data for observation and background that have been joined with common environmental covariates (aka predictors). Here we show the basic steps taken to prepare, build and assess a model. Later, we'll try more sophisticated modeling, such as modeling by month or splitting the data into training-testing groups.
@@ -116,18 +116,21 @@ hist(prediction, xlab = "prediction", main = "Basic Model")

#### How did it do?

We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnetic) package to help us assess the model. First, we need to create a table with two columns: `label` and `pred`. Label is simply a vector of 0/1 values indicating whether each predicted value is known to be background or presence. We already have that in our `input_vector`. Pred is simply the predicted value on a 0-1 scale. Once we have that we can craft a [receiver operator characteristic curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute its [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
We can use some utilities in the [maxnetic](https://github.com/BigelowLab/maxnetic) package to help us assess the model. The `pAUC()` function will compute statistics, including a presence-only AUC value. We need to pass it two items: the universe of predictions and the predictions for just the presence points.

```{r}
x = dplyr::tibble(label = input_vector, pred = as.vector(prediction))
plot_ROC(x, title = "v1.0 Basic Model")
ix = input_vector > 0
pauc = maxnetic::pAUC(prediction, prediction[ix])
plot(pauc, title = "v1.0 Basic Model")
```
Overall, this is telling us that the model isn't especially strong as a prediction tool, but it is much better than a 50-50 guess (an AUC close to 0.5, where the curve follows the light grey line). Learn more about ROC and AUC [here](https://rviews.rstudio.com/2019/01/17/roc-curves/).
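For intuition, AUC is equivalent to the probability that a randomly chosen presence point receives a higher prediction than a randomly chosen background point (the Mann-Whitney statistic). Here is a minimal base-R sketch using hypothetical `label`/`pred` vectors shaped like the table above (this is for intuition only; `maxnetic::pAUC()` computes a presence-only variant):

```r
# AUC as the rank (Mann-Whitney) statistic: the chance that a random
# presence outscores a random background point, counting ties as 1/2.
auc = function(label, pred) {
  pos = pred[label == 1]
  neg = pred[label == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
label = c(rep(1, 50), rep(0, 200))
pred = c(runif(50, 0.3, 1), runif(200, 0, 0.7))  # presences tend to score higher
auc(label, pred)  # well above the 0.5 no-skill baseline
```

A perfectly separating model scores 1, while random scores hover near 0.5.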


### Predict with rasters

We can also predict using raster inputs using our basic model. Let's read in rasters for each month of 2018, and then run a prediction for each month.
We can also predict using raster inputs using our basic model. Let's read in rasters for each month of 2019, and then run a prediction for each month.

We provide a function `read_predictors()` that will read and bind the rasters together for you given the filtered databases and paths. So, first we define the paths and filter the databases to point to just the months in 2019.

```{r}
dates = as.Date(c("2019-01-01", "2019-12-31"))
@@ -136,14 +139,6 @@ sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
dplyr::arrange(date) |>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
sst = sst_db |>
oisster::compose_filename(path = sst_path) |>
stars::read_stars(along = list(time = sst_db$date)) |>
rlang::set_names("sst")|>
st_to_180()
wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
@@ -153,33 +148,18 @@ wind_db = nbs::read_database(wind_path) |>
u_wind_db = wind_db |>
dplyr::filter(param == "u_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
u_wind = u_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = u_wind_db$date)) |>
rlang::set_names("u_wind") |>
st_to_180()
v_wind_db = wind_db |>
dplyr::filter(param == "v_wind")|>
dplyr::filter(dplyr::between(date, dates[1], dates[2]))
v_wind = v_wind_db |>
nbs::compose_filename(path = wind_path) |>
stars::read_stars(along = list(time = v_wind_db$date)) |>
rlang::set_names("v_wind") |>
st_to_180()
```


Once we have them in hand we need to bind them together. But first we need to attend to a common but important issue: the `sst` rasters and wind rasters have different extents. We can't bind them together until we warp one set to match the other. Let's warp `sst` to match `u_wind`, and then we can bind them together.
```{r}
sst_warped = stars::st_warp(sst, u_wind)
x = list(sst_warped, u_wind, v_wind)
predictors = do.call(c, append(x, list(along = NA_integer_)))
predictors = read_predictors(sst_db = sst_db,
u_wind_db = u_wind_db,
v_wind_db = v_wind_db)
predictors
```

Now we can run the prediction.
You can see that we have the rasters in one object with three attributes (`sst`, `u_wind` and `v_wind`), each with 12 layers (Jan 2019 - Dec 2019). Now we can run the prediction.

```{r}
pred = predict(model, predictors, type = 'cloglog')
Expand All @@ -202,7 +182,7 @@ Well, that certainly looks appealing with higher likelihood of near shore observ

#### How did it do?

To compute an ROC and AUC for each month, we have a little bit of work to do. We need to extract the observations and background for each month from the prediction maps. These we can then pass to the `plot_ROC()` function.
To compute an ROC and AUC for each month, we have a little bit of work to do. We need to extract the observation locations for each month from the prediction maps. These we can then plot.

:::{.callout-note}
We have to modify the date for each point to be the first date of each month. That's because our predictors are monthlies.
@@ -214,45 +194,71 @@ test_obs = obs |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_bkg = bkg |>
dplyr::filter(dplyr::between(date, dates[1], dates[2])) |>
dplyr::select(dplyr::all_of("date")) |>
dplyr::mutate(date = oisster::current_month(date))
test_input = dplyr::bind_rows(test_obs, test_bkg)
x = stars::st_extract(pred, test_input, time_column = 'date') |>
x = stars::st_extract(pred, test_obs, time_column = 'date') |>
print()
```
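The flooring of dates to the first of each month (done above with `oisster::current_month()`) can be sketched in base R; treating the two as equivalent is an assumption for illustration:

```r
# Floor each date to the first day of its month, so point dates align
# with monthly raster layers.
floor_month = function(d) as.Date(format(d, "%Y-%m-01"))

floor_month(as.Date(c("2019-03-17", "2019-12-31")))
```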

Finally we can build a table that merges the prediction with the labels. We are going to add the name of the month to group by that.

```{r}
y = x |>
dplyr::mutate(label = c(rep(1, nrow(test_obs)), rep(0, nrow(test_bkg))),
month = factor(format(date, "%b"), levels = month.abb),
.before = 2) |>
sf::st_drop_geometry() |>
dplyr::select(dplyr::all_of(c("month", "label", "pred"))) |>
dplyr::mutate(month = factor(format(date, "%b"), levels = month.abb),
.before = 1) |>
dplyr::select(dplyr::all_of(c("month", "pred", "date"))) |>
dplyr::group_by(month)
dplyr::count(y, month, label) |>
print(n = 24)
dplyr::count(y, month) |>
print(n = 12)
```

Now how about one ROC plot for each month? Yikes! This requires an iterative approach, using `group_map()`, to compute the ROC for each month. We then follow with plot wrapping by the [patchwork](https://patchwork.data-imaginist.com/articles/guides/assembly.html#functional-assembly) package.

```{r}
#| width: "100%"
rocs = dplyr::group_map(y,
function(tbl, key){
maxnetic::plot_ROC(tbl, title = sprintf("%s, n = %i", key$month, nrow(tbl)),
xlab = "", ylab = "")
})
paucs = dplyr::group_map(y,
function(tbl, key, pred_rasters = NULL){
ix = key$month %in% month.abb
x = dplyr::slice(pred_rasters, "time", ix)
pauc = maxnetic::pAUC(x, tbl)
plot(pauc,title = key$month,
xlab = "", ylab = "")
}, pred_rasters = pred)
patchwork::wrap_plots(paucs, ncol = 4)
```

Hmmm. That's surprising, yes? Why does our AUC go down during the summer months, when we have the most observations? That seems counterintuitive.

## Thinking about AUC

AUC is a diagnostic that provides a peek into the predictive power of a model. But what is it? An analogy is fitting a straight line to a small set of observations versus a large set of observations and then comparing the correlation coefficients. Here's a simple example using R's built-in dataset `cars`, which is a data frame of 50 observations of the speed and stopping distances of cars. We'll compute a linear model for the entire data set, and then a second for a small subsample of the data. (Learn more about linear models in R [here](https://rseek.org/?q=linear+models).)

```{r}
data("cars")
cars = dplyr::as_tibble(cars)
all_fit = lm(dist ~ speed, data = cars)
summary(all_fit)
```

```{r}
set.seed(5)
sub_cars = dplyr::slice_sample(cars, n = 3)
sub_fit = lm(dist ~ speed, data = sub_cars)
summary(sub_fit)
```

Hmmm. That's surprising, yes? Why does our AUC go down during the summer months? In fact, at times we are predicting the likelihood of **not** having an observation reported. It's hard to know what to think, but consider that we are using a model generated across all months of multiple years, and it might not predict a particular month and year very well. As a step toward refinement, our next step is to make 12 models, one for each month.
You can see that the `r²` value is quite high for the smaller data set, but the model may not be predictive over the full range of data. AUC is somewhat analogous to `r²` in that a relatively low score does not necessarily suggest a poor model.

```{r}
ggplot2::ggplot(data = cars, ggplot2::aes(x = speed, y = dist)) +
ggplot2::geom_point(color = "blue") +
ggplot2::geom_abline(slope = coef(all_fit)[2], intercept = coef(all_fit)[1], color = "blue") +
ggplot2::geom_point(data = sub_cars, ggplot2::aes(x = speed, y = dist), color = "orange") +
ggplot2::geom_abline(slope = coef(sub_fit)[2], intercept = coef(sub_fit)[1], color = "orange")
```





Binary file modified modeling-01_files/figure-html/unnamed-chunk-8-1.png
59 changes: 31 additions & 28 deletions modeling-02.qmd
@@ -37,8 +37,9 @@ So the colder months have fewer observations than the warmer months. We already

### Build the monthly models

Since we are building 12 models (rather than one) it is useful to create a function that computes a model for any month, and then iterate through the months of the year.

```{r}
# A function for making one month's model
#
# @param tbl a data frame of one month's observations
@@ -81,8 +82,13 @@ models = obs |>
rlang::set_names(levels(obs$month))
```
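The split-apply pattern used above (one model per month, named by month) can be sketched with base R's `split()` and `lapply()`. Here `make_model` is a hypothetical stand-in (a plain `lm()`, not a maxnet model) just to show the shape of the iteration:

```r
# Build one model per month by splitting a data frame on its month factor;
# the resulting list is named by the factor levels, as with the models above.
make_model = function(tbl) lm(dist ~ speed, data = tbl)  # placeholder model

df = data.frame(
  month = factor(rep(month.abb[1:3], each = 10), levels = month.abb[1:3]),
  speed = runif(30, 5, 25)
)
df$dist = 3 * df$speed + rnorm(30)

month_models = lapply(split(df, df$month), make_model)
names(month_models)  # "Jan" "Feb" "Mar"
```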

We can look at the response plots for every month, but for demonstration purposes, we'll just show one month.

```{r}
plot(models[['Jun']], type = 'cloglog')
```

## Predict with rasters
First we load the raster databases, as these are lightweight to pass into a function that iterates through the months.

### Load the raster databases (`sst` and `u_wind` and `v_wind`)
@@ -142,47 +148,44 @@ coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
plot_coast = function() {
plot(coast, col = 'green', add = TRUE)
}
plot(pred_rasters |> st_to_180(), hook = plot_coast)
plot(pred_rasters, hook = plot_coast)
```

Let's see what we can discern from the predictive abilities. We can extract the predicted values at the observed locations.
Let's see what we can discern from the predictive abilities. We can extract the predicted values at the observed locations. Having those in hand allows us to compute pAUC for each month.



```{r}
pred_obs = stars::st_extract(pred_rasters,
dplyr::filter(obs, dplyr::between(date, dates[1], dates[2])),
time_column = "month_id") |>
sf::st_drop_geometry()
pred_bkg = stars::st_extract(pred_rasters,
dplyr::filter(bkg, dplyr::between(date, dates[1], dates[2])),
time_column = "month_id") |>
sf::st_drop_geometry()
preds = dplyr::bind_rows(pred_obs, pred_bkg) |>
dplyr::mutate(label = c(rep(1, nrow(pred_obs)), rep(0, nrow(pred_bkg))), .before = 1) |>
dplyr::mutate(month = factor(format(time, "%b"), levels = month.abb)) |>
dplyr::mutate(month = factor(format(month_id, "%b"), levels = month.abb)) |>
dplyr::group_by(month)
aucs = dplyr::group_map(preds,
paucs = dplyr::group_map(pred_obs,
function(x, y) {
dplyr::tibble(month = y$month, auc = maxnetic::AUC(x))
}) |>
ix = month.abb %in% y$month
s = dplyr::slice(pred_rasters, "time", ix)
pauc = maxnetic::pAUC(s, x)
dplyr::tibble(month = y$month,
auc = pauc$area,
pauc = list(pauc))
})|>
dplyr::bind_rows() |>
dplyr::right_join(counts, by = "month") |>
print(n=12)
print(n = 12)
```

OK, that's unexpected. The months with the lower counts of observations have relatively higher AUCs. Huh? Let's look at that graphically.
Note that last element, `pauc`, is the result returned by the `maxnetic::pAUC()` function which we can plot.

```{r}
aucs_long = tidyr::pivot_longer(aucs, dplyr::all_of(c("n_obs", "n_bkg")),
names_to = "type", values_to = "count") |>
dplyr::mutate(type = dplyr::recode(type, n_obs = "obs", n_bkg = "bkg"))
ggplot2::ggplot(data = aucs_long, aes(x = count, y = auc, color = type)) +
ggplot2::geom_point() +
ggplot2::geom_smooth(method='lm', formula= y~x)
pp = paucs |>
dplyr::group_by(month) |>
dplyr::group_map(
function(tbl, key){
plot(tbl$pauc[[1]], title = key$month, xlab = "", ylab = "")
}
)
patchwork::wrap_plots(pp, ncol = 4)
```
Surprised? Could this be overfitting that results from sampling the background in time weighted toward the months when we have observations? Hmmmm.

Well, it would be easy to become dispirited by this result. It would be reasonable to expect AUC values to improve if we built monthly models rather than a single model applied to any month. But it seems not to be so. Darn!
Binary file modified modeling-02_files/figure-html/unnamed-chunk-7-1.png
Binary file modified modeling-02_files/figure-html/unnamed-chunk-9-1.png
