diff --git a/covariates.qmd b/covariates.qmd index a1d3476..7e28539 100644 --- a/covariates.qmd +++ b/covariates.qmd @@ -1,6 +1,6 @@ --- title: "Covariates for observations and background" -cache: true +cache: false --- Here we do the multi-step task of associating observations with environmental covariates and creating a background point data set used by the model to characterize the environment. @@ -86,27 +86,17 @@ H = hist(obs$date, breaks = 'month', format = "%Y", freq = TRUE, main = "Observations", xlab = "Date") ``` -Embedded in the returned value, `H`, are the probability densities and dates for the start of each month. -```{r} -months = H$breaks + as.Date("1970-01-01") -probs = H$density -``` - -We can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date. +We **could** use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn't to mimic the distribution of observations in time, but instead to characterize the environment. So, instead we'll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date **plus** a buffer of about 1 month. ```{r} -days = seq(from = min(obs$date), to = max(obs$date), by = "day") +n_buffer_days = 30 +days = seq(from = min(obs$date) - n_buffer_days, + to = max(obs$date) + n_buffer_days, + by = "day") ``` -Next we use the `months` and `days` to develop a look-up vector to assign a probability to each day. *Did you catch that?* The probability of selecting a given **day** depends upon the probability of an observation occurring in a given **month**. - -```{r} -index = findInterval(days, months) -day_probs = probs[index] -``` - -Now we can sample - **but how many**? 
Let's start by selecting approximately **twice** as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn't enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample. +Now we can sample - **but how many**? Let's start by selecting approximately **four times** as many background points as we have observation points. If it is too many then we can sub-sample as needed; if it isn't enough we can come back and increase the number. In addition, we may lose some samples in the subsequent steps when making a spatial sample. :::{.callout-note} Note that we set the random number generator seed. This isn't a requirement, but we use it here so that we get the same random selection each time we render the page. Here's a nice discussion about `set.seed()` [usage](https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function). @@ -114,15 +104,17 @@ Note that we set the random number generator seed. This isn't a requirement, but ```{r} set.seed(1234) -nback = nrow(obs) * 2 -days_sample = sample(days, size = nback, replace = TRUE, prob = day_probs) +nback = nrow(obs) * 4 +days_sample = sample(days, size = nback, replace = TRUE) ``` -So, now we have a sampling of of dates that have a temporal distribution similar to that of the observations. +Now we can plot the same histogram, but with the `days_sample` data. -:::{.callout-warning} -It is possible that we maybe [overfitting](https://en.wikipedia.org/wiki/Overfitting) by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection. 
-::: +```{r} +H = hist(days_sample, breaks = 'month', format = "%Y", + freq = TRUE, main = "Sample", + xlab = "Date") +``` ### Sampling space @@ -185,7 +177,7 @@ sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg")) #### Sampling the polygon -Now to sample the within the polygon, we'll sample the same number we selected earlier. +Now to sample within the polygon, we'll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes). ```{r} set.seed(1234) diff --git a/covariates_files/figure-html/unnamed-chunk-11-1.png b/covariates_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 0000000..d94171e Binary files /dev/null and b/covariates_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-13-1.png b/covariates_files/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 0000000..b691fa4 Binary files /dev/null and b/covariates_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-14-1.png b/covariates_files/figure-html/unnamed-chunk-14-1.png index 6eb957f..6a05a1b 100644 Binary files a/covariates_files/figure-html/unnamed-chunk-14-1.png and b/covariates_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-6-1.png b/covariates_files/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 0000000..9529325 Binary files /dev/null and b/covariates_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-7-1.png b/covariates_files/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..d1e6a17 Binary files /dev/null and b/covariates_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-8-1.png b/covariates_files/figure-html/unnamed-chunk-8-1.png index d1e6a17..2febb0c 100644 Binary files 
a/covariates_files/figure-html/unnamed-chunk-8-1.png and b/covariates_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/covariates_files/figure-html/unnamed-chunk-9-1.png b/covariates_files/figure-html/unnamed-chunk-9-1.png index 2febb0c..cd36256 100644 Binary files a/covariates_files/figure-html/unnamed-chunk-9-1.png and b/covariates_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/data/bkg/bkg-covariates.gpkg b/data/bkg/bkg-covariates.gpkg index 4ce6523..d4dc68f 100644 Binary files a/data/bkg/bkg-covariates.gpkg and b/data/bkg/bkg-covariates.gpkg differ diff --git a/data/bkg/buffered-polygon.gpkg b/data/bkg/buffered-polygon.gpkg index ddbb934..18840da 100644 Binary files a/data/bkg/buffered-polygon.gpkg and b/data/bkg/buffered-polygon.gpkg differ diff --git a/data/obs/obs-covariates.gpkg b/data/obs/obs-covariates.gpkg index 11af3df..8dd2935 100644 Binary files a/data/obs/obs-covariates.gpkg and b/data/obs/obs-covariates.gpkg differ diff --git a/docs/covariates.html b/docs/covariates.html index 2906586..e64817b 100644 --- a/docs/covariates.html +++ b/docs/covariates.html @@ -255,7 +255,7 @@

2.1 Load the observation data

We’ll load in our OBIS observations as a flat table, and immediately filter the data to records occurring from 2000 to the present.

-
+
source("setup.R")
 
 obs = read_obis(form = "sf") |>
@@ -278,7 +278,7 @@ 

2.2 Load the environmental predictors

Next we load the environmental predictors, sst and wind (as windspeed, u_wind and v_wind). For each we first read in the database, then compose filenames, read in the rasters and organize into layers by date, then rename to a pretty-name and finally shift the rasters from their native 0-to-360 longitude range to longitude -180-to-180 which matches our observations.

-
+
sst_path = "data/oisst"
 sst_db = oisster::read_database(sst_path) |>
   dplyr::arrange(date)
@@ -327,7 +327,7 @@ 

<

3.1 Sampling time

Sampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.

-
+
H = hist(obs$date, breaks = 'month', format = "%Y", 
      freq = TRUE, main = "Observations",
      xlab = "Date")
@@ -335,21 +335,14 @@

-

Embedded in the returned value, H, are the probability densities and dates for the start of each month.

-
-
months = H$breaks + as.Date("1970-01-01")
-probs = H$density
+

We could use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn’t to mimic the distribution of observations in time, but instead to characterize the environment. So, instead we’ll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date plus a buffer of about 1 month.

+
+
n_buffer_days = 30
+days = seq(from = min(obs$date) - n_buffer_days, 
+           to = max(obs$date) + n_buffer_days, 
+           by = "day")
-

We can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.

-
-
days = seq(from = min(obs$date), to = max(obs$date), by = "day")
-
-

Next we use the months and days to develop a look-up vector to assign a probability to each day. Did you catch that? The probability of selecting a given day depends upon the probability of an observation occurring in a given month.

-
-
index = findInterval(days, months)
-day_probs = probs[index]
-
-

Now we can sample - but how many? Let’s start by selecting approximately twice as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample.

+

Now we can sample - but how many? Let’s start by selecting approximately four times as many background points as we have observation points. If it is too many then we can sub-sample as needed; if it isn’t enough we can come back and increase the number. In addition, we may lose some samples in the subsequent steps when making a spatial sample.

@@ -363,23 +356,18 @@

Note that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. Here’s a nice discussion about set.seed() usage.

-
-
set.seed(1234)
-nback = nrow(obs) * 2
-days_sample = sample(days, size = nback, replace = TRUE, prob = day_probs)
-
-

So, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.

-
-
-
- -
-
-Warning -
+
+
set.seed(1234)
+nback = nrow(obs) * 4
+days_sample = sample(days, size = nback, replace = TRUE)
-
-

It is possible that we maybe overfitting by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.

+

Now we can plot the same histogram, but with the days_sample data.

+
+
H = hist(days_sample, breaks = 'month', format = "%Y", 
+     freq = TRUE, main = "Sample",
+     xlab = "Date")
+
+
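For contrast, the weighted approach described above could be sketched as follows. This is a minimal, self-contained sketch, not the page's workflow: synthetic `obs_dates` stand in for `obs$date`, and `day_probs` mirrors the month-density lookup that this revision removes.

```r
# Hedged sketch of the *weighted* alternative (not used on this page):
# weight each candidate day by the observation density of its month.
obs_dates = seq(as.Date("2002-01-15"), as.Date("2003-11-20"), by = "3 days")  # stand-in for obs$date
days      = seq(from = min(obs_dates), to = max(obs_dates), by = "day")

H_obs  = hist(obs_dates, breaks = "month", plot = FALSE)
months = H_obs$breaks + as.Date("1970-01-01")   # breaks are days since epoch
day_probs = H_obs$density[findInterval(days, months)]  # look up each day's month density

set.seed(1234)
days_weighted = sample(days, size = 100, replace = TRUE, prob = day_probs)
```

Sampled days would then cluster in months with many observations, which is exactly the behavior the unweighted sample avoids.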

@@ -389,17 +377,17 @@

3.2.1 The bounding box polygon

This is the easiest of the three polygons to make.

-
-
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')
-
-box = sf::st_bbox(obs) |>
-  sf::st_as_sfc()
-
-plot(sf::st_geometry(coast), extent = box, axes = TRUE)
-plot(box, lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')
+
+box = sf::st_bbox(obs) |>
+  sf::st_as_sfc()
+
+plot(sf::st_geometry(coast), extent = box, axes = TRUE)
+plot(box, lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-

+

Hmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.

@@ -407,15 +395,15 @@

3.2.2 The convex hull polygon

Also an easy polygon to make is a convex hull - the one often described as a rubber band stretched around the point locations. The key here is to take the union of the points first, which creates a single MULTIPOINT object. If you don’t, you’ll get a convex hull around every point… oops.

-
-
chull = sf::st_union(obs) |>
-  sf::st_convex_hull()
-
-plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
-plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+
chull = sf::st_union(obs) |>
+  sf::st_convex_hull()
+
+plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
+plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-

+

Well, that’s an improvement, but we still get large areas vacant of observations, including most of Nova Scotia.

@@ -423,40 +411,40 @@

3.2.3 The buffered polygon

An alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffering distance of 75000m through trial and error, and then add in a smoothing for no reason other than to improve aesthetics.

-
-
poly =  sf::st_union(obs) |>
-  sf::st_buffer(dist = 75000) |>
-  sf::st_union() |>
-  sf::st_simplify() |>
-  smoothr::smooth(method = 'chaikin', refinements = 10L)
-
-
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+
poly =  sf::st_union(obs) |>
+  sf::st_buffer(dist = 75000) |>
+  sf::st_union() |>
+  sf::st_simplify() |>
+  smoothr::smooth(method = 'chaikin', refinements = 10L)
+
+
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-

+

That seems the best yet, but we still sample on land. We’ll oversample and toss out the ones on land. Let’s save this polygon in case we need it later.

-
-
ok = dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
-sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg"))
+
+
ok = dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
+sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg"))

3.2.4 Sampling the polygon

-

Now to sample the within the polygon, we’ll sample the same number we selected earlier.

-
-
set.seed(1234)
-bkg = sf::st_sample(poly, nback) 
-
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
-plot(sf::st_geometry(coast), add = TRUE)
+

Now to sample within the polygon, we’ll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes).

+
+
set.seed(1234)
+bkg = sf::st_sample(poly, nback) 
+
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
+plot(sf::st_geometry(coast), add = TRUE)
-

+

OK - we can work with that! We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.

@@ -464,42 +452,42 @@

<

3.2.5 Purging points that are on land (or very nearshore)

It’s great if you have in hand a map that distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).

-
-
v = sst |>
-  dplyr::slice(along = "time", 1) |>
-  stars::st_extract(bkg) |>
-  sf::st_as_sf() |>
-  dplyr::mutate(is_water = !is.na(sst), .before = 1) |>
-  dplyr::glimpse()
+
+
v = sst |>
+  dplyr::slice(along = "time", 1) |>
+  stars::st_extract(bkg) |>
+  sf::st_as_sf() |>
+  dplyr::mutate(is_water = !is.na(sst), .before = 1) |>
+  dplyr::glimpse()
-
Rows: 17,122
+
Rows: 34,244
 Columns: 3
-$ is_water <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,…
-$ sst      <dbl> 14.506774, NA, 5.303548, 6.633548, 2.418387, 4.907097, 15.954…
-$ geometry <POINT [°]> POINT (-66.82782 39.91825), POINT (-64.0782 45.61369), …
+$ is_water <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,…
+$ sst      <dbl> 12.848709, 5.922903, 8.960322, 8.053871, NA, 3.865484, NA, 15…
+$ geometry <POINT [°]> POINT (-65.47837 40.75635), POINT (-65.71588 42.76753),…

Values where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), then filter to include only water.

-
-
bkg = sf::st_as_sf(bkg) |>
-  sf::st_set_geometry("geometry") |>
-  dplyr::mutate(date = days_sample, .before = 1) |>
-  dplyr::filter(v$is_water)
-
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
+
+
bkg = sf::st_as_sf(bkg) |>
+  sf::st_set_geometry("geometry") |>
+  dplyr::mutate(date = days_sample, .before = 1) |>
+  dplyr::filter(v$is_water)
+
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
-

+

Note that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.

-
-
plot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)
+
+
plot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)
-

+

@@ -510,8 +498,8 @@

4.1 Wait, what about dates?

You may already have considered an issue connecting our background points, which have daily dates, with our covariates, which are monthly (identified by the first of each month). We can manage that by adding a second date, month_id, to the bkg table. We’ll use the current_month() function from the oisster package to compute that.

-
-
bkg = dplyr::mutate(bkg, month_id = oisster::current_month(date))
+
+
bkg = dplyr::mutate(bkg, month_id = oisster::current_month(date))
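The flooring that month_id represents can be illustrated with a base-R stand-in. This is a hypothetical helper, assuming (as the text says) that current_month() maps a date to the first of its month; the real function lives in the oisster package.

```r
# Hypothetical base-R stand-in for oisster::current_month():
# map any date to the first day of its month, matching the monthly layers.
floor_month = function(x) as.Date(format(x, "%Y-%m-01"))

floor_month(as.Date(c("2004-07-19", "2004-07-01")))
# -> "2004-07-01" "2004-07-01"
```

Either way, every daily date gains a companion month_id that lines up with the time dimension of the predictors.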
@@ -519,52 +507,52 @@

4.2.1 Extract sst

We have already read in the complete sst data and transformed the longitude coordinates to the -180 to 180 range. Now we extract, specifying which variable in bkg is mapped to the time domain in sst - in our case the newly computed month_id matches the time dimension in sst.

-
-
sst_values = stars::st_extract(sst, bkg, time_column = 'month_id')
+
+
sst_values = stars::st_extract(sst, bkg, time_column = 'month_id')

4.2.2 Extract wind

For wind we have three parameters (windspeed, u_wind and v_wind). Presumably windspeed is a function of u_wind and v_wind and will be correlated with them. Nonetheless, for completeness, we’ll extract all three.

-
-
windspeed_values = stars::st_extract(windspeed, bkg, time_column = "month_id")
-
-u_wind_values = stars::st_extract(u_wind, bkg, time_column = "month_id")
-
-v_wind_values = stars::st_extract(v_wind, bkg, time_column = "month_id")
+
+
windspeed_values = stars::st_extract(windspeed, bkg, time_column = "month_id")
+
+u_wind_values = stars::st_extract(u_wind, bkg, time_column = "month_id")
+
+v_wind_values = stars::st_extract(v_wind, bkg, time_column = "month_id")

4.3 Now put them together and save

Now we merge the three extractions for sst, u_wind and v_wind into one object, and then save to disk for later retrieval. It might be tempting to use merge() or one of dplyr’s join functions, but we really have an easy task as all of our extractions have the same number of records in the same order. We need only mutate bkg to include each of the extracted values.

-
-
bkg = bkg |>
-  dplyr::mutate(
-    sst = sst_values$sst,
-    windspeed = windspeed_values$windspeed,
-    u_wind = u_wind_values$u_wind,
-    v_wind = v_wind_values$v_wind) |>
-  sf::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg"))
+
+
bkg = bkg |>
+  dplyr::mutate(
+    sst = sst_values$sst,
+    windspeed = windspeed_values$windspeed,
+    u_wind = u_wind_values$u_wind,
+    v_wind = v_wind_values$v_wind) |>
+  sf::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg"))

4.4 Next extract observation points

It’s the same workflow to extract covariates for the observations as it was for the background, but let’s not forget to add in a variable to identify the month that matches those in the predictors.

-
-
obs = dplyr::mutate(obs, month_id = oisster::current_month(date))
-sst_values = stars::st_extract(sst, obs, time_column = 'month_id')
-windspeed_values = stars::st_extract(windspeed, obs, time_column = "month_id")
-u_wind_values = stars::st_extract(u_wind, obs, time_column = "month_id")
-v_wind_values = stars::st_extract(v_wind, obs, time_column = "month_id")
-
-obs = obs |>
-  dplyr::mutate(
-    sst = sst_values$sst,
-    windspeed = windspeed_values$windspeed,
-    u_wind = u_wind_values$u_wind,
-    v_wind = v_wind_values$v_wind) |>
-  sf::write_sf(file.path("data", "obs", "obs-covariates.gpkg"))
+
+
obs = dplyr::mutate(obs, month_id = oisster::current_month(date))
+sst_values = stars::st_extract(sst, obs, time_column = 'month_id')
+windspeed_values = stars::st_extract(windspeed, obs, time_column = "month_id")
+u_wind_values = stars::st_extract(u_wind, obs, time_column = "month_id")
+v_wind_values = stars::st_extract(v_wind, obs, time_column = "month_id")
+
+obs = obs |>
+  dplyr::mutate(
+    sst = sst_values$sst,
+    windspeed = windspeed_values$windspeed,
+    u_wind = u_wind_values$u_wind,
+    v_wind = v_wind_values$v_wind) |>
+  sf::write_sf(file.path("data", "obs", "obs-covariates.gpkg"))

That’s it! Next we can start assembling a model.

diff --git a/docs/covariates_files/figure-html/unnamed-chunk-11-1.png b/docs/covariates_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 0000000..d94171e Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-13-1.png b/docs/covariates_files/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 0000000..b691fa4 Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-14-1.png b/docs/covariates_files/figure-html/unnamed-chunk-14-1.png index 6eb957f..6a05a1b 100644 Binary files a/docs/covariates_files/figure-html/unnamed-chunk-14-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-6-1.png b/docs/covariates_files/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 0000000..9529325 Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-7-1.png b/docs/covariates_files/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..d1e6a17 Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-8-1.png b/docs/covariates_files/figure-html/unnamed-chunk-8-1.png index d1e6a17..2febb0c 100644 Binary files a/docs/covariates_files/figure-html/unnamed-chunk-8-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/covariates_files/figure-html/unnamed-chunk-9-1.png b/docs/covariates_files/figure-html/unnamed-chunk-9-1.png index 2febb0c..cd36256 100644 Binary files a/docs/covariates_files/figure-html/unnamed-chunk-9-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/search.json 
b/docs/search.json index cbd7a91..03bb979 100644 --- a/docs/search.json +++ b/docs/search.json @@ -172,7 +172,7 @@ "href": "covariates.html#sampling-background-data", "title": "Covariates for observations and background", "section": "3 Sampling background data", - "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Observations\",\n xlab = \"Date\")\n\n\n\n\nEmbedded in the returned value, H, are the probability densities and dates for the start of each month.\n\nmonths = H$breaks + as.Date(\"1970-01-01\")\nprobs = H$density\n\nWe can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.\n\ndays = seq(from = min(obs$date), to = max(obs$date), by = \"day\")\n\nNext we use the months and days to develop a look-up vector to assign a probability to each day. Did you catch that? The probability of selecting a given day depends upon the probability of an observation occurring in a given month.\n\nindex = findInterval(days, months)\nday_probs = probs[index]\n\nNow we can sample - but how many? Let’s start by selecting approximately twice as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. 
Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 2\ndays_sample = sample(days, size = nback, replace = TRUE, prob = day_probs)\n\nSo, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.\n\n\n\n\n\n\nWarning\n\n\n\nIt is possible that we maybe overfitting by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |>\n sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. 
If you don’t you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |>\n sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.\n\npoly = sf::st_union(obs) |>\n sf::st_buffer(dist = 75000) |>\n sf::st_union() |>\n sf::st_simplify() |>\n smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow to sample the within the polygon, we’ll sample the same number we selected earlier.\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! We still have points on land, but most are not. 
The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map the distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |>\n dplyr::slice(along = \"time\", 1) |>\n stars::st_extract(bkg) |>\n sf::st_as_sf() |>\n dplyr::mutate(is_water = !is.na(sst), .before = 1) |>\n dplyr::glimpse()\n\nRows: 17,122\nColumns: 3\n$ is_water <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,…\n$ sst <dbl> 14.506774, NA, 5.303548, 6.633548, 2.418387, 4.907097, 15.954…\n$ geometry <POINT [°]> POINT (-66.82782 39.91825), POINT (-64.0782 45.61369), …\n\n\nValues where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |>\n sf::st_set_geometry(\"geometry\") |>\n dplyr::mutate(date = days_sample, .before = 1) |>\n dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)" + "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. 
We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Observations\",\n xlab = \"Date\")\n\n\n\n\nWe could use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn’t to mimic the distrubution of observations in time, but instead to characterize the environment. So, instead we’ll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date plus a buffer of about 1 month.\n\nn_buffer_days = 30\ndays = seq(from = min(obs$date) - n_buffer_days, \n to = max(obs$date) + n_buffer_days, \n by = \"day\")\n\nNow we can sample - but how many? Let’s start by selecting approximately four times as many background points as we have observation points. If it is too many then we can sub-sample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some samples in the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 4\ndays_sample = sample(days, size = nback, replace = TRUE)\n\nNow we can plot the same histogram, but with the days_sample data.\n\nH = hist(days_sample, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Sample\",\n xlab = \"Date\")\n\n\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. 
We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |>\n sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is the one often described as a rubber band stretched around the point locations. The key here is to take the union of the points first, which creates a single MULTIPOINT object. If you don’t, you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |>\n sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations, and the hull covers most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points.
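Incidentally, the convex-hull idea from the previous section can be illustrated without sf at all: base R’s `chull()` (in grDevices) returns the indices of the hull vertices for a set of x-y points. A toy sketch with random points:

```r
# Toy illustration of the "rubber band" convex hull using base R only.
# chull() returns the indices of the points that form the hull, in order.
set.seed(42)
x = runif(25)
y = runif(25)

hull_idx = chull(x, y)
hull_x = x[hull_idx]   # coordinates of the rubber-band vertices
hull_y = y[hull_idx]
```

`sf::st_convex_hull()` does the same job for sf geometries, with the added requirement (noted above) that the points be unioned into a single MULTIPOINT first.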
We arrived at a buffering distance of 75000 m through trial and error, and then add a smoothing step for no reason other than to improve aesthetics.\n\npoly = sf::st_union(obs) |>\n sf::st_buffer(dist = 75000) |>\n sf::st_union() |>\n sf::st_simplify() |>\n smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll oversample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow, to sample within the polygon, we’ll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes).\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! We still have some points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map that distinguishes between land and sea - like we do with sst.
We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |>\n dplyr::slice(along = \"time\", 1) |>\n stars::st_extract(bkg) |>\n sf::st_as_sf() |>\n dplyr::mutate(is_water = !is.na(sst), .before = 1) |>\n dplyr::glimpse()\n\nRows: 34,244\nColumns: 3\n$ is_water <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,…\n$ sst <dbl> 12.848709, 5.922903, 8.960322, 8.053871, NA, 3.865484, NA, 15…\n$ geometry <POINT [°]> POINT (-65.47837 40.75635), POINT (-65.71588 42.76753),…\n\n\nValues where sst is NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object with the random dates (days_sample), and filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |>\n sf::st_set_geometry(\"geometry\") |>\n dplyr::mutate(date = days_sample, .before = 1) |>\n dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)" }, { "objectID": "covariates.html#extract-environmental-covariates-for-sst-and-wind",
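The land/water purge above boils down to a logical mask built from the NA values. A self-contained base-R sketch, using toy sst values (NA standing in for land or very-nearshore cells) and hypothetical sampled dates:

```r
# Toy version of the NA-means-land filter applied to extracted sst values.
sst_vals = c(12.8, 5.9, NA, 8.1, NA, 3.9)   # NA ~ land or very nearshore
dates    = as.Date("2020-01-01") + 0:5       # hypothetical sampled dates

is_water = !is.na(sst_vals)                  # the mask used for filtering
bkg_dates  = dates[is_water]                 # keep only water points
bkg_values = sst_vals[is_water]
```

In the sf version above, the same mask (`v$is_water`) drives `dplyr::filter()`, dropping each background point whose extracted sst is NA along with its paired date.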