source("setup.R")
= read_obis(form = "sf") |>
@@ -278,7 +278,7 @@ obs
2.2 Load the environmental predictors
Next we load the environmental predictors, sst
and wind
(as windspeed
, u_wind
and v_wind
). For each we first read in the database, then compose filenames, read in the rasters and organize into layers by date, then rename to a pretty-name and finally shift the rasters from their native 0-to-360 longitude range to longitude -180-to-180 which matches our observations.
-
+
= "data/oisst"
sst_path = oisster::read_database(sst_path) |>
sst_db ::arrange(date)
@@ -327,7 +327,7 @@ dplyr<
3.1 Sampling time
Sampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.
-
+
= hist(obs$date, breaks = 'month', format = "%Y",
H freq = TRUE, main = "Observations",
xlab = "Date")
@@ -335,21 +335,14 @@
-Embedded in the returned value, H
, are the probability densities and dates for the start of each month.
-
-= H$breaks + as.Date("1970-01-01")
- months = H$density probs
+We could use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn’t to mimic the distrubution of observations in time, but instead to characterize the environment. So, instead we’ll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date plus a buffer of about 1 month.
+
+= 30
+ n_buffer_days = seq(from = min(obs$date) - n_buffer_days,
+ days to = max(obs$date) + n_buffer_days,
+ by = "day")
-We can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.
-
-= seq(from = min(obs$date), to = max(obs$date), by = "day") days
-
-Next we use the months
and days
to develop a look-up vector to assign a probability to each day. Did you catch that? The probability of selecting a given day depends upon the probability of an observation occurring in a given month.
-
-= findInterval(days, months)
- index = probs[index] day_probs
-
-Now we can sample - but how many? Let’s start by selecting approximately twice as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample.
+Now we can sample - but how many? Let’s start by selecting approximately four times as many background points as we have observation points. If it is too many then we can sub-sample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some samples in the subsequent steps making a spatial sample.
-
-set.seed(1234)
-= nrow(obs) * 2
- nback = sample(days, size = nback, replace = TRUE, prob = day_probs) days_sample
-
-So, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.
-
-
-
-Warning
-
+
+set.seed(1234)
+= nrow(obs) * 4
+ nback = sample(days, size = nback, replace = TRUE) days_sample
-
-It is possible that we maybe overfitting by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.
+Now we can plot the same histogram, but with the days_sample
data.
+
+= hist(days_sample, breaks = 'month', format = "%Y",
+ H freq = TRUE, main = "Sample",
+ xlab = "Date")
+
+
@@ -389,17 +377,17 @@
3.2.1 The bounding box polygon
This is the easiest of the three polygons to make.
-
-= rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')
- coast
-= sf::st_bbox(obs) |>
- box ::st_as_sfc()
- sf
-plot(sf::st_geometry(coast), extent = box, axes = TRUE)
-plot(box, lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+= rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')
+ coast
+= sf::st_bbox(obs) |>
+ box ::st_as_sfc()
+ sf
+plot(sf::st_geometry(coast), extent = box, axes = TRUE)
+plot(box, lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-
+
Hmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.
@@ -407,15 +395,15 @@
3.2.2 The convex hull polygon
Also an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. If you don’t you’ll get a convex hull around every point… oops.
-
-= sf::st_union(obs) |>
- chull ::st_convex_hull()
- sf
-plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
-plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+= sf::st_union(obs) |>
+ chull ::st_convex_hull()
+ sf
+plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
+plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-
+
Well, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.
@@ -423,40 +411,40 @@
3.2.3 The buffered polygon
An alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.
-
-= sf::st_union(obs) |>
- poly ::st_buffer(dist = 75000) |>
- sf::st_union() |>
- sf::st_simplify() |>
- sf::smooth(method = 'chaikin', refinements = 10L)
- smoothr
-
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
+
+= sf::st_union(obs) |>
+ poly ::st_buffer(dist = 75000) |>
+ sf::st_union() |>
+ sf::st_simplify() |>
+ sf::smooth(method = 'chaikin', refinements = 10L)
+ smoothr
+
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
-
+
That seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.
-
-= dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
- ok ::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg")) sf
+
+= dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
+ ok ::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg")) sf
3.2.4 Sampling the polygon
-Now to sample the within the polygon, we’ll sample the same number we selected earlier.
-
-set.seed(1234)
-= sf::st_sample(poly, nback)
- bkg
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
-plot(sf::st_geometry(coast), add = TRUE)
+Now to sample the within the polygon, we’ll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes).
+
+set.seed(1234)
+= sf::st_sample(poly, nback)
+ bkg
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
+plot(sf::st_geometry(coast), add = TRUE)
-
+
OK - we can work with that! We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.
@@ -464,42 +452,42 @@ <
3.2.5 Purging points that are on land (or very nearshore)
It’s great if you have in hand a map the distinguishes between land and sea - like we do with sst
. We shall extract values v
from just the first sst
layer (hence the slice).
-
-= sst |>
- v ::slice(along = "time", 1) |>
- dplyr::st_extract(bkg) |>
- stars::st_as_sf() |>
- sf::mutate(is_water = !is.na(sst), .before = 1) |>
- dplyr::glimpse() dplyr
+
+= sst |>
+ v ::slice(along = "time", 1) |>
+ dplyr::st_extract(bkg) |>
+ stars::st_as_sf() |>
+ sf::mutate(is_water = !is.na(sst), .before = 1) |>
+ dplyr::glimpse() dplyr
-Rows: 17,122
+Rows: 34,244
Columns: 3
-$ is_water <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,…
-$ sst <dbl> 14.506774, NA, 5.303548, 6.633548, 2.418387, 4.907097, 15.954…
-$ geometry <POINT [°]> POINT (-66.82782 39.91825), POINT (-64.0782 45.61369), …
+$ is_water <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,…
+$ sst <dbl> 12.848709, 5.922903, 8.960322, 8.053871, NA, 3.865484, NA, 15…
+$ geometry <POINT [°]> POINT (-65.47837 40.75635), POINT (-65.71588 42.76753),…
Values where sst
are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg
object and random dates (days_sample
), filter to include only water.
-
-= sf::st_as_sf(bkg) |>
- bkg ::st_set_geometry("geometry") |>
- sf::mutate(date = days_sample, .before = 1) |>
- dplyr::filter(v$is_water)
- dplyr
-plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
+
+= sf::st_as_sf(bkg) |>
+ bkg ::st_set_geometry("geometry") |>
+ sf::mutate(date = days_sample, .before = 1) |>
+ dplyr::filter(v$is_water)
+ dplyr
+plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
-
+
Note that the bottom of the scatter is cut off. That tells us that the sst
raster has been cropped to that southern limit. We can confirm that easily.
-
-plot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
-plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
-plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)
+
+plot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
+plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
+plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)
-
+
@@ -510,8 +498,8 @@
4.1 Wait, what about dates?
You may have considered already an issue connecting our background points which have daily dates with our covariates which are monthly (identified by the first of each month.) We can manage that by adding a second date, month_id
, to the bkg
table. We’ll use the current_month()
function from the oisster package to compute that.
-
-= dplyr::mutate(bkg, month_id = oisster::current_month(date)) bkg
+
+= dplyr::mutate(bkg, month_id = oisster::current_month(date)) bkg
@@ -519,52 +507,52 @@
4.2.1 Extract sst
Now we need to read in the complete sst
data. We have already read in all of the sst
data, and then transform the longitude coordinates to the -180 to 180 range. Then we extract, specifying which variable in bkg
is mapped to the time domain in sst
- in our case the newly computed month_id
matches the time
dimension in sst
.
-
-= stars::st_extract(sst, bkg, time_column = 'month_id') sst_values
+
+= stars::st_extract(sst, bkg, time_column = 'month_id') sst_values
4.2.2 Extract wind
For wind we have three parameters (windspeed
, u_wind
and v_wind
). Presumably windspeed
is a function of u_wind
and v_wind
and will be correlated with them. Nonetheless, for completeness, we’ll extract all three.
-
-= stars::st_extract(windspeed, bkg, time_column = "month_id")
- windspeed_values
-= stars::st_extract(u_wind, bkg, time_column = "month_id")
- u_wind_values
-= stars::st_extract(v_wind, bkg, time_column = "month_id") v_wind_values
+
+= stars::st_extract(windspeed, bkg, time_column = "month_id")
+ windspeed_values
+= stars::st_extract(u_wind, bkg, time_column = "month_id")
+ u_wind_values
+= stars::st_extract(v_wind, bkg, time_column = "month_id") v_wind_values
4.3 Now put them together and save
Now we merge the three extractions for sst
, u_wind
and v_wind
into one object, and then save to disk for later retrieval. It might be tempting to use merge()
or one of dplyr’s join functions, but we really have an easy task as all of our extractions have the same number of records in the same order. We need only mutate bkg
to include each of the extracted values.
-
-= bkg |>
- bkg ::mutate(
- dplyrsst = sst_values$sst,
- windspeed = windspeed_values$windspeed,
- u_wind = u_wind_values$u_wind,
- v_wind = v_wind_values$v_wind) |>
- ::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) sf
+
+= bkg |>
+ bkg ::mutate(
+ dplyrsst = sst_values$sst,
+ windspeed = windspeed_values$windspeed,
+ u_wind = u_wind_values$u_wind,
+ v_wind = v_wind_values$v_wind) |>
+ ::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) sf
4.4 Next extract observation points
It’s the same workflow to extract covariates for the observations as it was for the background, but let’s not forget to add in a variable to identify the month that matches those in the predictors.
-
-= dplyr::mutate(obs, month_id = oisster::current_month(date))
- obs = stars::st_extract(sst, obs, time_column = 'month_id')
- sst_values = stars::st_extract(windspeed, obs, time_column = "month_id")
- windspeed_values = stars::st_extract(u_wind, obs, time_column = "month_id")
- u_wind_values = stars::st_extract(v_wind, obs, time_column = "month_id")
- v_wind_values
-= obs |>
- obs ::mutate(
- dplyrsst = sst_values$sst,
- windspeed = windspeed_values$windspeed,
- u_wind = u_wind_values$u_wind,
- v_wind = v_wind_values$v_wind) |>
- ::write_sf(file.path("data", "obs", "obs-covariates.gpkg")) sf
+
+= dplyr::mutate(obs, month_id = oisster::current_month(date))
+ obs = stars::st_extract(sst, obs, time_column = 'month_id')
+ sst_values = stars::st_extract(windspeed, obs, time_column = "month_id")
+ windspeed_values = stars::st_extract(u_wind, obs, time_column = "month_id")
+ u_wind_values = stars::st_extract(v_wind, obs, time_column = "month_id")
+ v_wind_values
+= obs |>
+ obs ::mutate(
+ dplyrsst = sst_values$sst,
+ windspeed = windspeed_values$windspeed,
+ u_wind = u_wind_values$u_wind,
+ v_wind = v_wind_values$v_wind) |>
+ ::write_sf(file.path("data", "obs", "obs-covariates.gpkg")) sf
That’s it! Next we can start assembling a model.
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-11-1.png b/docs/covariates_files/figure-html/unnamed-chunk-11-1.png
new file mode 100644
index 0000000..d94171e
Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-11-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-13-1.png b/docs/covariates_files/figure-html/unnamed-chunk-13-1.png
new file mode 100644
index 0000000..b691fa4
Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-13-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-14-1.png b/docs/covariates_files/figure-html/unnamed-chunk-14-1.png
index 6eb957f..6a05a1b 100644
Binary files a/docs/covariates_files/figure-html/unnamed-chunk-14-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-14-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-6-1.png b/docs/covariates_files/figure-html/unnamed-chunk-6-1.png
new file mode 100644
index 0000000..9529325
Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-6-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-7-1.png b/docs/covariates_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..d1e6a17
Binary files /dev/null and b/docs/covariates_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-8-1.png b/docs/covariates_files/figure-html/unnamed-chunk-8-1.png
index d1e6a17..2febb0c 100644
Binary files a/docs/covariates_files/figure-html/unnamed-chunk-8-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-8-1.png differ
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-9-1.png b/docs/covariates_files/figure-html/unnamed-chunk-9-1.png
index 2febb0c..cd36256 100644
Binary files a/docs/covariates_files/figure-html/unnamed-chunk-9-1.png and b/docs/covariates_files/figure-html/unnamed-chunk-9-1.png differ
diff --git a/docs/search.json b/docs/search.json
index cbd7a91..03bb979 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -172,7 +172,7 @@
"href": "covariates.html#sampling-background-data",
"title": "Covariates for observations and background",
"section": "3 Sampling background data",
- "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Observations\",\n xlab = \"Date\")\n\n\n\n\nEmbedded in the returned value, H, are the probability densities and dates for the start of each month.\n\nmonths = H$breaks + as.Date(\"1970-01-01\")\nprobs = H$density\n\nWe can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.\n\ndays = seq(from = min(obs$date), to = max(obs$date), by = \"day\")\n\nNext we use the months and days to develop a look-up vector to assign a probability to each day. Did you catch that? The probability of selecting a given day depends upon the probability of an observation occurring in a given month.\n\nindex = findInterval(days, months)\nday_probs = probs[index]\n\nNow we can sample - but how many? Let’s start by selecting approximately twice as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 2\ndays_sample = sample(days, size = nback, replace = TRUE, prob = day_probs)\n\nSo, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.\n\n\n\n\n\n\nWarning\n\n\n\nIt is possible that we maybe overfitting by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |>\n sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. If you don’t you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |>\n sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.\n\npoly = sf::st_union(obs) |>\n sf::st_buffer(dist = 75000) |>\n sf::st_union() |>\n sf::st_simplify() |>\n smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow to sample the within the polygon, we’ll sample the same number we selected earlier.\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map the distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |>\n dplyr::slice(along = \"time\", 1) |>\n stars::st_extract(bkg) |>\n sf::st_as_sf() |>\n dplyr::mutate(is_water = !is.na(sst), .before = 1) |>\n dplyr::glimpse()\n\nRows: 17,122\nColumns: 3\n$ is_water <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,…\n$ sst <dbl> 14.506774, NA, 5.303548, 6.633548, 2.418387, 4.907097, 15.954…\n$ geometry <POINT [°]> POINT (-66.82782 39.91825), POINT (-64.0782 45.61369), …\n\n\nValues where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |>\n sf::st_set_geometry(\"geometry\") |>\n dplyr::mutate(date = days_sample, .before = 1) |>\n dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)"
+ "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Observations\",\n xlab = \"Date\")\n\n\n\n\nWe could use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn’t to mimic the distrubution of observations in time, but instead to characterize the environment. So, instead we’ll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date plus a buffer of about 1 month.\n\nn_buffer_days = 30\ndays = seq(from = min(obs$date) - n_buffer_days, \n to = max(obs$date) + n_buffer_days, \n by = \"day\")\n\nNow we can sample - but how many? Let’s start by selecting approximately four times as many background points as we have observation points. If it is too many then we can sub-sample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some samples in the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 4\ndays_sample = sample(days, size = nback, replace = TRUE)\n\nNow we can plot the same histogram, but with the days_sample data.\n\nH = hist(days_sample, breaks = 'month', format = \"%Y\", \n freq = TRUE, main = \"Sample\",\n xlab = \"Date\")\n\n\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |>\n sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. If you don’t you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |>\n sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.\n\npoly = sf::st_union(obs) |>\n sf::st_buffer(dist = 75000) |>\n sf::st_union() |>\n sf::st_simplify() |>\n smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow to sample the within the polygon, we’ll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes).\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map the distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |>\n dplyr::slice(along = \"time\", 1) |>\n stars::st_extract(bkg) |>\n sf::st_as_sf() |>\n dplyr::mutate(is_water = !is.na(sst), .before = 1) |>\n dplyr::glimpse()\n\nRows: 34,244\nColumns: 3\n$ is_water <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,…\n$ sst <dbl> 12.848709, 5.922903, 8.960322, 8.053871, NA, 3.865484, NA, 15…\n$ geometry <POINT [°]> POINT (-65.47837 40.75635), POINT (-65.71588 42.76753),…\n\n\nValues where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |>\n sf::st_set_geometry(\"geometry\") |>\n dplyr::mutate(date = days_sample, .before = 1) |>\n dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)"
},
{
"objectID": "covariates.html#extract-environmental-covariates-for-sst-and-wind",
-