Switch to unweighted time sampling

BigelowLab · Oct 16, 2023 · commit 62cea45
1 parent b552644

Showing 20 changed files with 139 additions and 159 deletions.
diff --git a/covariates.qmd b/covariates.qmd
@@ -1,6 +1,6 @@
 ---
 title: "Covariates for observations and background"
-cache: true
+cache: false
 ---
 
 Here we do the multi-step task of associating observations with environmental covariates and creating a background point data set used by the model to characterize the environment.
@@ -86,43 +86,35 @@ H = hist(obs$date, breaks = 'month', format = "%Y",
      freq = TRUE, main = "Observations",
      xlab = "Date")
 ```
-Embedded in the returned value, `H`, are the probability densities and dates for the start of each month. 
 
-```{r}
-months = H$breaks + as.Date("1970-01-01")
-probs = H$density
-```
-
-We can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.
+We **could** use weighted sampling so that the background sample follows the temporal distribution of the observations. But the purpose of the sampling isn't to mimic the distribution of observations in time; it is to characterize the environment. So instead we'll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date **plus** a buffer of about 1 month (30 days) on each end.
 
 ```{r}
-days = seq(from = min(obs$date), to = max(obs$date), by = "day")
+n_buffer_days = 30
+days = seq(from = min(obs$date) - n_buffer_days, 
+           to = max(obs$date) + n_buffer_days, 
+           by = "day")
 ```
 
-Next we use the `months` and `days` to develop a look-up vector to assign a probability to each day. *Did you catch that?*  The probability of selecting a given **day** depends upon the probability of an observation occurring in a given **month**.
-
-```{r}
-index = findInterval(days, months)
-day_probs = probs[index]
-```
-
-Now we can sample - **but how many**?  Let's start by selecting approximately **twice** as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn't enough we can come back an increase the number.  In addition, we may lose some is the subsequent steps making a spatial sample.
+Now we can sample - **but how many**?  Let's start by selecting approximately **four times** as many background points as we have observation points. If that is too many we can sub-sample as needed; if it isn't enough we can come back and increase the number.  In addition, we may lose some samples in the subsequent steps when making the spatial sample.
 
 :::{.callout-note}
 Note that we set the random number generator seed. This isn't a requirement, but we use it here so that we get the same random selection each time we render the page.  Here's a nice discussion about `set.seed()` [usage](https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function).
 :::
 
 ```{r}
 set.seed(1234)
-nback = nrow(obs) * 2
-days_sample = sample(days, size = nback, replace = TRUE, prob = day_probs)
+nback = nrow(obs) * 4
+days_sample = sample(days, size = nback, replace = TRUE)
 ```
 
-So, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.
+Now we can plot the same histogram, but with the `days_sample` data.
 
-:::{.callout-warning}
-It is possible that we maybe [overfitting](https://en.wikipedia.org/wiki/Overfitting) by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.
-:::
+```{r}
+H = hist(days_sample, breaks = 'month', format = "%Y", 
+     freq = TRUE, main = "Sample",
+     xlab = "Date")
+```
 
 ### Sampling space
 
@@ -185,7 +177,7 @@ sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg"))
 
 #### Sampling the polygon
 
-Now to sample the within the polygon, we'll sample the same number we selected earlier.
+Now, to sample within the polygon, we'll draw the same number of points we selected earlier. Note that we also set the same seed (for demonstration purposes).
 
 ```{r}
 set.seed(1234)

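A note on the change to `covariates.qmd` above: the following is a minimal, self-contained sketch of the unweighted time sampling this commit switches to. It assumes a synthetic `obs_dates` vector as a hypothetical stand-in for `obs$date` (the real observations are not reproduced here); `n_buffer_days`, `days`, `nback` and `days_sample` follow the names used in the diff.

```r
# Hypothetical stand-in for obs$date: 200 random dates within 2020.
set.seed(1234)
obs_dates = as.Date("2020-01-01") + sample(0:364, size = 200, replace = TRUE)

# Daily time series spanning the observations, padded by 30 days on each end.
n_buffer_days = 30
days = seq(from = min(obs_dates) - n_buffer_days,
           to = max(obs_dates) + n_buffer_days,
           by = "day")

# Unweighted sample: no `prob` argument, so every day is equally likely.
nback = length(obs_dates) * 4
days_sample = sample(days, size = nback, replace = TRUE)

# Tally the sampled dates by calendar month to inspect the spread.
counts = table(format(days_sample, "%Y-%m"))
print(counts)
```

Because every day in `days` is equally likely, the monthly counts should roughly track how many days each month contributes to the range, rather than how many observations fall in that month.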
diff --git a/covariates_files/figure-html/unnamed-chunk-11-1.png b/covariates_files/figure-html/unnamed-chunk-11-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-13-1.png b/covariates_files/figure-html/unnamed-chunk-13-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-14-1.png b/covariates_files/figure-html/unnamed-chunk-14-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-6-1.png b/covariates_files/figure-html/unnamed-chunk-6-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-7-1.png b/covariates_files/figure-html/unnamed-chunk-7-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-8-1.png b/covariates_files/figure-html/unnamed-chunk-8-1.png
diff --git a/covariates_files/figure-html/unnamed-chunk-9-1.png b/covariates_files/figure-html/unnamed-chunk-9-1.png
diff --git a/data/bkg/bkg-covariates.gpkg b/data/bkg/bkg-covariates.gpkg
diff --git a/data/bkg/buffered-polygon.gpkg b/data/bkg/buffered-polygon.gpkg
diff --git a/data/obs/obs-covariates.gpkg b/data/obs/obs-covariates.gpkg
diff --git a/docs/covariates.html b/docs/covariates.html
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-11-1.png b/docs/covariates_files/figure-html/unnamed-chunk-11-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-13-1.png b/docs/covariates_files/figure-html/unnamed-chunk-13-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-14-1.png b/docs/covariates_files/figure-html/unnamed-chunk-14-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-6-1.png b/docs/covariates_files/figure-html/unnamed-chunk-6-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-7-1.png b/docs/covariates_files/figure-html/unnamed-chunk-7-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-8-1.png b/docs/covariates_files/figure-html/unnamed-chunk-8-1.png
diff --git a/docs/covariates_files/figure-html/unnamed-chunk-9-1.png b/docs/covariates_files/figure-html/unnamed-chunk-9-1.png
diff --git a/docs/search.json b/docs/search.json
@@ -172,7 +172,7 @@
     "href": "covariates.html#sampling-background-data",
     "title": "Covariates for observations and background",
     "section": "3 Sampling background data",
-    "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n     freq = TRUE, main = \"Observations\",\n     xlab = \"Date\")\n\n\n\n\nEmbedded in the returned value, H, are the probability densities and dates for the start of each month.\n\nmonths = H$breaks + as.Date(\"1970-01-01\")\nprobs = H$density\n\nWe can use weighted sampling so we are characterizing the environment consistent with the observations. First we make a time series that extends from the first to the last observation date.\n\ndays = seq(from = min(obs$date), to = max(obs$date), by = \"day\")\n\nNext we use the months and days to develop a look-up vector to assign a probability to each day. Did you catch that? The probability of selecting a given day depends upon the probability of an observation occurring in a given month.\n\nindex = findInterval(days, months)\nday_probs = probs[index]\n\nNow we can sample - but how many? Let’s start by selecting approximately twice as many background points as we have observation points. If it is too many then we can subsample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some is the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. 
Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 2\ndays_sample = sample(days, size = nback, replace = TRUE, prob = day_probs)\n\nSo, now we have a sampling of of dates that have a temporal distribution similar to that of the observations.\n\n\n\n\n\n\nWarning\n\n\n\nIt is possible that we maybe overfitting by weighting the samples in time. Other time-sampling strategies are available to us, so we are not stuck with the approach we are using and can easily revisit this selection.\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |&gt;\n  sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. 
If you don’t you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |&gt;\n  sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.\n\npoly =  sf::st_union(obs) |&gt;\n  sf::st_buffer(dist = 75000) |&gt;\n  sf::st_union() |&gt;\n  sf::st_simplify() |&gt;\n  smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow to sample the within the polygon, we’ll sample the same number we selected earlier.\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! We still have points on land, but most are not. 
The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map the distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |&gt;\n  dplyr::slice(along = \"time\", 1) |&gt;\n  stars::st_extract(bkg) |&gt;\n  sf::st_as_sf() |&gt;\n  dplyr::mutate(is_water = !is.na(sst), .before = 1) |&gt;\n  dplyr::glimpse()\n\nRows: 17,122\nColumns: 3\n$ is_water &lt;lgl&gt; TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,…\n$ sst      &lt;dbl&gt; 14.506774, NA, 5.303548, 6.633548, 2.418387, 4.907097, 15.954…\n$ geometry &lt;POINT [°]&gt; POINT (-66.82782 39.91825), POINT (-64.0782 45.61369), …\n\n\nValues where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |&gt;\n  sf::st_set_geometry(\"geometry\") |&gt;\n  dplyr::mutate(date = days_sample, .before = 1) |&gt;\n  dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |&gt; dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)"
+    "text": "3 Sampling background data\nWe need to create a random sample of background in both time and space.\n\n3.1 Sampling time\nSampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.\n\nH = hist(obs$date, breaks = 'month', format = \"%Y\", \n     freq = TRUE, main = \"Observations\",\n     xlab = \"Date\")\n\n\n\n\nWe could use weighted sampling so we are characterizing the environment consistent with the observations. But the purpose of the sampling isn’t to mimic the distrubution of observations in time, but instead to characterize the environment. So, instead we’ll make an unweighted sample across the time range. First we make a time series that extends from the first to the last observation date plus a buffer of about 1 month.\n\nn_buffer_days = 30\ndays = seq(from = min(obs$date) - n_buffer_days, \n           to = max(obs$date) + n_buffer_days, \n           by = \"day\")\n\nNow we can sample - but how many? Let’s start by selecting approximately four times as many background points as we have observation points. If it is too many then we can sub-sample as needed, if it isn’t enough we can come back an increase the number. In addition, we may lose some samples in the subsequent steps making a spatial sample.\n\n\n\n\n\n\nNote\n\n\n\nNote that we set the random number generator seed. This isn’t a requirement, but we use it here so that we get the same random selection each time we render the page. 
Here’s a nice discussion about set.seed() usage.\n\n\n\nset.seed(1234)\nnback = nrow(obs) * 4\ndays_sample = sample(days, size = nback, replace = TRUE)\n\nNow we can plot the same histogram, but with the days_sample data.\n\nH = hist(days_sample, breaks = 'month', format = \"%Y\", \n     freq = TRUE, main = \"Sample\",\n     xlab = \"Date\")\n\n\n\n\n\n\n3.2 Sampling space\nThe sf package provides a function, st_sample(), for sampling points within a polygon. But what polygon? We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it’s advantages and disadvantages. We show how to make one of each.\n\n3.2.1 The bounding box polygon\nThis is the easiest of the three polygons to make.\n\ncoast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf')\n\nbox = sf::st_bbox(obs) |&gt;\n  sf::st_as_sfc()\n\nplot(sf::st_geometry(coast), extent = box, axes = TRUE)\nplot(box, lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nHmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.\n\n\n3.2.2 The convex hull polygon\nAlso an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations. The key here is to take the union of the points first which creates a single MULTIPOINT object. 
If you don’t you’ll get a convex hull around every point… oops.\n\nchull = sf::st_union(obs) |&gt;\n  sf::st_convex_hull()\n\nplot(sf::st_geometry(coast), extent = chull, axes = TRUE)\nplot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nWell, that’s an improvement, but we still get large areas vacant of observations and most of Nova Scotia.\n\n\n3.2.3 The buffered polygon\nAn alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the “shrink-wrap” version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.\n\npoly =  sf::st_union(obs) |&gt;\n  sf::st_buffer(dist = 75000) |&gt;\n  sf::st_union() |&gt;\n  sf::st_simplify() |&gt;\n  smoothr::smooth(method = 'chaikin', refinements = 10L)\n\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(obs), pch = \"+\", col = 'blue', add = TRUE)\n\n\n\n\nThat seems the best yet, but we still sample on land. We’ll over sample and toss out the ones on land. Let’s save this polygon in case we need it later.\n\nok = dir.create(\"data/bkg\", recursive = TRUE, showWarnings = FALSE)\nsf::write_sf(poly, file.path(\"data\", \"bkg\", \"buffered-polygon.gpkg\"))\n\n\n\n3.2.4 Sampling the polygon\nNow to sample the within the polygon, we’ll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes).\n\nset.seed(1234)\nbkg = sf::st_sample(poly, nback) \n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\nplot(sf::st_geometry(coast), add = TRUE)\n\n\n\n\nOK - we can work with that! 
We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.\n\n\n3.2.5 Purging points that are on land (or very nearshore)\nIt’s great if you have in hand a map the distinguishes between land and sea - like we do with sst. We shall extract values v from just the first sst layer (hence the slice).\n\nv = sst |&gt;\n  dplyr::slice(along = \"time\", 1) |&gt;\n  stars::st_extract(bkg) |&gt;\n  sf::st_as_sf() |&gt;\n  dplyr::mutate(is_water = !is.na(sst), .before = 1) |&gt;\n  dplyr::glimpse()\n\nRows: 34,244\nColumns: 3\n$ is_water &lt;lgl&gt; TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,…\n$ sst      &lt;dbl&gt; 12.848709, 5.922903, 8.960322, 8.053871, NA, 3.865484, NA, 15…\n$ geometry &lt;POINT [°]&gt; POINT (-65.47837 40.75635), POINT (-65.71588 42.76753),…\n\n\nValues where sst are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore). We’ll merge our bkg object and random dates (days_sample), filter to include only water.\n\nbkg = sf::st_as_sf(bkg) |&gt;\n  sf::st_set_geometry(\"geometry\") |&gt;\n  dplyr::mutate(date = days_sample, .before = 1) |&gt;\n  dplyr::filter(v$is_water)\n\nplot(sf::st_geometry(coast), extent = poly, axes = TRUE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = 'blue', add = TRUE)\n\n\n\n\nNote that the bottom of the scatter is cut off. That tells us that the sst raster has been cropped to that southern limit. We can confirm that easily.\n\nplot(sst['sst'] |&gt; dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)\nplot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)\nplot(sf::st_geometry(bkg), pch = \".\", col = \"blue\", add = TRUE)"
   },
   {
     "objectID": "covariates.html#extract-environmental-covariates-for-sst-and-wind",