forked from edzer/sdsr
-
Notifications
You must be signed in to change notification settings - Fork 0
/
09-Large.qmd
605 lines (509 loc) · 25.1 KB
/
09-Large.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
# Large data and cloud native {#sec-large}
\index{cloud native}
\index{large datasets}
This chapter describes how large spatial and spatiotemporal datasets
can be handled with R, with a focus on packages **sf** and **stars**.
For practical use, we classify large datasets as too large
* to fit in working memory,
* to fit on the local hard drive, or
* to download to locally managed infrastructure (such as network attached storage)
\index{out-of-core}
These three categories may (today) correspond very roughly to Gigabyte-,
Terabyte- and Petabyte-sized datasets. Besides size considerations,
access and processing speed also play a role,
in particular for larger datasets or interactive applications.
Cloud native geospatial formats are formats optimised with processing
on cloud infrastructure in mind, where costs of computing and
storage need to be considered and optimised. Such costs can be
reduced by
* using compression, such as the LERC (Limited Error Raster
Compression) algorithm used for cloud-optimised GeoTIFF, or the
BLOSC compressor used for ZARR array data
* fast access to spatial sub-regions, such as by ways of HTTP range
request enabled for cloud-optimised geotiffs, or column-oriented
data access in the GeoParquet and GeoArrow formats
* accessing/viewing data at incremental resolution (low-resolution images are
available before or without the full resolution being read), implemented natively
by progressive JPEG formats or by using image pyramids (or overviews)
in other raster formats
* optimising data access with particular cloud storage or object
storage protocols in mind.
It should be noted that there are no silver bullets in this area:
optimising storage for one particular access pattern will lead to slow
access for other ways. If raster data is for instance stored for
optimal access of spatial regions at different spatial resolutions,
reading the data as pixel time series may be very slow. Compression
leads to low storage and bandwidth costs but to higher processing
costs when reading, because of the decompression involved.
## Vector data: `sf` {#sec-largesf}
\index{sf!large datasets}
\index{st\_read!inside bounding box}
### Reading from local disk
Function `st_read` reads vector data from disk, using GDAL, and
then keeps the data read in working memory. In case the file is
too large to be read in working memory, several options exist to
read parts of the file. The first is to set argument `wkt_filter`
with a WKT text string containing a geometry; only geometries from
the target file that intersect with this geometry will be returned.
An example is
```{r}
library(sf)
file <- system.file("gpkg/nc.gpkg", package = "sf")
c(xmin = -82,ymin = 36, xmax = -80, ymax = 37) |>
st_bbox() |> st_as_sfc() |> st_as_text() -> bb
read_sf(file, wkt_filter = bb) |> nrow() # out of 100
```
Here, `read_sf` is used as an alternative to `st_read` to suppress
output.
\index{st\_read!query argument}
The second option is to use the `query` argument to `st_read`,
which can be any query in "OGR SQL" dialect. This can for instance be used to
select features from a layer, or limit fields. An example is:
```{r}
q <- "select BIR74,SID74,geom from 'nc.gpkg' where BIR74 > 1500"
read_sf(file, query = q) |> nrow()
```
Note that `nc.gpkg` is the _layer name_, which can be obtained from `file` using `st_layers`.
Sequences of records can be read using `LIMIT` and `OFFSET`, to read records 51-60 use
```{r}
q <- "select BIR74,SID74,geom from 'nc.gpkg' LIMIT 10 OFFSET 50"
read_sf(file, query = q) |> nrow()
```
Further query options include selection on geometry type or polygon
area. When the dataset queried is a spatial database, then the query
is passed on to the database and not interpreted by GDAL; this means
that more powerful features will be available. Further information
is found in the GDAL documentation under "OGR SQL dialect".
\index{st\_read!vsizip}
Very large files or directories that are zipped can be read
without the need to unzip them, using the `/vsizip` (for zip),
`/vsigzip` (for gzip), or `/vsitar` (for tar files) prefix to files.
This is followed by the path to the zip file, and then followed by
the file inside this zip file. Reading files this way may come at
some computational cost.
### Reading from databases, **dbplyr**
\index{sf\_read!from database connection}
```{r, echo=FALSE}
has_PG <- Sys.getenv("DB_USERNAME") != "" && Sys.getenv("DB_HOST") != "" &&
"PostgreSQL" %in% st_drivers()$name &&
!inherits(try(DBI::dbConnect(
RPostgres::Postgres(),
host = Sys.getenv("DB_HOST"),
user = Sys.getenv("DB_USERNAME"),
dbname = "postgis"), silent = TRUE), "try-error")
```
```{r echo=FALSE, eval=has_PG}
nc <- read_sf(file)
# write nc to postgis:
write_sf(nc, paste0("PG:dbname=postgis user=", Sys.getenv("DB_USERNAME"), " host=", Sys.getenv("DB_HOST")), "nc")
```
Although GDAL has support for several spatial databases, and as
mentioned above it passes on SQL in the `query` argument to the
database, it is sometimes beneficial to directly read from and
write to a spatial database using the DBI R database drivers for this.
An example of this is:
```{r, eval=has_PG}
pg <- DBI::dbConnect(
RPostgres::Postgres(),
host = Sys.getenv("DB_HOST"),
user = Sys.getenv("DB_USERNAME"),
dbname = "postgis")
read_sf(pg, query =
"select BIR74,wkb_geometry from nc limit 3") |> nrow()
```
where the database host and user name are obtained from environment
variables, and the database name is `postgis`.
A spatial query might look like
```{r, eval=has_PG}
q <- "SELECT BIR74,wkb_geometry FROM nc WHERE \
ST_Intersects(wkb_geometry, 'SRID=4267;POINT (-81.50 36.43)');"
read_sf(pg, query = q) |> nrow()
```
Here, the intersection is carried out inside the database, and uses a spatial
index when present. The same mechanism works when using `dplyr`
with a database backend:
\index{sf!dbplyr proxy}
```{r, eval=has_PG}
library(dplyr, warn.conflicts = FALSE)
nc_db <- tbl(pg, "nc")
```
Spatial queries can be formulated and are passed on to the database:
```{r, eval=has_PG}
nc_db |>
filter(ST_Intersects(wkb_geometry,
'SRID=4267;POINT (-81.50 36.43)')) |>
collect()
nc_db |> filter(ST_Area(wkb_geometry) > 0.1) |> head(3)
```
(Note that PostGIS' `ST_Area` computes the same area as
the `AREA` field in `nc`, which is the meaningless value obtained by
interpreting the coordinates as projected, where they are ellipsoidal.)
### Reading from online resources or web services {#sec-vsi}
\index{st\_read!online resources}
GDAL drivers support reading from online resources, by prepending
the URL starting with `https://` with `/vsicurl/`. A number of
similar drivers specialised for particular clouds include `/vsis3/`
for Amazon S3, `/vsigs/` for Google Cloud Storage, `/vsiaz/` for
Azure, `/vsioss/` for Alibaba Cloud, or `/vsiswift/` for OpenStack
Swift Object Storage. @sec-zarr has an example of using `/vsicurl/`.
These prepositions can be combined with `/vsizip/` to read a file from a
zipped online resource. Depending on the file format used, reading
information this way may require reading the entire file, or reading
it multiple times, and may not always be the most efficient way of
handling resources. Cloud native formats are optimised to work
efficiently using `HTTP` requests.
\index{cloud native}
\index{cloud optimised}
### APIs, OpenStreetMap
\index{st\_read!OpenStreetMap}
\index{OpenStreetMap, reading}
Typical web services for geospatial
data create data on the fly and give access to this through an API.
As an example, data from [OpenStreetMap](https://openstreetmap.org/)
can be bulk downloaded and read locally, for instance using the GDAL vector
driver, but more typical a user wants to obtain a small subset of
the data or use the data for a small query. Several R packages exist
that query OpenStreetMap data:
* Package **OpenStreetMap** [@R-OpenStreetMap] downloads data as raster tiles, typically
used as backdrop or reference for plotting other features
* Package **osmdata** [@osmdata] downloads vector data as points, lines, or polygons
in **sf** or **sp** format
* Package **osmar** (from CRAN [archive](https://cran.r-project.org/web/packages/osmar/index.html)) returns vector data, but in addition the network
topology (as an `igraph` object) that contains how road elements
are connected, and has functions for computing the shortest route
When provided with a correctly formulated API call in the URL the
highly configurable GDAL OSM driver (in `st_read`) can read an
".osm" file (xml) and return a dataset with five layers: `points`
that have significant tags, `lines` with non-area "way" features,
`multilinestrings` with "relation" features, `multipolygons` with
"relation" features , and `other_relations`. A simple and very small
bounding box query to OpenStreetMap could look like
```{r,eval=FALSE}
download.file(paste0("https://openstreetmap.org/api/0.6/map?",
"bbox=7.595,51.969,7.598,51.970"),
"data/ms.osm", method = "auto")
```
and from this file we can read the layer `lines`, and plot its
first attribute by
```{r fig-overpass}
#| fig.cap: "OpenStreetMap vector data"
#| fig.height: 5
o <- read_sf("data/ms.osm", "lines")
p <- read_sf("data/ms.osm", "multipolygons")
st_bbox(c(xmin = 7.595, ymin = 51.969,
xmax = 7.598, ymax = 51.970), crs = 'OGC:CRS84') |>
st_as_sfc() |>
plot(axes = TRUE, lwd = 2, lty = 2, cex.axis = .5)
plot(o[,1], lwd = 2, add = TRUE)
plot(st_geometry(p), border = NA, col = '#88888888', add = TRUE)
```
the result of which is shown in @fig-overpass.
The overpass API provides a more generic and powerful query
functionality to OpenStreetMap data.
### GeoParquet and GeoArrow
\index{geoparquet}
\index{parquet}
\index{geoarrow}
\index{arrow}
\index{st\_read!geoparquet}
\index{st\_read~geoarrow}
Two formats dedicated to cloud native analysis are derived from
the Apache projects Parquet and Arrow. Both provide column oriented
storage of tabular data, meaning that the reading of a single field
for many records is fast, compared to record oriented storage of
most other databases. The Geo- extensions of both involve
* a way of storing a geometry column, either in a well-known binary
or text form, or in a more efficient form where sub-geometries
are indexed up front
* storage of a coordinate reference system.
At the time of writing this book, both formats are under active
development, but drivers for reading or creating them are available in
GDAL starting from version 3.5. Both formats allow for compressed
storage. The difference is that (Geo)Parquet is more oriented
towards persistent storage, where, (Geo)Arrow is more oriented to
fast access and fast computation. Arrow can for instance be both
an in-memory and an on-disk format.
## Raster data: **stars**
A common challenge with raster datasets is not only that they come
in large files (single Sentinel-2 tiles are around 1 GB), but that
many of these files, potentially thousands or millions, are needed to address
the area and time period of interest. In 2022,
Copernicus, the program that runs all Sentinel satellites, published 160
TB of images per day. This means that a classic pattern in using R
consisting of downloading data to local disc, loading the data in
memory, and analysing it is not going to work.
\index{Google Earth Engine}
\index{Sentinel Hub}
\index{openEO cloud}
Cloud-based Earth Observation processing platforms like Google Earth
Engine [@gorelick], [Sentinel Hub](https://www.sentinel-hub.com/) or
[openEO.cloud](https://openeo.cloud) recognise this and let users
work with datasets up to the petabyte range rather easily and with
a great deal of interactivity. They share the following properties:
* computations are postponed as long as possible (lazy evaluation)
* only the data asked for by the user is being computed and returned, and nothing more
* storing intermediate results is avoided in favour of on-the-fly computations
* maps with useful results are generated and shown quickly to allow for interactive model development
This is similar to the **dbplyr** interface to databases and
cloud-based analytics environments, but differs in the aspect of
_what_ we want to see quickly: rather than the first $n$ records of a
**dbplyr** lazy table, we want a quick _overview_ of the results,
in the form of a map covering the whole area, or part of it, but
at screen resolution rather than native (observation) resolution.
\index{raster!resolution}
\index{resolution!low-resolution computation}
If for instance we want to "see" results for the United States on
a screen with 1000 $\times$ 1000 pixels, we only need to compute results
for this many pixels, which corresponds roughly to data
on a grid with 3000 m $\times$ 3000 m grid cells. For Sentinel-2
data with 10 m resolution, this means we can downsample with
a factor 300, giving 3 km $\times$ 3 km resolution. Processing,
storage, and network requirements then drop a factor $300^2 \approx 10^5$, compared
to working on the native 10 m $\times$ 10 m resolution. On the platforms
mentioned, zooming in the map triggers further computations on a
finer resolution and smaller extent.
A simple optimisation that follows these lines is how the plot method
for `stars` objects works. In the case of plotting large rasters,
it downsamples the array before it plots, drastically saving time.
The degree of downsampling is derived from the plotting region size
and the plotting resolution (pixel density). For vector devices,
such as pdf, R sets plot resolution to 75 dpi, corresponding to
0.3 mm per pixel. Enlarging plots may reveal this, but replotting
to an enlarged devices will create a plot at target density. For
`geom_stars` the user has to specify the `downsample` rate, because
the **ggplot2** does not make the device size available to that
function.
### `stars` proxy objects
\index{stars!proxy objects}
\index{read\_stars!proxy objects}
To handle datasets that are too large to fit in memory, **stars**
provides `stars_proxy` objects. To demonstrate its use, we will
use the **starsdata** package, an R data package with larger datasets
(around 1 GB total). It can be installed by
```{r eval=FALSE}
options(timeout = 600) # or larger in case of slow network
install.packages("starsdata",
repos = "http://cran.uni-muenster.de/pebesma/", type = "source")
```
We can "load" a Sentinel-2 image from it by
```{r}
library(stars) |> suppressPackageStartupMessages()
f <- paste0("sentinel/S2A_MSIL1C_20180220T105051_N0206",
"_R051_T32ULE_20180221T134037.zip")
granule <- system.file(file = f, package = "starsdata")
file.size(granule)
base_name <- strsplit(basename(granule), ".zip")[[1]]
s2 <- paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name,
".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p <- read_stars(s2, proxy = TRUE))
object.size(p)
```
and we see that this does not actually load _any_ of the pixel
values but keeps the reference to the dataset and fills the
dimensions table. (The convoluted `s2` name is needed to point
GDAL to the right file inside the `.zip` file containing 115 files
in total).
\index{stars!lazy evaluation}
\index{lazy evaluation of computations}
\newpage
The idea of a proxy object is that we can build expressions like
```{r}
p2 <- p * 2
```
but that the computations for this are postponed. Only when we
really need the data, is `p * 2` evaluated. We need data when we:
* `plot` data,
* write an object to disk, with `write_stars`, or
* explicitly load an object in memory, with `st_as_stars`
In case the entire object does not fit in memory, `plot` and
`write_stars` choose different strategies to deal with this:
* `plot` fetches only the pixels that can be seen by downsampling, rather than reading all
pixels available
* `write_stars` reads, processes, and writes data chunk-by-chunk
```{r fig-plotp, echo=!knitr::is_latex_output(), message=!knitr::is_latex_output()}
#| fig.cap: "Downsampled 10 m bands of a Sentinel-2 scene"
#| code-fold: true
plot(p)
```
Downsampling and chunking is implemented for spatially dense images,
but not for dense time series or other dense dimensions.
As an example, the output of `plot(p)`, shown in @fig-plotp
only fetches the pixels that can be seen on the plot device, rather
than the 10980 $\times$ 10980 pixels available in each band. The downsampling
ratio taken is
```{r, echo=!knitr::is_latex_output()}
#| code-fold: true
(ds <- floor(sqrt(prod(dim(p)) / prod(dev.size("px")))))
```
meaning that for every `r ds` $\times$ `r ds` sub-image in the
original image, only one pixel is read and plotted.
### Operations on proxy objects
\index{stars\_proxy!lazy operations}
Several dedicated methods are available for `stars_proxy` objects:
```{r}
methods(class = "stars_proxy")
```
We have seen `plot` and `print` in action; `dim` reads out
the dimension from the dimensions metadata table.
\index{stars\_proxy!st\_as\_stars}
\index{st\_as\_stars}
The three methods that actually fetch data are `st_as_stars`,
`plot` and `write_stars`. `st_as_stars` reads the actual data into a
`stars` object, its argument `downsample` controls the downsampling
rate. `plot` does this too, choosing an appropriate `downsample`
value from the device resolution, and plots the object. `write_stars`
writes a `stars_proxy` object to disk.
All other methods for `stars_proxy` objects do not actually operate
on the raster data but add the operations to a _to do_ list
attached to the object. Only when actual raster data are fetched,
for instance when `plot` or `st_as_stars` are called, the commands
in this list are executed.
\index{st\_crop!stars\_proxy}
`st_crop` limits the extent (area) of the raster that will be
read. `c` combines `stars_proxy` objects, but still doesn't read
any data. `adrop` drops empty dimensions and `aperm` can change dimension
order.
`write_stars` reads and processes its input chunk-wise; it has an
argument `chunk_size` that lets users control the size of spatial
chunks.
### Remote raster resources
\index{cloud-optimised GeoTIFF}
\index{GeoTIFF!cloud-optimised}
A format like "cloud-optimised GeoTIFF" (COG) has been specially
designed to be efficient and resource-friendly in many cases,
e.g., for only reading the metadata, or for only reading overviews
(low-resolution versions of the full imagery) or spatial regions
using the `/vsixxx/` mechanisms (@sec-vsi). COGs can also
be created using the GeoTIFF driver of GDAL, and setting the right
dataset creation options in a `write_stars` call.
## Very large data cubes
\index{ERA5}
\index{Copernicus}
\index{Landsat}
At some stage, datasets need to be analysed that are so large that
downloading them is no longer feasible; even when local storage would
be sufficient, network bandwidth may become limiting. Examples are
satellite image archives such as those from Landsat and Copernicus
(Sentinel-x), or model computations such as the ERA5 [@era5], a
model reanalysis of the global atmosphere, land surface, and ocean
waves from 1950 onwards. In such cases it may be most helpful to gain
access to virtual machines in a cloud that have these data available,
or to use a system that lets the user carry out computations without
having to worry about virtual machines and storage. Both options
will be discussed.
### Finding and processing assets
When working on a virtual machine on a cloud, a first task is usually
to find the assets (files) to work on. It looks attractive to obtain
a file listing, and then parse file names such as
```
S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip
```
for their metadata including the date of acquisition and the code
of the spatial tile covered. Obtaining such a file listing however
is usually computationally very demanding, as is the processing of
the result, when the number of tiles runs in the many millions.
A solution to this is to use a catalogue. The recently developed
and increasingly deployed STAC, short for _spatiotemporal asset
catalogue_, provides an API that can be used to query image
collections by properties like bounding box, date, band, and cloud
coverage. The R package **rstac** [@R-rstac] provides an R interface
to create queries and manage the information returned.
Processing the resulting files may involve creating a data cube at
a lower spatial and/or temporal resolution, from images that may
span a range of coordinate reference systems (e.g., several UTM
zones). An R package that creates a regular data cube from such a
collection of images is **gdalcubes** [@R-gdalcubes; @appel2019demand],
which can also directly use a STAC [@appelblog] to identify images.
### Cloud native storage: Zarr {#sec-zarr}
\index{Zarr}
\index{GDAL!multi-dimensional arrays}
Where COG provides cloud native storage for raster imagery, Zarr is a
format for cloud native storage of large multi-dimensional arrays. It
can be seen as a successor of NetCDF and seems to follow similar
conventions, being used by the climate and forecast communities [@cf].
Zarr "files" are really directories with sub-directories containing
compressed chunks of the data. The compression algorithm and
the chunking strategy will have an effect on how fast particular
sub-cubes can be read or written.
Function `stars::read_mdim` can read entire data cubes but has
options for reading sub-cubes by specifying for each dimension
offset, number of pixels, and the step size to read a dimension
at a lower resolution [@zarrblog]. Similarly, `stars::write_mdim`
can write multi-dimensional arrays to Zarr or NetCDF files, or other
formats that support the GDAL C++ multi-dimensional array API.
\index{read\_mdim}
\index{write\_mdim}
\index{NetCDF}
To read a remote (cloud-based) Zarr file, one needs to prepend the
URL with indicators about the format and the access protocol:
```{r}
dsn = paste0('ZARR:"/vsicurl/https://ncsa.osn.xsede.org',
'/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr"')
```
after which we can read the first 10 time steps using
```{r}
library(stars)
bounds = c(longitude = "lon_bounds", latitude = "lat_bounds")
(r = read_mdim(dsn, bounds = bounds, count = c(NA, NA, 10)))
st_bbox(r)
```
In this example,
* `NA` values for `count` indicate to get all values available for that dimension
* here, `bounds` variables needed explicit specification because the data
source did not follow the more recent CF (1.10) conventions; and
ignoring bounds would lead to a raster with a bounding box having
latitude values outside $[-90,90]$.
### APIs for data: GEE, openEO
\index{Google Earth Engine}
\index{openEO}
Platforms that do not require the management and programming of
virtual machines _in_ the cloud but provide direct access to the
imagery managed include GEE, openEO, and the climate data store.
Google Earth Engine (GEE) is a cloud platform that allows users
to compute on large amounts of Earth Observation data as well
as modelling products [@gorelick]. It has powerful analysis
capabilities, including most of the data cube operations explained
in @sec-dcoperations. It has an interface where scripts can
be written in JavaScript, and a Python interface to the same
functionality. The code of GEE is not open-source, and cannot be
extended by arbitrary user-defined functions in languages like
Python or R. R package **rgee** [@R-rgee] provides an R client
interface to GEE.
Cloud-based data cube processing platforms built entirely around
open source software are emerging, several of which using the openEO
API [@openEO]. This API allows for user-defined functions (UDFs)
written in Python or R that are being passed on through the API
and executed at the pixel level, for instance to aggregate or
reduce dimensions using custom reducers. UDFs in R represent the
data chunk to be processed as a `stars` object; in Python `xarray`
objects are used.
Other platforms include the Copernicus climate data store [@cds]
or atmosphere data store, which allow processing of atmospheric
or climate data from ECMWF, including ERA5. An R package with an
interface to both data stores is **ecmwfr** [@R-ecmwfr].
\index{ERA5}
## Exercises
Use R to solve the following exercises.
1. For the S2 image (above), find out in which order the bands are by
using `st_get_dimension_values()`, and try to find out which spectral bands / colours they correspond to.
1. Compute NDVI for the S2 image, using `st_apply` and an appropriate
`ndvi` function. Plot the result to screen, and then write the
result to a GeoTIFF. Explain the difference in runtime between
plotting and writing.
1. Plot an RGB composite of the S2 image, using the `rgb` argument
to `plot()`, and then by using `st_rgb()` first.
1. Select five random points from the bounding box of `S2`, and extract
the band values at these points; convert the object returned to an `sf`
object.
1. For the 10-km radius circle around `POINT(390000 5940000)`, use
`aggregate` to compute the mean pixel values of the S2 image
when downsampling the images with factor 30, and on the original
resolution. Compute the relative difference between the results.
1. Use `hist` to compute the histogram on the downsampled S2 image.
Also do this for each of the bands. Use `ggplot2` to compute a
single plot with all four histograms.
1. Use `st_crop` to crop the S2 image to the area covered by the 10-km circle.
Plot the results. Explore the effect of setting argument `crop = FALSE`
1. With the downsampled image, compute the logical layer where all four
bands have pixel values higher than 1000. Use a raster algebra expression
on the four bands (use `split` first), or use `st_apply` for this.