---
title: "`library(renv)`"
abstract: "This section introduces environment management with a focus on `library(renv)`."
format:
  html:
    code-line-numbers: true
    fig-align: "center"
editor_options:
  chunk_output_type: console
bibliography: references.bib
---
![Lácar Lake, of glacial origin, in the province of Neuquén, Argentina](images/LAGO_LACAR.jpg)
```{r hidden-here-load}
#| include: false
exercise_number <- 1
```
```{r}
#| echo: false
#| warning: false
library(tidyverse)
library(gt)
source("src/motivation.R")
```
```{r}
#| label: tbl-roadmap
#| tbl-cap: "Opinionated Analysis Development"
#| echo: false
motivation |>
  filter(!is.na(Section), Section == "Environment Management") |>
  select(-`Analysis Feature`) |>
  arrange(Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  tab_footnote(
    footnote = "Added by Aaron R. Williams",
    locations = cells_column_labels(columns = c(Tool, Section))
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
## Problem Definition
Every time we run a line of R code or Python code, we rely on an entire stack of software and hardware that affects how our line of R code or Python code runs.
::: callout-tip
## Computing Environment
The packages, software, and hardware that support running code for an analysis.
:::
Changing computing environments can lead to many issues with reproducibility.
- Python and R packages can change in ways that change the results of an analysis. The packages may change because the authors
    - introduce new functionality
    - improve the package interface
    - discover and fix bugs
- Beneath packages, Python or R can change in ways that change the results of an analysis.
- Alongside programming languages, compilers, linear algebra libraries, and even the computer operating system can change.
- Hardware can change in ways that change the results of an analysis.
Each example above corresponds to a layer of a computing environment.
- package
- system
- hardware
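As a rough illustration, base R can report on each of these layers for the current session. The sketch below uses only functions that ship with R. The output differs from machine to machine, which is exactly the point.

```{r}
#| eval: false
# Package layer: attached packages and the version of one installed package
session <- sessionInfo()
names(session$otherPkgs)   # attached non-base packages (NULL if none)
packageVersion("stats")

# System layer: R version, operating system, and linear algebra library
R.version.string
session$running            # description of the operating system
session$BLAS               # path to the BLAS library R is using

# Hardware layer: CPU architecture reported by the operating system
Sys.info()[["machine"]]
```

Saving this output next to an analysis is a lightweight way to record the environment, even without any other tooling.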
> Ignoring the readiness of the data science environment results in the dreaded *it works on my machine* phenomenon with a failed attempt to share code with a colleague or deploy an app to production. \~ [@gold2024]
This is core to reproducibility. Unfortunately, this is where we must temper our expectations.
::: callout-warning
A perfectly reproducible environment isn't possible. We quickly hit the point of diminishing returns where system-level factors like machine precision and pseudo random processes (seeds) affect results.
:::
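Two small base R examples hint at why. Floating-point arithmetic is inexact, and pseudo-random draws repeat only when the seed and the random number generator match. The seed value below is arbitrary and chosen only for illustration.

```{r}
#| eval: false
# Machine precision: decimal fractions are stored approximately
0.1 + 0.2 == 0.3                    # FALSE with standard double precision
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compare with a tolerance instead

# Pseudo-random processes: the same seed reproduces the same draws
set.seed(20240101)
first_draw <- runif(3)
set.seed(20240101)
second_draw <- runif(3)
identical(first_draw, second_draw)  # TRUE with the same R version and RNG settings
```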
We will focus on the packages layer of our environment. It is the place where we can get the highest return on investment for our work.
- We can avoid situations where our 2018 analysis using 2018 R packages breaks using 2024 R packages.
- We can avoid situations where our 2024 R packages don't work on our friend's computer.
- We can intentionally use older versions of R packages.
- We can make it easy to move from our computer to a cloud computer where we have scalable computing power.
## Ideal Solution
What does an ideal solution look like for managing the package layer of a computing environment?
1. **Isolate:** We should be able to **isolate** the package environment at the project level. That means we can install, update, or remove packages in our current project without affecting any other projects. This means we'll have project-specific versions of packages.
2. **Document:** We should be able to document our package environment so the package environment is **reproducible**.
3. **Share:** Our environment should be **portable**. More precisely, we should be able to share documentation about our environment so someone else (or our future selves) can recreate the environment.
::: callout-tip
## Library
A **library** is a folder that contains installed packages. Libraries can hold at most one version of a package.
:::
Running the `.libPaths()` function shows the location of the library used by an R session.
By default, R's library will be the system library. We're interested in creating a project-specific library.
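A quick way to see which libraries a session uses, sketched with base R only. The paths and counts depend on how R was installed on a given machine.

```{r}
#| eval: false
# Libraries the current session searches, in order of priority
libs <- .libPaths()
libs

# install.packages() installs into the first library by default
libs[1]

# Number of packages installed in each library
vapply(libs, function(lib) nrow(installed.packages(lib.loc = lib)), integer(1))
```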
::: callout-tip
## State
The condition of a computing environment at a point in time.
:::
::: callout-tip
## Repository
A **repository** is a source of packages. The most popular repository is the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/), which we typically use when we run `install.packages()`.
:::
We're interested in documenting the repository used to install each package.
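The repositories a session will use are stored in the `repos` option. A minimal sketch, using the CRAN URL from the definition above; many setups point to a mirror or to Posit Package Manager instead.

```{r}
#| eval: false
# Repositories the current session uses for install.packages()
getOption("repos")

# Point the session at CRAN explicitly
options(repos = c(CRAN = "https://cran.r-project.org/"))
getOption("repos")[["CRAN"]]
```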
::: {.callout-tip}
## Virtual environment
A virtual environment is a collection of packages and software that supports a project and is isolated from other projects.
:::
We will use `library(renv)` to create a virtual environment to track the state of our computing environment with a project-specific library with packages deliberately installed from specific repositories.
## `library(renv)`
`library(renv)`, short for **r**eproducible **env**ironment, allows us to create project-specific virtual environments.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Install `renv` with `install.packages("renv")`.
2. Run the function `.libPaths()` at the console.
:::
Our workflow will have three steps.
### 1. Isolate
::: callout-note
## Isolate
We should be able to **isolate** the package environment at the project level. That means we can install, update, or remove packages in our current project without affecting any other projects. This means we'll have project-specific versions of packages.
:::
The `init()` function creates a project-specific virtual environment with a project-specific library. This means the R session will use packages from a project library instead of a system library. Running `init()` creates three new items in a project:
- `renv/library/` is the project-specific library.
- `renv.lock` contains metadata that describes the project-specific library.
- `.Rprofile` is a hidden file that R runs at startup. It isn't specific to `library(renv)`, but in this case it tells R to use the project library instead of the system library.
At first, this library won't have anything in it. This is a little extra work! But we can use `install()` and `update()` to add R packages to the project-specific library.
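A minimal sketch of this step, assuming `renv` is already installed and the working directory is the project root. The pinned version number is only an illustration; any version available from the repository could be used.

```{r}
#| eval: false
# Create renv/library/, renv.lock, and .Rprofile for this project
renv::init()

# The first library path should now point inside the project
.libPaths()

# Install packages into the project library
renv::install("dplyr")          # latest available version
renv::install("dplyr@1.1.4")    # a specific version (illustrative)

# Update packages that are already in the project library
renv::update()
```

The `package@version` form is one way to use an older version of a package on purpose without touching any other project.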
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Run `install.packages("dplyr")` and then `library(dplyr)`. The `dplyr` package should load.
2. Run `renv::init()`. You will get a long message.
3. Run `library(dplyr)`
4. Run `.libPaths()`
:::
### 2. Document
::: callout-note
## Document
We should be able to document our package environment so the package environment is **reproducible**.
:::
The `snapshot()` function documents the current project environment by updating metadata in the `renv.lock` file. It records the version and source of each package the project uses so the lockfile represents the current state of dependencies in the project. In an interactive session, if a package is used in the code but isn't installed, `snapshot()` may offer to install it before writing the lockfile.
The lockfile (`renv.lock`) contains JSON that documents the current state of dependencies for a project. For instance, if we install and use `library(palmerpenguins)`, the lockfile will look like:
```
{
  "R": {
    "Version": "4.3.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.posit.co/cran/latest"
      }
    ]
  },
  "Packages": {
    "palmerpenguins": {
      "Package": "palmerpenguins",
      "Version": "0.1.1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Requirements": [
        "R"
      ],
      "Hash": "6c6861efbc13c1d543749e9c7be4a592"
    },
    "renv": {
      "Package": "renv",
      "Version": "1.0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Requirements": [
        "utils"
      ],
      "Hash": "397b7b2a265bc5a7a06852524dabae20"
    }
  }
}
```
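The lockfile can also be inspected from R. This sketch assumes a `renv.lock` like the one above sits in the project root. It uses `renv::lockfile_read()`, which returns the lockfile as a nested list.

```{r}
#| eval: false
lock <- renv::lockfile_read("renv.lock")

# R version recorded when the lockfile was written
lock$R$Version

# Recorded version of every package in the lockfile
vapply(lock$Packages, function(pkg) pkg$Version, character(1))
```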
The `status()` function reports whether the project library, the lockfile, and the packages used in the project's code are in sync.
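A typical check-then-record loop might look like the sketch below, assuming an active renv project. The messages printed depend on the state of the project.

```{r}
#| eval: false
# Compare the lockfile, the project library, and the packages used in code
renv::status()

# Record the current state of the project library in renv.lock
renv::snapshot()

# Confirm that nothing is out of sync after the snapshot
renv::status()
```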
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Run `renv::status()`.
2. Run `renv::snapshot()`.
3. Create an R script. Load `library(dplyr)`.
4. Run `renv::status()`.
5. Run `renv::snapshot()`.
:::
### 3. Share
::: {.callout-note}
## Share
We should be able to share documentation about our environment so someone else (or our future selves) can recreate the environment.
:::
The `restore()` function will use files created by `snapshot()` to recreate a project environment.
We won't actually directly run this function often. If we share a project that uses `renv`, RStudio should automatically ask us if we want to download and install the documented packages using `restore()`.
We likely won't need to use `restore()` on the computer where `init()` was run. Rather, we should see a message like this when opening the project's `.Rproj` file:
```
Project '~/presentations/test-project' loaded. [renv 1.0.7]
```
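If the packages are not installed automatically, a collaborator can rebuild the library by hand. A minimal sketch, assuming the project was cloned with its `renv.lock`, `.Rprofile`, and `renv/activate.R`:

```{r}
#| eval: false
# Install the package versions recorded in renv.lock into the project library
renv::restore()

# In a non-interactive script, skip the confirmation prompt
renv::restore(prompt = FALSE)

# Check that the library now matches the lockfile
renv::status()
```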
We'll need to commit multiple files to Git to share an renv virtual project environment:
- `renv.lock`
- `.Rprofile`
- `renv/settings.json`
- `renv/activate.R`
If a Git repository has already been initialized, then `init()` will automatically add files that *should not* be shared to the .gitignore in the `renv/` folder:
- `renv/library/`
- Any other folder in `renv/`
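Before committing, it can help to check that the files to share actually exist. This sketch uses only base R and assumes the working directory is the project root.

```{r}
#| eval: false
# Files that should be committed to Git
shared_files <- c("renv.lock", ".Rprofile", "renv/settings.json", "renv/activate.R")

# TRUE for each file that exists in the project
found <- file.exists(shared_files)
setNames(found, shared_files)

# List any file that is missing
shared_files[!found]
```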
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Clone the [penguins-analysis GitHub repository](https://github.com/awunderground/penguin-analysis).
2. Open up the project and confirm `library(renv)` recreates the package layer of the computing environment. You *may* need to run `renv::restore()`.
3. Run the code in `analysis.R`.
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Run `renv::init()` in `example-analysis/`.
2. Install the necessary R packages with `renv::install()`.
3. Use `renv::snapshot()` to document the state of the package layer of the computing environment.
:::
### More about `library(renv)`
By default, `renv` doesn't store a full copy of each package in the project directory. Instead, it installs packages into a user-level cache shared across projects and links the project library to that cache, which saves space and install time.
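To see where renv keeps the project library and the shared user-level location it links to (its cache), the `renv::paths` helpers can be called from inside an active project. The paths returned depend on the operating system and the renv configuration.

```{r}
#| eval: false
# Project-specific library for the active project
renv::paths$library()

# Shared cache where package files are stored once and reused across projects
renv::paths$cache()
```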
`renv` doesn't acknowledge a dependency until it is used somewhere in the project! `dependencies()` will show the .R scripts and Quarto documents where dependencies are created.
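A short sketch of how the result can be explored, assuming the working directory is the project root. `dplyr` is used only as an example package name.

```{r}
#| eval: false
# Scan the project's scripts and documents for package usage
deps <- renv::dependencies()

# Packages the project uses
sort(unique(deps$Package))

# Files that use a given package
deps$Source[deps$Package == "dplyr"]
```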
`update()` updates a package that has already been installed and `remove()` removes a package that has been installed.
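For example, in an active renv project (with `dplyr` again standing in for any package):

```{r}
#| eval: false
# See which packages have newer versions without installing anything
renv::update(check = TRUE)

# Update one package, then record the change in the lockfile
renv::update("dplyr")
renv::snapshot()

# Remove a package from the project library
renv::remove("dplyr")
```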
`deactivate()` is like hitting pause on the project environment. It shifts the project to using the system library but doesn't delete any of the renv files in the directory. `activate()` is the opposite of `deactivate()`.
`renv::deactivate(clean = TRUE)` is dynamite. It shifts the project to using the system library and deletes all of the renv files. There is no going back. At this point, using renv in the project will require starting from scratch with `renv::init()`.
If the repository has a Git history, `history()` can explore past versions of the project environment and `revert()` can return to an earlier version of the project library. With this in mind, it may make sense to begin using `renv` near the beginning of a project instead of at the end.
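A hedged sketch of that workflow, assuming `renv.lock` has been committed to Git more than once. The commit hash is a placeholder. `revert()` brings back the lockfile from that commit, and `restore()` then rebuilds the library to match it.

```{r}
#| eval: false
# List the Git commits in which renv.lock changed
renv::history()

# Bring back renv.lock as it was at an earlier commit
# ("abc1234" is a placeholder, not a real commit hash)
renv::revert(commit = "abc1234")

# Reinstall the package versions recorded in the reverted lockfile
renv::restore()
```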
### Going deeper
`renv` solves environment management for the package layer of the computing environment but it doesn't help with the system layer or the hardware layer. We'll briefly cover some other tools that can help with the system layer and hardware layer.
### Conda
[Conda](https://anaconda.org/anaconda/conda) can help with the system layer in addition to managing the package layer.
> Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.
Interestingly, @gold2024 is critical of Conda in production environments:
> Conda allows you to create a virtual environment in user space on your laptop without having admin access. It’s especially useful when your machine is locked down by IT.
>
> That’s not a great fit for a production environment. Conda smashes together the language version, the package management, and, sometimes, the system library management. This is conceptually simple and easy to use, but it often goes awry in production environments. In a production environment (or a shared workbench server), I recommend people manage Python packages with a virtual environment tool like `{venv}` and manage system libraries and versions of Python with tools built for those purposes.
### Docker
::: {.callout-tip}
## Container
A container is a self-contained system for running computer software. Typically, containers are designed to be small and fit-for-use with specific analyses.
:::
Containerization is the process of creating a self-contained computer environment to run an analysis. Lars Vilhuber, the Data Editor for the American Economic Association, [has advocated for using containers in economics research](https://github.com/larsvilhuber/ssb-demo/blob/master/code/Description-docker-conf.md).
[Docker](https://www.docker.com/) is a popular container tool. Docker can manage the system layer of a computing environment and the package layer of a computing environment. Docker can:
- Specify the computer operating system
- Control system dependencies like the version of Pandoc, BLAS, and compilers
- Control the R or Python version
- Manage R packages and Python packages
- Manage the version of the code that's run
[DockerHub](https://hub.docker.com/) is a popular repository for sharing Docker images.
It's worth noting that the packages within Docker can be controlled with `renv` and the version of the code can be controlled with Git and GitHub.
The foundations of most containers are standard images. [Rocker](https://rocker-project.org/) and [Posit Images](https://github.com/rstudio/r-docker) provide useful starting images.
- [Docker 101 for data scientists](https://solutions.posit.co/envs-pkgs/environments/docker/)
- [DevOps for Data Science chapter 6](https://do4ds.com/chapters/sec1/1-6-docker.html)
- [renv + Docker](https://rstudio.github.io/renv/articles/docker.html)
### Cloud Computing
The explosion of popularity of cloud computing has expanded options for managing the hardware layer of a computing environment. [Amazon Web Services](https://aws.amazon.com/), [Google Cloud](https://cloud.google.com/), and [Microsoft Azure](https://azure.microsoft.com/en-us) provide on-demand cloud computing environments with predictable, consistent, and documentable hardware.
These cloud computing environments have out-of-pocket marginal costs, but the costs are frequently lower than the costs of maintaining on-premise computing environments like servers. They are definitely lower than the costs of maintaining old, on-premise infrastructure to reproduce old computing environments.
The following is a sensible workflow:
- Spin up a cloud instance (computer) with a specific operating system.
- Pick a Docker image from a Docker repository. This image should be close to fit-for-purpose.
- Set up Docker. Maybe set up renv. Create Dockerfiles and renv files.
- Run the project with good version control.
- Save the Dockerfiles and renv files (such as `renv.lock`) alongside the project code.
## Final Thoughts
**This is a lot!** Managing the system layer and hardware layer of a computing environment can improve the reproducibility of a project, but the rewards quickly diminish and the complexity quickly increases.
Improving project organization and documentation, literate programming, version control, programming best practices, and managing the package layer of a computing environment will almost always yield more benefits than focusing on the system layer or hardware layer of a computing environment.