---
title: "L2TDatabase"
output:
github_document:
toc: true
toc_depth: 2
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE, warning = FALSE}
library("dplyr", warn.conflicts = FALSE)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

# Override dplyr/dbplyr print methods to obscure the server address
db_desc.MySQLConnection <- function(x) {
  info <- DBI::dbGetInfo(x)
  info$user <- "demo_user"
  info$host <- "dummy.host.name"
  paste0("mysql ", info$serverVersion, " [", info$user, "@",
         info$host, ":", info$port, "/", info$dbname, "]")
}

tbl_sum.tbl_sql <- function(x) {
  grps <- dbplyr::op_grps(x$ops)
  sort <- dbplyr::op_sort(x$ops)
  c(
    "Source" = dbplyr:::tbl_desc(x),
    "Database" = db_desc(x$src$con),
    if (length(grps) > 0) c("Groups" = dbplyr:::commas(grps)),
    if (length(sort) > 0) c("Ordered by" = dbplyr:::commas(dbplyr:::deparse_all(sort)))
  )
}
```
This R package contains helper functions for working with the MySQL database for
the [Learning To Talk](http://learningtotalk.org) project.
Installation
-------------------------------------------------------------------------------
Install the `devtools` package. Then install the package from GitHub.
```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("LearningToTalk/L2TDatabase")
```
Connecting with .cnf files
-------------------------------------------------------------------------------
Connections to the database are managed by [.cnf files][cnf-ref]. We use these
files so that login credentials and connection information are not hard-coded
into analysis scripts.
This package provides the helper function `make_cnf_file()`, which creates a .cnf
file from login and connection information. Once the file is created, we can
just point to it whenever we want to connect to the database.
```{r, results = 'hide'}
library("L2TDatabase")

# initialize a cnf file using all default (empty) values
make_cnf_file(dest = "./my_connection.cnf")

# all values filled
make_cnf_file(
  dest = "./my_connection.cnf",
  user = "tj",
  password = "dummy-password",
  db = "my_db",
  host = "localhost",
  port = 3306)
```
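Because a .cnf file is just plain text, we can sanity-check what `make_cnf_file()` wrote with base R. (A quick sketch, not evaluated here.)

```{r, eval = FALSE}
# peek at the file we just wrote (dummy demo credentials)
readLines("./my_connection.cnf")
```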
We use the function `l2t_connect()` to connect to the database. This function
takes the location of a .cnf file and the name of a database and returns a
connection. By default, `l2t_connect()` connects to the `l2t` database.
```{r, results = 'hide'}
# connect to the database
l2t <- l2t_connect(cnf_file = "./inst/l2t_db.cnf", db_name = "l2t")
```
Using dplyr to look at the database
-------------------------------------------------------------------------------
This package is built on top of [dplyr][dplyr-page], a package for dealing with
data stored in tables. dplyr provides a set of tools for working with remote
data sources --- that is, data in databases, usually on other computers.
Conventionally, to access data in a database, one has to write queries in the
database's own language to retrieve information. dplyr lets us write R code for
our queries instead, and it translates that R code into the language used by
the database. (See the dplyr [vignette on databases][dplyr-db] for more
information on how to work with remotely stored data using dplyr.)
In the terminology of dplyr, a remote _source_ of data is a `src`, and a _table_
of data is a `tbl`. To connect to a table of data, use `tbl(src, tbl_name)`.
For example, here's how I would connect to the `MinPair_Trials` table in the
database, which contains the trial-level data from the minimal-pairs experiment.
```{r}
library("dplyr", warn.conflicts = FALSE)
# use tbl to create a link to a tbl in the database
minp_resp <- tbl(src = l2t, from = "MinPair_Trials")
minp_resp
```
With dplyr, I can perform all kinds of operations on this table. Here, I
`filter()` rows to keep just the _man_-_moon_ practice trials and `select()` a
subset of columns.
```{r}
man_moon_practice <- minp_resp %>%
  filter(MinPair_Item1 == "man", MinPair_Item2 == "moon") %>%
  select(Study, ResearchID, MinPair_TargetItem, MinPair_Correct)
man_moon_practice
```
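Under the hood, dplyr has translated this pipeline into SQL. We can inspect that translation with `show_query()`, a standard dplyr verb. (An unevaluated sketch.)

```{r, eval = FALSE}
# see the SQL translation of the R pipeline above
show_query(man_moon_practice)
```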
This data lives in "the cloud" on a remote computer. That's why the first line
of the printout says `Source: lazy query [?? x 4]`.
When we print the data as we did above, dplyr downloads just enough rows of data
to give us a preview. There are approximately 12,000 rows of data in the
`minp_resp` table, and this just-a-preview behavior keeps us from accidentally
or prematurely downloading thousands of rows when we peek at the data. We have
to **use `collect()` to download data to our computer**.
```{r}
man_moon_practice <- collect(man_moon_practice)
man_moon_practice
```
In this printout, there is no longer a line specifying the source. Instead, we
are told that we have a `tibble`, which is just a kind of data-frame. The data
now lives locally, in our R session, so we can plot or model it like any other
data-frame in R.
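For instance, an ordinary dplyr verb such as `count()` now runs entirely locally. (A small sketch; any data-frame operation would do.)

```{r, eval = FALSE}
# tabulate the practice trials by study and target word, locally
count(man_moon_practice, Study, MinPair_TargetItem)
```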
**Take-away**: We use dplyr to create queries for data from tables in a
database, and we use `collect()` to download the results of the query to our
computer.
L2T Database conventions
-------------------------------------------------------------------------------
### User-ready data lives in the default database `l2t`
Information about our participants and their testing data are stored in two
separate databases. The first is `l2t`. This database contains user-friendly,
analysis-ready tables of information. `l2t` is the default database that
`l2t_connect()` uses, and it probably has the information that you need in a
ready-to-use form.
```{r}
# list all the tbls in the database
src_tbls(l2t)
```
The tables here are _queries_: tables that are computed on the fly whenever the
data is requested. For example, `MinPair_Aggregate` shows the proportion correct
on non-practice trials in the minimal-pairs task, by participant and by study.
(I `select()` a subset of columns to exclude unnecessary ones, like the name of
the Eprime file containing the raw data.) The `Study` and `ResearchID` columns
are the conventional identifiers for studies and participants.
```{r, warning = FALSE}
tbl(l2t, "MinPair_Aggregate") %>%
  select(Study, ResearchID, MinPair_Dialect, MinPair_ProportionCorrect)
```
The values in `MinPair_ProportionCorrect` are not stored or hard-coded but
computed on the fly as part of a query. Thus, if the trial-level data for the
experiment were to change in the database for some reason, the
`MinPair_ProportionCorrect` column would update automatically.
### Raw data lives in `backend`
The tables in `l2t` are queries, so they are assembled by combining information
from tables of raw data. The raw-data tables live in the second database (the
`backend` for our queries). These raw-data tables do not contain any of our
participant IDs or study names, so they are not user-friendly. You probably
don't need to work with the backend of the database unless you are developing
new aggregations of data that are not yet available in the `l2t` database.
```{r}
l2t_backend <- l2t_connect("./inst/l2t_db.cnf", db_name = "backend")
# list all the tbls in the database
src_tbls(l2t_backend)
```
Some of the tables in the backend are not tables of raw data but intermediate
helper queries that are used by the main database. These helper queries are
prefixed with `q_`. For example, `q_Household_Education`,
`q_Household_Maternal_Caregiver`, and `q_Household_Max_Maternal_Education` are a
pipeline of calculations that determine the highest maternal education level in
each household.
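A quick way to spot these helpers is to filter the table listing by that prefix. (A sketch using base R's `grep()`.)

```{r, eval = FALSE}
# list just the helper queries in the backend
grep("^q_", src_tbls(l2t_backend), value = TRUE)
```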
### Other databases
Our eyetracking data lives in `eyetracking`. I separated it from the other kinds
of data because it contains individual frames of eyetracking data, so it houses
a tremendous amount of data; one of the tables has at least 10 million rows.
This database also contains user-ready queries, which are prefixed with `q_`.
```{r}
l2t_eyetracking <- l2t_connect("./inst/l2t_db.cnf", db_name = "eyetracking")
src_tbls(l2t_eyetracking)
```
There is also a `norms` database with look-up tables that convert raw scores to
standardized scores for some of our standardized tests.
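It can be explored the same way as the other databases. (A sketch, not evaluated here.)

```{r, eval = FALSE}
l2t_norms <- l2t_connect("./inst/l2t_db.cnf", db_name = "norms")
src_tbls(l2t_norms)
```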
**Take-away**: The data you probably want lives in the default database `l2t`.
Metadata
-------------------------------------------------------------------------------
As I've worked on the backend of the database, I've been using the database
comments to describe the data that goes into each table and each field. We can
download these comments, along with other pieces of information about a table,
by using `describe_tbl()`. With this function, we can quickly create a "codebook"
to accompany our data.
```{r}
describe_tbl(src = l2t, tbl_name = "MinPair_Aggregate")
```
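To turn that output into a codebook file that travels with an exported data-set, we can write it out like any other data-frame. (A sketch; the file name is arbitrary, and I'm assuming `describe_tbl()` returns a data-frame, as the printout above suggests.)

```{r, eval = FALSE}
describe_tbl(src = l2t, tbl_name = "MinPair_Aggregate") %>%
  readr::write_csv("minpair_codebook.csv")
```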
In some queries, the fields are computed dynamically whenever the data is
requested; in the `MinPair_Aggregate` query, for example, the proportion correct
is calculated on demand. Our database system does not let us write comments for
these dynamically created columns, so that column has a blank for its
description.
We can also download the table-level comments from a database with
`describe_db()`, although these descriptions have a much tighter length limit.
Table-level comments are also unavailable for query tables, so this function is
only useful for raw-data tables.
```{r}
# just a few rows
describe_db(src = l2t_backend) %>% head()
```
These two forms of metadata are backed up by the `l2t_backup()` helper function.
Back up
-------------------------------------------------------------------------------
We can download and back up each table in a database with `l2t_backup()`. The
final two messages from the back-up function show that the metadata tables are
saved to a `metadata` folder.
Here's how backing up the backend of the database looks:
```{r, warning = FALSE}
# back up each tbl
backup_dir <- "./inst/backup"
all_tbls <- l2t_backup(src = l2t_backend, backup_dir = backup_dir)
# l2t_backup() also returns each tbl in a list, so we can view them as well.
all_tbls$EVT
```
### Dumping the database
A final option for backing up the database is `dump_database()`. This function
calls on the `mysqldump` utility, which exports a database into a series of SQL
statements that can be used to reconstruct the database. This function is very
finicky because it requires other programs to be installed on one's machine.
```{r, eval = FALSE}
dump_database(
  cnf_file = "./inst/l2t_db.cnf",
  backup_dir = "./inst/backup",
  db_name = "l2t")

dump_database(
  cnf_file = "./inst/l2t_db.cnf",
  backup_dir = "./inst/backup",
  db_name = "backend")
```
Writing new data to a database
-------------------------------------------------------------------------------
dplyr provides read-only access to a database, so we can't accidentally do
stupid things to our data. But we do want to use R to migrate existing
data-frames into the database, again without doing stupid things. Therefore,
I've developed conservative helper functions for writing data. These functions
work on dplyr-managed database connections. For the purposes of this demo, we
will work on the separate `l2t_test` database.
The function `append_rows_to_table()` simply adds new rows to a database table.
```{r}
l2t_test <- l2t_connect("./inst/l2t_db.cnf", db_name = "l2ttest")
# Before writing
tbl(l2t_test, "TestWrites")
# Add rows to table
append_rows_to_table(
  src = l2t_test,
  tbl_name = "TestWrites",
  rows = data_frame(Message = "Hello!"))
# After writing
tbl(l2t_test, "TestWrites")
```
I also have an _experimental_ helper function, `overwrite_rows_in_table()`,
which updates existing rows in a table, but it is not as robust or
user-friendly as I would like. In my scripts, I usually run lots of checks on
the data before and after using this function to confirm that it behaves as
expected; a sketch of that pattern follows.
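For example, a conservative session might bracket the overwrite with before-and-after snapshots. (An unevaluated sketch: I'm assuming `overwrite_rows_in_table()` takes the same `src`/`tbl_name`/`rows` arguments as `append_rows_to_table()`, and `rows_to_fix` is a hypothetical data-frame of corrected rows.)

```{r, eval = FALSE}
before <- collect(tbl(l2t_test, "TestWrites"))

# hypothetical call; check the function's documentation for the real signature
overwrite_rows_in_table(
  src = l2t_test,
  tbl_name = "TestWrites",
  rows = rows_to_fix)

after <- collect(tbl(l2t_test, "TestWrites"))

# confirm that the overwrite did not add or drop rows
stopifnot(nrow(before) == nrow(after))
```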
```{r clean up, echo = FALSE, results = 'hide'}
DBI::dbSendQuery(l2t_test$con, "DELETE FROM `l2ttest`.`TestWrites`")
file.remove("./my_connection.cnf")
```
Other helpers
-------------------------------------------------------------------------------
This package also provides some helper functions for working with our data.
`undo_excel_date()` converts Excel's dates into R dates. `chrono_age()` computes
the number of months (rounded down) between two dates, as you would when
computing chronological age.
```{r}
# Create a date and another 18 months later
dates <- list()
dates$t1 <- undo_excel_date(41659)
dates$t2 <- undo_excel_date(41659 + 365 + 181)
str(dates)
# Chrono age in months, assuming t1 is a birthdate
chrono_age(dates$t2, dates$t1)
# More chrono_age examples
chrono_age("2014-01-20", "2012-01-20")
chrono_age("2014-01-20", "2011-12-20")
chrono_age("2014-01-20", "2011-11-20")
```
Repository structure
-------------------------------------------------------------------------------
This repository is an R package, so the `R/`, `man/`, and `tests/` directories
contain the source code, documentation, and unit tests for the package.
Most of the action is in the `inst/` directory: `inst/migrations/` contains
R scripts that add data from our various spreadsheets and csv files to
the database, `inst/audit/` contains helper scripts that check for
inconsistencies in our data, and `inst/views/` contains the source code for the
SQL queries used by the database.
Data audits
-------------------------------------------------------------------------------
We have some scripts that audit our data-sets. The results of these checks are
printed here.
```{r, echo = FALSE, message = FALSE}
evt <- readr::read_csv("./inst/audit/results_evt.csv") %>%
  mutate(Link = "[EVT](inst/audit/audit_evt.md)")

ppvt <- readr::read_csv("./inst/audit/results_ppvt.csv") %>%
  mutate(Link = "[PPVT](inst/audit/audit_ppvt.md)")

integrity <- readr::read_csv("./inst/audit/results_integrity.csv") %>%
  mutate(Link = "[Data entry discrepancies](inst/audit/check_integrity.md)")

docs <- readr::read_csv("./inst/audit/results_metadata.csv") %>%
  mutate(Link = "[MySQL metadata completeness](inst/audit/check_documentation.md)")

checks <- bind_rows(evt, ppvt, integrity, docs) %>%
  mutate(Result = ifelse(Passing, ":white_check_mark:", ":x:")) %>%
  select(Check, Date, Passing, Result, Link)

knitr::kable(checks)
```
Study coverage
-------------------------------------------------------------------------------
The following table summarizes how many scores/administrations of each task
were collected for each study. It is included here to show which tasks and which
studies have been migrated into the database.
```{r, echo = FALSE, warning = FALSE, message = FALSE}
tbls_to_check <- c(
  BRIEF = "BRIEF",
  Blending = "Blending_Summary",
  CTOPP_Memory = "CTOPP_Memory",
  CTOPP_Blending = "CTOPP_Blending",
  CTOPP_Elision = "CTOPP_Elision",
  DELV_Risk = "DELV_Risk",
  DELV_Variation = "DELV_Variation",
  EVT = "EVT",
  FruitStroop = "FruitStroop",
  GFTA = "GFTA",
  KBIT = "KBIT",
  LENA = "LENA_Averages",
  MinPair = "MinPair_Aggregate",
  PPVT = "PPVT",
  Rhyming = "Rhyming_Aggregate",
  SAILS = "SAILS_Aggregate",
  VerbalFluency = "VerbalFluency")

count_studies <- function(tbl_name, db = l2t) {
  responses <- tbl(db, tbl_name) %>%
    collect() %>%
    select(everything(), Date = ends_with("Completion")) %>%
    filter(!is.na(Date))
  count_responses(responses)
}

count_responses <- . %>%
  group_by(Study) %>%
  summarise(
    Participants = n_distinct(ResearchID),
    Administrations = n()) %>%
  select(Study, Participants, Administrations)

study_counts <- tbls_to_check %>%
  lapply(count_studies)

# Count eyetracking administrations too
db_et <- l2t_connect("./inst/l2t_db.cnf", "eyetracking")

et_blocks <- tbl(db_et, "Blocks") %>%
  left_join(tbl(db_et, "q_BlocksByStudy")) %>%
  select(Study, ResearchID, Task, Block_Basename) %>%
  collect() %>%
  mutate(Task = paste0("Eyetracking: ", Task))

et_counts <- et_blocks %>%
  split(.$Task) %>%
  lapply(count_responses)

df_counts <- c(study_counts, et_counts) %>%
  bind_rows(.id = "Table") %>%
  arrange(Table, Study)

knitr::kable(df_counts)
```
[cnf-ref]: http://svitsrv25.epfl.ch/R-doc/library/RMySQL/html/RMySQL-package.html
[dplyr-db]: http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
[blog-backup]: http://tjmahr.com/post/127080928329/using-dplyr-to-back-up-a-mysql-database
[dplyr-page]: https://cran.rstudio.com/web/packages/dplyr/