-
Notifications
You must be signed in to change notification settings - Fork 0
/
tutorial_openfiles.Rmd
275 lines (206 loc) · 14.9 KB
/
tutorial_openfiles.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
title: "Import data from different file formats into R"
author: "Freda Wan"
date: "4 November, 2021"
output: html_document
---
 
<style type="text/css">
.table {
width: 80%;
}
</style>
This tutorial covers basic functions to import data from different formats and access them as tibbles.
## At a glance
File type Extension package::function
---------- ------------- ----------------------------
CSV .csv [readr::read_csv()](#ref-csv)
Excel .xls or .xlsx [readxl:read_excel()](#ref-xls)
Plain text .txt [readr::read_lines()](#ref-txt)
SPSS .sav [haven::read_sav()](#ref-spss)
Binary R data .rds [readr::read_rds()](#ref-rds)
JSON .json [jsonlite::fromJSON()](#ref-json)
 
## 1. CSV{#ref-csv}
Most will be familiar with importing common-separated values (CSV) files using the `read_csv()` function offered in Tidyverse's 'readr' package. But what if your data is separated by tabs, semicolons, or other characters such as "|"? It is helpful to know that similar functions in 'readr' can readily help you import your data into R.
- For tab-separated files (.tsv), use `read_tsv()`.
- For semicolon-separated files, where the comma is the decimal point not the separator, use `read_csv2()`.
Here is the syntax.
```{r, warning=FALSE, message=TRUE, eval=FALSE}
#library(tidyverse)
tbl_csv <- read_csv("filename.csv")
tbl_csv2 <- read_csv2("filename.csv")
tbl_tsv <- read_tsv("filename.tsv")
```
### Other delimiters
If your data are delimited by other characters such as "|", use `read_delim()` and specify the delimiter.
For example, if the text file looks like the dummy data below, we can see that "|" is the delimiter. (To try out the code below, you can copy the text, save it to a file and name it `delim_data.txt`.)
```
ParticipantID|Condition1|Condition2|Condition3|Control
130059284|0.4|0.01|0.2|0
290100722|0.3|0.02|0.3|1
387005398|0.5|0.01|0.4|0
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
library(tidyverse)
```
```{r, warning=FALSE}
dat_delim <- read_delim("delim_data.txt", delim = "|")
```
R is now able to read the data as a tibble:
```{r, echo=FALSE, warning=FALSE, message=FALSE}
knitr::kable(dat_delim, align = "c") %>% kableExtra::kable_styling("striped")
```
### Changing the data type for each column
Sometimes we want to specify the data type for each column. When we load the data, R generates a message (as above) to indicate the data type of each column. In our case, R has recognised all 5 columns as doubles or `dbl`, which is a class of numeric data, similar to float in other programming languages.
Let's say we want `ParticipantID` as characters and `Control` as boolean (i.e., TRUE or FALSE, known as logical in R), we can add the `col_types` argument.
In the code below, `col_types = "c???l"` specifies that for our 5 columns, we want to change the data type of `ParticipantID` to character and `Control` to logical. Since we don't need to change the data type for Condition1, Condition2, and Condition3, we use `?` to allow R to guess.
```{r, warning=FALSE}
dat_delim_col_types <- read_delim("delim_data.txt", delim = "|", col_types = "c???l")
spec(dat_delim_col_types) # check column data types
```
You can specify the data type you want by using one character to represent each column. The column types are: `c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time,` `? = guess, or _/- to skip the column`.
The argument `col_types` is also used in other functions in this tutorial, including `read_excel()` for Excel data and `read_table()` for reading text as a tibble.
### Write data to CSV
After loading data from various file formats and having cleaned and wrangled the data, you may want to save the tibble to .csv. This would allow you to view the data later or share the file with others without having to run the whole R script again.
Here is the syntax.
`write_csv(cleaned_data, "cleaned_data.csv")`
Read more about 'readr' package and see the [cheat sheet here](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf). Related packages include 'readxl' (detailed below) and 'googlesheets4' which allows you to read data from and write data to Google Sheets.
## 2. Excel (.xls or .xlsx){#ref-xls}
We use the 'readxl' package, part of Tidyverse, to read Excel data into R. Try out the 'readxl' functions below either using your data or download the [Demo.xlsx file here](./Demo.xlsx) (111KB). The data have been adapted from a public domain dataset, Chocolate Bar Ratings ([Tatman, 2017](#ref-choc)), which contains expert ratings of 1700 chocolate bars.
For demonstration purposes, the demo file contains 3 sheets, sorted by chocolate makers' company names starting A to G (Sheet 1), H to Q (Sheet 2), and R to Z (Sheet 3). You can access sheets either by name or by index.
```{r, warning=FALSE, message=FALSE}
# Suggested: install 'readxl' and load separately
library(readxl)
dat_excel<- readxl::read_excel("Demo.xlsx") # by default, this loads Sheet 1 only.
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
knitr::kable(head(dat_excel), align = "c") %>% kableExtra::kable_styling("striped")
```
If you check `dat_excel` (using either `summary()` from base R or `dplyr::glimpse()`), you will see that if you don't specify the sheet, only the first sheet is loaded.
Use the 'readr' function `excel_sheets()` to get a list of sheet names.
```{r, warning=FALSE, message=FALSE}
excel_sheets("Demo.xlsx")
```
Then, you can specify which sheet you want to load, by sheet name or index.
```{r, warning=FALSE, message=FALSE}
sheet_by_name <- read_excel("Demo.xlsx", sheet = "Companies_H_to_Q") #load sheet 2 by name
sheet_by_index <- read_excel("Demo.xlsx", sheet = 3) #load sheet by index
```
### Dealing with formulas and formatted cells
Sometimes you may encounter Excel data that contain formulas written in VBA macros or highlighted cells in yellow or in bold.
If you want to extract the formulas or formatted cells from Excel, there are R functions that could help. The 'tidyxl' package has the function `tidyxl::xlsx_cells()` which reads the property of each cell in an Excel spreadsheet, such as the data type, formatting, whether it is a formula, the cell's font, height and width. Please refer to the ['tidyxl' vignette](https://cran.r-project.org/web/packages/tidyxl/vignettes/tidyxl.html).
However, proceed with caution! Spreadsheet errors, either due to human mistakes or Excel's autocorrect functions, have raised reproducibility concerns in behavioural science and genetics research (see [Alexander, 2013](#ref-alexander); [Lewis, 2021](#ref-lewis)). If you see strange behaviour from your Excel data, check the file: Are the formulas referring to the correct cells? Has a large number been autocorrected into a date? When in doubt, open the file in Excel to check.
## 3. Plain text (.txt){#ref-txt}
When your data are in .txt files, you can either use readLines() from base or readr::read_lines() from Tidyverse. Here, we use the latter, as it runs faster for large datasets.
As an example, we use a simple dataset from [Stigliani and Grill-Spector (2018)](#ref-stigliani), a neuroscience study. You can [click here](https://osf.io/vzy8h/download) to download the data directly (392 bytes).
First, we read in the file to see what it contains.
```{r}
#load the tidyverse package
readr_text <- read_lines(("Exp1_Run1.txt"), n_max=10)
readr_text
```
The `n_max` argument above specifies how many lines of data you would like to read. We can see that we need to skip the first 3 lines for the data to form a tibble. We use `read_table()` to create the tibble.
```{r, warning=FALSE, message=FALSE}
dat_txt <- read_table(file = "Exp1_Run1.txt", skip = 3) # skip first 3 lines
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
knitr::kable(dat_txt, align = "c") %>% kableExtra::kable_styling("striped")
```
By default, the argument `col_names` is set to TRUE, so that the first row of text input will be imported as column names. If you set it to FALSE, the column names will be generated automatically as X1, X2, X3, etc.
If you want to rename columns in the same line of code, you can enter a character vector for `col_names`. For example:
```{r, warning=FALSE, message=FALSE}
dat_txt_new_column_names <- read_table(file = "Exp1_Run1.txt", skip = 4, col_names = c("Trial_new", "Condition_new", "Onset_new", "Duration_new", "Image_new"))
#since we are renaming the columns, we skip 4 lines
```
The column names will become:
```{r, warning=FALSE, message=FALSE, echo=FALSE}
knitr::kable(head(dat_txt_new_column_names, n=1), align = "c") %>% kableExtra::kable_styling("striped")
```
Of course, another way to rename columns would be using `dplyr::rename(data, new_name = old_column_name)` or `dplyr::rename_with(data, function)`. For example, running the following will turn all column names to upper case. Try it yourself and see.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
dat_txt_upper <- rename_with(dat_txt, toupper)
```
## 4. SPSS data (.sav){#ref-spss}
We will use the 'haven' package, which is part of Tidyverse, to import SPSS data into a tibble.
The example below uses SPSS data from [Norman et al.(2021)](#ref-norman), Study 2, which examines adult identity. You can [click here](https://osf.io/a6hxy/download) to download the data directly (564KB).
```{r, warning=FALSE, message= FALSE, echo=TRUE}
# Suggested: install the package 'haven' and load it in addition to Tidyverse.
library(haven)
dat_sav <- haven::read_sav("EA Across Age Data.sav")
```
The tibble `dat_sav` has 173 columns and 658 rows. If you only need to load a subset of columns, the `col_select` argument allows you to select columns by index or by column name. Below is an example of using `col_select` and what the output looks like. This would be an alternative to `dplyr::select()`.
```{r, warning=FALSE, message= FALSE}
#load the first 8 columns
dat_select_by_index <- read_sav("EA Across Age Data.sav", col_select=(1:8))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE}
knitr::kable(head(dat_select_by_index), align = "c") %>% kableExtra::kable_styling("striped")
```
 
```{r, warning=FALSE, message=FALSE}
#load columns with name starting with "IDEA"
dat_select_by_colname <- read_sav("EA Across Age Data.sav",
col_select = starts_with("IDEA"))
```
```{r, echo=FALSE, warning=FALSE, message= FALSE}
library(tidyverse)
knitr::kable(head(dat_select_by_colname), align="c") %>% kableExtra::kable_styling("striped") %>% kableExtra::scroll_box(width = "99%")
```
#### Related
If you encounter .por files, which are ASCII text data files generated by SPSS, use the `haven::read_por()` function.
The 'haven' package can also read Stata and SAS files into R. [Read more](https://haven.tidyverse.org)
## 5. Binary R data (.rds){#ref-rds}
RDS files store datasets in a compressed format to save storage space. Additionally, RDS preserves data types such as dates and factors, so we don't need to worry about redefining data types after reading the file into R.
To read .rds files, use either `readRDS()` from baseR or `read_rds()` from Tidyverse's 'readr' package. We use Tidyverse in the example below.
The example below uses data from [Lukavsky (2018)](#ref-lukavsky), Experiment 1. The study investigates participants' ability to recognise what they have seen in their central and peripheral vision. You can [click here](https://osf.io/dy5hv/download) to download the data directly (185KB).
```{r, warning=FALSE, message=FALSE}
#library(tidyverse)
dat_binary <- read_rds("exp1.rds")
```
You will see that this dataset has over 5300 rows and 26 columns. Here is what the first 6 lines look like.
```{r, warning=FALSE, message=FALSE, echo=FALSE}
knitr::kable(head(dat_binary), align = "c") %>% kableExtra::kable_styling("striped") %>% kableExtra::scroll_box(width = "99%")
```
## 6. JSON{#ref-json}
JSON files store nested lists or data in a tree-like structure. We will use the 'jsonlite' package to view and access the data in R.
You can download an [example.json](./example.json) file here (4KB). The data are sourced from the [International Union for Conservation of Nature Red List of Threatened Species](https://www.iucnredlist.org).
```{r, warning=FALSE, message=FALSE}
# install the 'jsonlite' package first
library(jsonlite)
dat_json <- fromJSON(txt="example.json", simplifyDataFrame = FALSE)
```
From the RStudio Viewer you would see that the data contain information about 4 animals. ![](./dat_json_screenshot.jpg){width=750px}
You can also navigate the data using `names()` from base or simply type `dat_json$"Panthera leo"`. The dollar sign `$` refers to a variable or column. In RStudio, as you type in `data_object_name$`, the available variables or columns will be shown for your choice.
```{r, message=FALSE, warning=FALSE}
names(dat_json) #gets names of what's in the object
names(dat_json$`Panthera tigris`) # get variable names one level down. Use `` or "" for variable names containing spaces.
```
Use `as_tibble()` to put the data into a tibble for further processing.
```{r, messages = FALSE, warning=FALSE}
tiger_conservation <- dat_json$`Panthera tigris`$`conservation actions in place` %>% as_tibble()
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
knitr::kable(tiger_conservation) %>% kableExtra::kable_styling("striped")
```
You can transpose the tibble so it is easier to read.
```{r}
tiger_conversation_long_view <- tiger_conservation %>% pivot_longer(cols = everything())
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
knitr::kable(tiger_conversation_long_view) %>% kableExtra::kable_styling("striped")
```
## Reference
Alexander, R. (2013, April 20). Reinhart, Rogoff... and Herndon: The student who caught out the profs. BBC. [https://www.bbc.com/news/magazine-22223190](https://www.bbc.com/news/magazine-22223190){#ref-alexander}
International Union for Conservation of Nature. (n.d.). The IUCN Red List of Threatened Species. [https://www.iucnredlist.org](https://www.iucnredlist.org)
Lewis, D. (2021, August 25). Autocorrect errors in Excel still creating genomics headache. Nature. [https://www.nature.com/articles/d41586-021-02211-4](https://www.nature.com/articles/d41586-021-02211-4){#ref-lewis}
Lukavsky, J. (2018, December 5). Scene categorization in the presence of a distractor. Retrieved from [osf.io/849wm](https://osf.io/849wm){#ref-lukavsky}
Norman, K., Hernandez, L., & Obeid, R. (2021, January 12). Study 2. Who Am I? Gender Differences in Identity Exploration During the Transition into Adulthood. Retrieved from [osf.io/agfvz](https://osf.io/agfvz){#ref-norman}
Stigliani, A., & Grill-Spector, K. (2018, July 5). Temporal Channels. [https://doi.org/10.17605/OSF.IO/MW5PK](https://doi.org/10.17605/OSF.IO/MW5PK){#ref-stigliani}
Tatman, R. (2017). Chocolate Bar Ratings. Kaggle Datasets. [https://www.kaggle.com/rtatman/chocolate-bar-ratings](https://www.kaggle.com/rtatman/chocolate-bar-ratings){#ref-choc}