---
output: html_document
---
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(out.width='90%', fig.align = 'center')
```
# Introduction to R/Bioconductor
## Installing packages
### CRAN
The Comprehensive R Archive Network ([CRAN](https://cran.r-project.org/)) is the biggest archive of R packages. There are few requirements for uploading packages besides building and installing successfully, hence documentation and support are often minimal and figuring out how to use these packages can be a challenge in itself. CRAN is the default repository R will search to find packages to install:
```{r, eval=FALSE}
install.packages("devtools")
require("devtools")
```
### GitHub
[GitHub](https://github.com/) isn't specific to R; code of any type, in any state, can be uploaded. There is no guarantee that a package uploaded to GitHub will even install, never mind do what it claims to do. R packages can be downloaded and installed directly from GitHub using the "devtools" package installed above.
```{r, eval=FALSE}
devtools::install_github("tallulandrews/M3Drop")
```
GitHub is built on the Git version control system, which stores multiple versions of any package. By default the most recent "master" version of the package is installed. If you want an older version or the development branch, this can be specified using the "ref" parameter:
```{r, eval=FALSE}
# different branch
devtools::install_github("tallulandrews/M3D", ref="nbumi")
# previous commit
devtools::install_github("tallulandrews/M3Drop", ref="434d2da28254acc8de4940c1dc3907ac72973135")
```
Note: make sure you re-install the M3Drop master branch for later in the course.
### Bioconductor
Bioconductor is a repository of R packages specifically for biological analyses. It has the strictest requirements for submission, including installation on every platform and full documentation with a tutorial (called a vignette) explaining how the package should be used. Bioconductor also encourages the use of standard data structures/classes and coding style/naming conventions, so that, in theory, packages and analyses can be combined into large pipelines or workflows.
```{r, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
```
Note: in some situations it is necessary to substitute "http://" for "https://" in the above depending on the security features of your internet connection/network.
Bioconductor also requires creators to support their packages and has a regular 6-month release schedule. Make sure you are using the most recent release of Bioconductor before trying to install packages for the course.
```{r, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
```
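The `biocLite()` script shown above only works with older versions of R (before 3.5). Recent Bioconductor releases are installed through the `BiocManager` package from CRAN instead. The chunk below is a minimal sketch of that route, assuming a working internet connection; the helper name `install_bioc` is hypothetical and simply wraps `install.packages()` and `BiocManager::install()`.

```{r, eval=FALSE}
# Hypothetical helper: install BiocManager if it is missing,
# then use it to install the requested Bioconductor packages.
install_bioc = function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}
# Usage (not run here because it downloads packages):
# install_bioc("edgeR")
```

Checking with `requireNamespace()` first avoids re-installing `BiocManager` every time the code is run.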
### Source
The final way to install packages is directly from source. In this case you have to download a fully built source code file, usually packagename.tar.gz, or clone the github repository and rebuild the package yourself. Generally this will only be done if you want to edit a package yourself, or if for some reason the former methods have failed.
```{r, eval=FALSE}
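# repos=NULL tells R that this is a local file rather than a package name to look up on CRAN
install.packages("M3Drop_3.05.00.tar.gz", repos=NULL, type="source")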
```
## Installation instructions:
All the packages necessary for this course are available [here](https://github.com/hemberg-lab/scRNA.seq.course/blob/master/Dockerfile). Starting from `RUN Rscript -e "install.packages('devtools')"`, either run each of the commands (minus `RUN`) on the command line, or start an R session and run each of the commands within the quotation marks. Note that the ordering of the installation is important in some cases, so make sure you run them in order from top to bottom.
## Data-types/classes
R is a high level language so the underlying data-type is generally not important. The exception is if you are accessing R data directly using another language such as C, but that is beyond the scope of this course. Instead we will consider the basic data classes: numeric, integer, logical, and character, and the higher level data class called "factor". You can check what class your data is using the "class()" function.
Aside: R can also store data as "complex" for complex numbers but generally this isn't relevant for biological analyses.
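As a quick illustration of the difference between the class of a value and its underlying storage type, the sketch below applies `class()` and `typeof()` to one example of each basic class. The variable names are illustrative only.

```{r}
# One example value for each basic class (plus complex)
example_values = list(1.5, 2L, TRUE, "text", 3+2i)
# class() reports the high-level class we work with in this course
value_classes = sapply(example_values, class)
value_classes
# typeof() reports the underlying storage type; note "double" for numeric
value_types = sapply(example_values, typeof)
value_types
```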
### Numeric
The "numeric" class is the default class for storing any numeric data - integers, decimal numbers, numbers in scientific notation, etc...
```{r}
x = 1.141
class(x)
y = 42
class(y)
z = 6.02e23
class(z)
```
Here we see that even though R has an "integer" class and 42 could be stored more efficiently as an integer, the default is to store it as "numeric". If we want 42 to be stored as an integer we must "coerce" it to that class:
```{r}
y = as.integer(42)
class(y)
```
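An alternative to coercion is to write the number as an integer literal by adding the suffix `L`. The short sketch below (variable names are illustrative) also shows that coercing a decimal number to integer truncates it, and that dividing two integers gives a numeric result.

```{r}
int_literal = 42L              # the L suffix creates an integer directly
class(int_literal)
identical(int_literal, as.integer(42))
truncated = as.integer(3.9)    # the decimal part is dropped, not rounded
truncated
halved = int_literal / 2L      # division returns "numeric" even for integers
class(halved)
```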
Coercion forces R to store data as a particular class. If our data is incompatible with that class, R will still perform the coercion, but the incompatible values will be converted to NAs:
```{r}
as.numeric("H")
```
Above we tried to coerce "character" data, identified by the double quotation marks, into numeric data. This doesn't make sense, so R triggered ("threw") a warning message. Since this is only a warning, R would continue with any subsequent commands in a script/function, whereas an "error" would cause R to halt.
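The difference between a warning and an error can be seen by catching each with `tryCatch()`. This is only a sketch: `log("H")` is used here as one convenient way to trigger an error, and the variable names are illustrative.

```{r}
# A warning: the coercion still returns a value (NA)
warn_msg = tryCatch(as.numeric("H"), warning = function(w) conditionMessage(w))
warn_msg
na_value = suppressWarnings(as.numeric("H"))
na_value
# An error: log() cannot work on character data, so no value is returned
err_msg = tryCatch(log("H"), error = function(e) conditionMessage(e))
err_msg
```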
### Character/String
The "character" class stores all kinds of text data. Programming convention calls data containing multiple letters a "string", thus most R functions which act on character data will refer to the data as "strings" and will often have "str" or "string" in their names. Strings are identified by being flanked by quotation marks (double quotation marks are used throughout this course), whereas variable/function names are not:
```{r}
x = 5
a = "x" # character "x"
a
b = x # variable x
b
```
In addition to standard alphanumeric characters, strings can also store various special characters. Special characters are identified using a backslash followed by a single character; the most relevant are the special characters for tab (`\t`) and new line (`\n`). To demonstrate these special characters let's concatenate (cat) together two strings with these characters separating (sep) them:
```{r}
cat("Hello", "World", sep= " ")
cat("Hello", "World", sep= "\t")
cat("Hello", "World", sep= "\n")
```
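Note that special characters are displayed differently by different functions. For instance, the `paste` function joins strings in the same way as `cat`, but it returns the result rather than writing it out, so when the result is printed the special characters are shown as escape sequences instead of being interpreted.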
```{r}
paste("Hello", "World", sep= " ")
paste("Hello", "World", sep= "\t")
paste("Hello", "World", sep= "\n")
```
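The tab is still stored in the string returned by `paste`; only the way it is displayed differs. A small sketch (the variable name `tabbed` is illustrative):

```{r}
tabbed = paste("Hello", "World", sep = "\t")
tabbed             # printing shows the escape sequence \t
nchar(tabbed)      # the tab counts as one character: 5 + 1 + 5
cat(tabbed, "\n")  # cat() writes the tab itself
```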
The backslash is also used as an `escape` character to turn off special characters or allow quotation marks to be included in strings; a literal backslash is itself written as a double backslash (`\\`):
```{r}
cat("This \"string\" contains quotation marks.")
```
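The same escaping applies to the backslash itself. In the sketch below the Windows-style path is a made-up example, not a file used in this course.

```{r}
win_path = "C:\\data\\counts.tsv"  # hypothetical path; each \\ is one backslash
cat(win_path, "\n")
nchar("\\")                        # a single character
nchar(win_path)
```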
Special characters are generally only used in pattern matching, and reading/writing data to files. For instance this is how you would read a tab-separated file into R.
```{r, eval=FALSE}
dat = read.delim("file.tsv", sep="\t")
```
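To try this without an existing file, the sketch below writes a small tab-separated table to a temporary file and reads it back. The gene names and counts are made up for illustration.

```{r}
tmp_file = tempfile(fileext = ".tsv")
# Each line is one row; columns are separated by the tab special character
writeLines(c("gene\tcount", "GeneA\t10", "GeneB\t0"), tmp_file)
tsv_dat = read.delim(tmp_file, sep = "\t", stringsAsFactors = FALSE)
tsv_dat
dim(tsv_dat)
unlink(tmp_file)  # remove the temporary file
```

Passing `stringsAsFactors = FALSE` explicitly keeps the text column as character data regardless of the R version or session options.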
Another special type of character data are colours. Colours can be specified in three main ways: by name from those [available](http://bxhorn.com/r-color-tables/), by red, green, blue values using the `rgb` function, and by hue (colour), saturation (colour vs white) and value (colour/white vs black) using the `hsv` function. By default rgb and hsv expect three values in 0-1 with an optional fourth value for transparency. Alternatively, sets of predetermined colours with useful properties can be loaded from many different packages with [RColorBrewer](http://colorbrewer2.org/) being one of the most popular.
```{r}
reds = c("red", rgb(1,0,0), hsv(0, 1, 1))
reds
barplot(c(1,1,1), col=reds, names.arg=c("by_name", "by_rgb", "by_hsv"))
```
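Whichever way a colour is specified, R stores it as a character string. The sketch below uses base R functions only to convert between the representations, including the optional fourth value for transparency. Variable names are illustrative.

```{r}
red_channels = col2rgb("red")  # named colour -> red, green, blue on a 0-255 scale
red_channels
semi_red = rgb(1, 0, 0, 0.5)   # fourth value adds 50% transparency
semi_red
red_255 = rgb(255, 0, 0, maxColorValue = 255)  # rgb values given on a 0-255 scale
red_255
```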
### Logical
The `logical` class stores boolean truth values, i.e. TRUE and FALSE. It is used for storing the results of logical operations, and the conditions of conditional statements are coerced to this class. Most other data-types can be coerced to boolean without triggering (or "throwing") error messages, which may cause unexpected behaviour.
```{r}
x = TRUE
class(x)
y = "T"
as.logical(y)
z = 5
as.logical(z)
x = FALSE
class(x)
y = "F"
as.logical(y)
z = 0
as.logical(z)
```
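Coercion also works in the other direction: logical values are treated as 1 and 0 in arithmetic, which makes it easy to count how often a condition holds. A small sketch with made-up counts:

```{r}
example_counts = c(0, 3, 0, 12, 1)  # made-up expression counts
detected = example_counts > 0       # result of a logical operation
detected
class(detected)
n_detected = sum(detected)          # TRUE counts as 1, FALSE as 0
n_detected
frac_detected = mean(detected)      # fraction of values that are TRUE
frac_detected
```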
__Exercise 1__
Experiment with other character and numeric values. Which are coerced to TRUE or FALSE? Which are coerced to neither?
Do you ever throw a warning/error message?
### Factors
String/Character data is very memory inefficient to store; each letter generally requires the same amount of memory as any integer. Thus when storing a vector of strings with repeated elements it is more efficient to assign each element to an integer and store the vector as integers plus an additional string-to-integer association table. For this reason, versions of R before 4.0.0 read in text columns of a data table as factors by default.
```{r}
str_vector = c("Apple", "Apple", "Banana", "Banana", "Banana", "Carrot", "Carrot", "Apple", "Banana")
factored_vector = factor(str_vector)
factored_vector
as.numeric(factored_vector)
```
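The string-to-integer association table of a factor can be inspected with `levels()`. The sketch below uses a shorter, illustrative fruit vector and shows how the integer codes map back to the original strings.

```{r}
fruit_factor = factor(c("Apple", "Banana", "Apple", "Carrot"))
fruit_levels = levels(fruit_factor)    # the association table
fruit_levels
fruit_codes = as.integer(fruit_factor) # the stored integer codes
fruit_codes
fruit_levels[fruit_codes]              # codes index into the levels
table(fruit_factor)                    # number of occurrences of each level
```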
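The double nature of factors can cause some unintuitive behaviour. E.g. in versions of R before 4.1.0, joining two factors together with `c()` converts them to the numeric form and the original strings are lost (from R 4.1.0 onwards `c()` returns a factor with the combined levels).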
```{r}
c(factored_vector, factored_vector)
```
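One way to join factors that behaves the same in every version of R is to convert them to strings first and re-create the factor afterwards. A minimal sketch with illustrative variable names:

```{r}
f1 = factor(c("Apple", "Banana"))
f2 = factor(c("Carrot", "Apple"))
# Convert to character before joining so the strings are preserved
combined = factor(c(as.character(f1), as.character(f2)))
combined
levels(combined)
```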
Likewise, if numeric data is mistakenly interpreted as strings due to formatting issues, then you must convert the factor back to strings before coercing to numeric values:
```{r}
x = c("20", "25", "23", "38", "20", "40", "25", "30")
x = factor(x)
as.numeric(x)
as.numeric(as.character(x))
```
To make R read text as character data instead of factors set the environment option `stringsAsFactors=FALSE`. In versions of R before 4.0.0 this must be done at the start of each R session; from R 4.0.0 onwards `FALSE` is already the default.
```{r}
options(stringsAsFactors=FALSE)
```
__Exercise__
How would you use factors to create a vector of colours for an arbitrarily long vector of fruits like `str_vector` above?
__Answer__
```{r, include=FALSE}
long_str_vector = c(str_vector, str_vector, str_vector)
fruit_cols = c("red", "yellow", "orange")
fruit_colour_vec = fruit_cols[as.numeric(factor(long_str_vector, levels=c("Apple", "Banana", "Carrot")))]
```
### Checking class/type
We recommend checking your data is of the correct class after reading from files:
```{r}
x = 1.4
is.numeric(x)
is.character(x)
is.logical(x)
is.factor(x)
```
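For a whole table, the class of every column can be checked in one step. The sketch below builds a small made-up data frame in place of one read from a file; the column names are illustrative.

```{r}
cell_table = data.frame(
  cell = c("c1", "c2", "c3"),
  n_genes = c(2100L, 1850L, 3020L),
  passed_qc = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
column_classes = sapply(cell_table, class)  # class of each column
column_classes
str(cell_table)                             # compact overview of the table
```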
## Basic data structures
So far we have only looked at single values and vectors. Vectors are the simplest data structure in R. They are a 1-dimensional array of data all of the same type. If the input when creating a vector is of different types, it will be coerced to the most general data-type needed to represent every element.
```{r}
x = c("Hello", 5, TRUE)
x
class(x)
```
Here we tried to put character, numeric and logical data into a single vector so all the values were coerced to `character` data.
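The coercion follows a fixed order, from logical to integer to numeric to character. The sketch below mixes two types at a time to show each step.

```{r}
class(c(TRUE, 2L))   # logical + integer   -> integer
class(c(2L, 2.5))    # integer + numeric   -> numeric
class(c(2.5, "a"))   # numeric + character -> character
mixed = c(TRUE, 2.5) # TRUE becomes 1
mixed
```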
A `matrix` is the two dimensional version of a vector; it also requires all data to be of the same type.
If we combine a character vector and a numeric vector into a matrix, all the data will be coerced to characters:
```{r}
x = c("A", "B", "C")
y = c(1, 2, 3)
class(x)
class(y)
m = cbind(x, y)
m
```
The quotation marks indicate that the numeric vector has been coerced to characters. Alternatively, to store data with columns of different data-types we can use a dataframe.
```{r}
z = data.frame(x, y)
z
class(z[,1])
class(z[,2])
```
If you have set stringsAsFactors=FALSE as above (or are using R 4.0.0 or later, where this is the default) you will find the first column remains characters, otherwise it will be automatically converted to a factor.
```{r}
options(stringsAsFactors=TRUE)
z = data.frame(x, y)
class(z[,1])
```
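Rather than relying on the session-wide option, the conversion can be controlled for a single data frame with the `stringsAsFactors` argument of `data.frame()`. A short sketch with illustrative variable names:

```{r}
letters_col = c("A", "B", "C")
numbers_col = c(1, 2, 3)
df_chr = data.frame(letters_col, numbers_col, stringsAsFactors = FALSE)
df_fct = data.frame(letters_col, numbers_col, stringsAsFactors = TRUE)
class(df_chr$letters_col)  # stays character
class(df_fct$letters_col)  # converted to factor
```

Setting the argument explicitly makes the result independent of the R version and of any options set earlier in the session.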
Another difference between matrices and dataframes is the ability to select columns using the `$` operator:
```{r, eval=FALSE}
m$x # throws an error
z$x # ok
```
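Matrix columns can still be selected by name or by position using square brackets. The sketch below rebuilds a small character matrix so that it runs on its own; `char_mat` is an illustrative name.

```{r}
char_mat = cbind(x = c("A", "B", "C"), y = c(1, 2, 3))
col_by_name = char_mat[, "x"]    # select a column by name
col_by_name
col_by_position = char_mat[, 2]  # select a column by position
col_by_position                  # note the numbers are stored as characters
```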
The final basic data structure is the `list`. Lists allow data of different types and different lengths to be stored in a single object. Each element of a list can be any other R object : data of any type, any data structure, even other lists or functions.
```{r}
l = list(m, z)
ll = list(sublist=l, a_matrix=m, numeric_value=42, this_string="Hello World", even_a_function=cbind)
ll
```
Lists are most commonly used when returning a large number of results from a function that do not fit into any of the previous data structures.
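As a sketch of this use, the hypothetical function `summarise_counts` below returns several results of different types and lengths in one list. Elements are extracted with `$` or double square brackets, whereas single square brackets return a shorter list.

```{r}
# Hypothetical helper returning results of different types in a single list
summarise_counts = function(counts) {
  list(total = sum(counts), detected = counts > 0, label = "example")
}
count_summary = summarise_counts(c(0, 3, 0, 12, 1))
count_summary$total            # extract an element by name
count_summary[["detected"]]    # same, using double square brackets
class(count_summary["label"])  # single brackets keep the list wrapper
class(count_summary[["label"]])
```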
## More information
You can get more information about any R command relevant to these data-types by typing `?function` in an interactive session, replacing `function` with the name of the command (e.g. `?factor`).