-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lecture 6 Factors.Rmd
133 lines (107 loc) · 2.85 KB
/
Lecture 6 Factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
---
title: "Lecture 6 Factors"
author: "David A McAllister"
date: "Monday 31st October, 2016"
output:
slidy_presentation: default
ioslides_presentation: default
---
## Objective of section
* To define a factor
* To describe the uses of a factor
* To create factors, add data to factors and change factor levels
* To combine factors together
## Factors
* R's way of handling labelled variables
* Useful in
* regression modelling
* tables
* ggplot graphs
* Awful in
* data scrubbing
* merging
## Create factor
All factors are created from character vectors
```{r, echo = FALSE}
my_char <- sample(c("a", "b","c","e"), 20, replace = TRUE)
my_char
```
* Values include a,b,c and e
* Assume "d" was possible (just not in this dataset yet)
```{r}
my_factor <- factor(my_char, levels =c("a","b","c","d","e"),
labels = c("A","B","C","D","E"))
str(my_factor)
```
## What happens
R-help "If x[i] equals levels[j], then the i-th element of the result is j."
```{r}
data.frame (original = my_char, levels = as.numeric (my_factor),
labels = as.character(my_factor))
```
## Modifying factors
```{r, eval =TRUE, warning = TRUE}
# Can't add new values
my_factor [3] <- "F"
# Even if originally present when made factor
my_factor [3] <- "d"
# Only existing levels
my_factor [3] <- "D"
```
## Create a new factor with different coding
New metadata
* elephant, buffalo, antelope and crocodile
* No donkeys
```{r}
# create new factor
my_factor2 <- factor (my_char, levels = c("e", "b", "a", "c"),
labels = paste ("new", c("E", "B","A","C")))
levels (my_factor2)
```
## Concatenating factors
In order to combine factors
* Must convert to character variables first
```{r}
naive <- c(my_factor, my_factor2)
correct <- c(as.character(my_factor), as.character(my_factor2))
#Tabulate - shows the problem
table(naive, correct)
```
## Changing reference level
Reference level
* Safe and easy, eg in regression modelling
```{r, echo = 2}
levels (my_factor)
my_factor <- relevel (my_factor, "C")
levels(my_factor)
```
## Collapsing
* but potentially dangerous
```{r, echo= 2}
my_vowels <- my_factor
levels (my_vowels) <- c("con", "vow", "con", "con", "vow")
table (my_vowels, my_factor)
as.numeric (my_factor) [1:10]
as.numeric (my_vowels) [1:10]
```
*Safest option is to use factor* levels and labels in same function
## Dropping levels
```{r}
my_factor <- factor (my_char)
table (my_factor)
table(my_factor[my_factor != "c"])
table(droplevels(my_factor[my_factor != "c"]))
```
## A factor is a numerical integer with an attribute
```{r}
attributes (my_factor)
attributes (matrix(1:9, nrow = 3))
```
```{r}
attributes (cars)
```
## Advice
* When importing data use `stringsAsFactors = FALSE`
* Used to have performance implications, not now
* Don't bother with ordered factors
* Use factors for regression modelling, explicitly set levels