-
Notifications
You must be signed in to change notification settings - Fork 0
/
Answers.Rmd
247 lines (234 loc) · 9.4 KB
/
Answers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
---
title: "Answers_new"
author: "David McAllister"
date: "October 2016"
output: pdf_document
---
# Answers 2 Objects
```{r}
# Look at the following code, guess what kind of object this will create, then
# check what is created using typeof, str() or class() 1.
str(c(35, 46, 17, 58,2))
# 2.
str(c( 34, 1, "l", TRUE))
# 3.
str(c(T,F,T,F))
# 4.
class(cbind (c(1,0,0), c(0,1,0), c(0,0,1)))
# 5.
class(cbind (c("a", "b", "c"), c (1,2,3)))
mode (cbind (c("a", "b", "c"), c (1,2,3)))
# 6.
str(data.frame (var1 = 1:10, var2 = letters[1:10], var3 = rep(1:2,5), stringsAsFactors = FALSE ))
```
# Answers 3 Working with vectors
```{r}
set.seed (1234)
# 1. Create a vector a of 10 realisations from the standard normal distribution, use rnorm (mean = 0, sd = 1)
a <- rnorm (10, mean = 0, sd = 1)
# 2. Calculate max and min
max(a) ; min (a)
# 3. Create a vector b from a standard normal distribution
b <- rnorm (10, mean = 0, sd = 1)
# 4. Take the maximum of a and b for each element
pmax (a, b); pmin (a,b)
# 5. Identify the number of elements where a > b
sum (a>b)
# R created a logical vector of length = length (a) (and length b) where each
# element of the new vector was TRUE if the corresponding element of a was
# greater than the corresponding element of b. It then coerced the logical
# vector to an integer vector and summed the values.
```
# Answers 4 Subsetting vectors
```{r}
set.seed (1234)
# 1. Create a vector, x from a poisson distribution of length 10, with lambda = 5, use rpois
x <- rpois (10, lambda = 5)
x
# 2. Select the first and last elements using an integer vector
new_x <- x[c(1L,length(x) )] # could have used a double vector, R would have coerced
print(new_x)
new_x
# 3. select all but the first element
x[-1]
# 4. Return the vector according to the numerical order
sort (x)
x[order(x)]
# Order returns indices for the ordering. These can then be used to order any
# object (of the same length), such as another vector, a dataframe etc. sort is
# only useful for that particular vector.
# 5. Use logical subsetting to select all values of x which are >5
x [x >5]
# 6. In a single line, identify the minimum value of x which is greater than 3
min (x[x>3])
# Create the vector b
b <- c(a = 1, b = 2, c = 3, d = 4, e = 5)
b
# 7. Return a vector from these elements repeating the first and second element
# 3 times and twice respectively, using the names attribute of the vector
b [c(rep("a",3), rep("b",2))]
# 8. Assign NA to all values of x >5
x [x>5] <- NA
x
# 9. Add 0.5 to any values less than 5, HINT NEED TO USE is.na and "&" operator
x [x<5 & !is.na(x)] <- x[x<5 & is.na(x) == FALSE] + 0.5
x
```
# Answers 5 Subsetting other objects
1. Run the following and answer the following questions
+ a list
+ names and the class lm. The latter means that plot.lm, summary.lm will work on it
+ 13 objects within a list
+ you get a character vector of length 1222
2. Extract the coefficients for the model using the coef(model1) function and using indexing
+ coef(model1)
+ model1$coef
3. Extract the Species factor from the model so you get an atomic vector which is a factor
+ `model1$model$Species` and `model1[["model"]][["species"]]` both give a vector (a factor).
+ anything else will give you a list or nothing
+ The following is an example of getting a list from a list
```{r, eval = FALSE}
b <- model1["model"] # this is a list of length 1 containing a dataframe
b[[1]][,2] # this extacts the second column from the dataframe nested within a list
```
4. Turn the swiss dataset into a matrix `swiss.m <- as.matrix (swiss)`
5. Select all first and third columns and second and fourth rows, use 2 methods
```{r, echo = -1}
swiss_m <- as.matrix (swiss)
swiss_m [c(2,4), c(1,3)]
swiss_m [c("Delemont", "Moutier"), c("Fertility", "Examination")]
```
6. Select a single row, first in a way which produces a vector then in a way which produces another matrix. What attributes persist with each method?
```{r}
swiss_m [1,,drop = TRUE] # column names
swiss_m [1,,drop = FALSE] # dimensions, column names and rownames
```
7. Calculate the count, mean and sum of all the values are between 25 and 30 (inclusive)
```{r}
mean (swiss_m[swiss_m >=25 & swiss_m <=30]) # etc, or with less repetition
quickfx <- function (x) c(length = length(x), mean = mean(x), sum = sum(x))
quickfx(swiss_m[swiss_m >=25 & swiss_m <=30])
# Later on we will see how we can pass a list of functions.
```
8. Select all the rows for towns with Rive in the names (use the grepl and rownames functions)
```{r}
swiss [grepl("Rive", rownames(swiss)), ]
```
9. remove all objects from workspace and create swiss as a matrix and a dataframe
```{r}
data(swiss)
swiss_m <- as.matrix(swiss)
swiss_df <- swiss
```
+ Use the following indexing, which gives same and different results
+ [[1]] yields different results, matrix - first element of vector, df - double atomic vector
+ [1] yields different results, matrix - first element of vector, df - one column dataframe
# Answers 6 working with factors
1. load in the esoph data , data (esoph) (not really necessary but in case modfied it) and look at the help file
```{r, eval = 1}
data (esoph)
?esoph
```
2. what class is each of the variables
```{r, eval = TRUE}
lapply (esoph, class)
```
3. Repace alcohol consumption "120+" with "120 to 150"
+ Using indexing with a logical test
```{r}
esoph$alcgp2 <- as.character(esoph$alcgp)
esoph$alcgp2 [esoph$alcgp2 == "120+"] <- "120 to 150"
```
+ using ifelse
```{r}
a<- ifelse(esoph$alcgp == "120+", "120 to 150", esoph$alcgp)
b<- ifelse(esoph$alcgp == "120+", "120 to 150", as.character(esoph$alcgp))
table(a,b)
```
+ In the first case R treates the replacement vector as an integer. In the second we coerce it to character vector.
4. Dichotomise alcohol consumption into >=80 or <80
```{r}
esoph$alc_dichot <- esoph$alcgp %in% c("80-119", "120+")
table(esoph$alc_dichot, esoph$alcgp)
```
5. Run a regression model on:
```{r, eval = FALSE}
model1 <- glm(cbind(ncases, ncontrols) ~ tobgp, data = esoph, family = binomial())
model2 <- glm(cbind(ncases, ncontrols) ~ factor(tobgp, ordered = FALSE), data = esoph, family = binomial())
```
+ what has R done differently?
+ Default glm behaviour with an ordered factor is to fit it as a linear, quadratic and cubic term
+ Default glm behaviour with an unordered factor is to create indicator variables in the design matrix
# Answers 7 Understanding functions
1. For c, summary, mean, glm, plot examine the help menus
![alt text](figures/answertable.png)
+ If you see a function is generic in the help menu, you can use methods() to access it. eg methods(generic.function = "plot"). You can also search to see which generic functions exist for a class, eg methods(class = "glm"). See ?methods for accessing help and the function's source codes.
6. What will R do if a function argument is not supplied and there is no default?
Nothing, unless it needs the argument for it's calculations
```{r, eval = FALSE}
myfx <- function (x,y,z) {x + y}
myfx (1,2)
myfx2 <- function (x,y,z) {x + y + z}
myfx2(1,2) # gives an error
```
3. What values will `print (c(a = a, b = b))` result in. Guess before running.
```{r, eval = TRUE}
a <- 1
b <- 4
MyFunction <- function (a, b) {
a <- a + 1
b <<- b + 1
}
MyFunction (a = 2, b = 6)
print (c(a = a, b = b))
```
# Answers 8 apply family
```{r, warning = TRUE}
# 1. In the iris dataset use tapply to calculate the mean Sepal length according to Species
tapply(iris$Sepal.Length, iris$Species, mean)
# 2. Use lapply to determine the class of each variable and then the mean of each variable. How does the R function lapply handle the character vector?
lapply (iris, class)
lapply (iris, mean)
mean(iris$Species)
# R returns an NA for the character vector
#3. For the double vectors, use apply
# to calculate row sums and column sums and compare the results with rowSums and
# colSums
iris <- iris[, sapply(iris, is.numeric) ]
## Note could have subsetted using names(iris) != "Species"
#apply (iris, 1, sum) # commented out as length = 150
apply (iris, 2, sum)
#rowSums (iris) # commented out as length = 150
colSums (iris)
```
# Answers 9 writing functions
1. Look at the esoph dataset type ?esoph into R
2. Summarise the total number of cases and controls by age group
```{r}
tapply (esoph$ncases + esoph$ncontrols, esoph$agegp, sum)
```
3. Create a variable where the value is true for heavy smokers, defined as >19 gm per day of tobacco
```{r}
esoph$heavy_smoker <- esoph$tobgp %in% c("20-29", "30+")
```
4. Check that there are heavy smokers in each age group
```{r}
tapply (esoph$ncases + esoph$ncontrols, list( esoph$heavy_smoker, esoph$agegp), sum)
```
5. Run a logistic regression model of oesophageal cancer on heavy smoking for the whole dataset (have a look at the bottom of the ?esoph page for how to do this with aggregated data)
```{r}
model1 <- glm (cbind(ncases, ncontrols) ~ heavy_smoker, data = esoph, family = binomial)
```
6. Write a function which will extract the odds ratio for smoking > 19gm per day for each age group stratum.
```{r}
ExtractOR <- function (x) {
model1 <- glm (cbind(ncases, ncontrols) ~ heavy_smoker, data = x, family = binomial)
model1$coef["heavy_smokerTRUE"]
}
ExtractOR(esoph [ esoph$agegp == "35-44", ])
by (esoph, esoph$agegp, ExtractOR)
```
7. Why couldn't we use `tapply`? look at the help for the first argument in `tapply` and `by`
+ tapply only takes a single atomic vector and we needed to work with three vectors.
8. What kind of object does `by` produce?
+ it depends. This is one of the reasons many people prefer plyr or dplyr.