-
Notifications
You must be signed in to change notification settings - Fork 0
/
seminar-1-introduction-to-R.Rmd
170 lines (107 loc) · 4.15 KB
/
seminar-1-introduction-to-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "SOS2900"
subtitle: "Seminar 1: Introduction to R and prediction"
author: "Torkild H. Lyngstad"
date: "18 januar 2018"
output:
beamer_presentation:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
# What is R?
- R is a statistical programming language
- Statistical analysis
- Data visualization
- Data mining
- General programming
- It is open-source
- Free as in “free beer” and “free speech”
- Large active community around R
- Many contributed packages
# Why do you need to know R?
- R is very powerful, and allows us to manually compute
everything we want if we choose to
- For many of the methods we use, R has either the simplest or
the only implementation.
- A main tool among Data scientists
# What do you need to know about R?
- In this course not much.
- Tiny bits of programming
- Cookbook method
- Strongly advised to learn about R (or Python) for DS
# Let's get started
- First use of R and RStudio
- We will use the environment RStudio for our work in R
- What is the difference?
- R is the program interpreter. Does the job.
- RStudio is a "front end" that makes it easier to use R.
- RStudio requires R.
# Meet RStudio!
- RStudio has 4 panels:
- Console: This is the actual R window, you can enter commands here and execute them by pressing enter
- Source: This is where we can edit scripts. It is where you do most of your work
+ Control-enter sends *selected* code to R (in the console)
+ Control-shift-enter sends *entire* script to R (in the console)
- Plots/Help: This is where plots and help pages will be shown
- Workspace: Shows which objects you currently have, working directory contents
- Anything following a # symbol is treated as a comment!
# RStudio
![](./materials/seminar-1-rstudio.png)
# R as calculator
- You can use R as a calculator
![](./materials/seminar-1-r-calculator-1.png)
# In R, everything has a name
- Assign and use objects
- The <- operator can be used to store values into objects
- An object can contain anything in R
- R expressions that are not stored in an object are printed
- Object names are...
- Unique: If you use the same name, the old will be *overwritten*
- Sensitive: MyData is *not the same* as mydata
# We can use R as a calculator
![](./materials/seminar-1-r-calculator-1.png)
# Numeric and character data
- In R, there are several types of data:
- Numeric: numbers of all kinds
- Character: one of more text character (which may be a number!)
- Multiple characters in a row = strings
- Logical: TRUE or FALSE
# Numeric data
![](./materials/seminar-1-r-numbers.png)
# Character data
![](./materials/seminar-1-r-strings.png)
# Assigning stuff to objects
![](./materials/seminar-1-assignment.png)
# How does data get into R?
- We can type it into ourselves
- Not going to work with data sets of interesting sizes.
- Use *read.csv* or to read simple text files (or *data.table::fread*)
- Convert from other software (Excel, Stata, etc.) using R's *readxl* or *haven* libraries
# Data frames (or tibbles)
- Data sets is represented as *data frames*
- R can deal with many data sets at once
- Data frames stored under different names
- Stata/SPSS: One data set at the time
- For our purposes:
- Summarize one data set using regression (old data, "training data")
- Predict on *another data set* (new data, "testing data")
# Looking at a data frame
![](./materials/seminar-1-looking.png)
# More info about R for Data science
- Enormous amount of material on internet
- Look e.g. at this free book: http://r4ds.had.co.nz/
- A full course in using R for data science
- Many, many relevant books. Examples:
- Kosuke Imai: **Quantitative data analysis**
- Kieran Healy: **Data visualization for the social sciences**
# Nano data science!
- Let us do something REAL!
- Load data set, and store it in an object
- Look at data set
- Summarize data set
- Model data using linear regression
- Predict from model using same data
- Evaluate prediction
- Model again, but better