---
title: "Reproducible Research: Peer Assessment 1"
author: "Geoffrey Anderson"
date: "For project submission 11/16/2014"
output: html_document
---
## Introduction
```{r setoptions, echo=FALSE}
#opts_chunk$set(scipen=1,digits=4)
options(scipen=1,digits=4)
```
This study uses data from a personal activity monitoring device
that recorded every footstep taken by an anonymous person.
The device collected data at 5-minute intervals throughout the day. For example,
interval value 500 is 5:00 a.m. The dataset covers two months.
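
The interval encoding can be made concrete with a small sketch. It assumes the identifier is the clock time written as HHMM without leading zeros, which is consistent with 500 meaning 5:00 a.m. The helper name `interval_to_clock` is hypothetical and is not part of the analysis below.

```r
# Hypothetical helper (not part of the original analysis): turn an interval
# identifier such as 500 into a clock label, assuming an HHMM encoding.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  # Integer division gives the hour, the remainder gives the minute.
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

labels <- interval_to_clock(c(0, 5, 500, 835, 2355))
print(labels)
```

Under this assumption the identifiers jump from 55 to 100 at each hour boundary, so a plot against the raw identifier is not evenly spaced in time.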
## Loading and processing the data
Load the data frame from the CSV file:
```{r loaddata}
library(sqldf)
library(data.table)
library(ggplot2)
library(tcltk)
df <- read.csv('activity.csv')
summary(df)
str(df)
```
Remove rows containing missing values:
```{r removemissing}
ok <- complete.cases(df)
x <- df[ok,]
dim(df)
dim(x)
df <- x
```
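
As a minimal sketch of what the row filter above does, the following uses a made-up toy data frame with the same column names as `activity.csv`. The values are invented for illustration only.

```r
# Toy data (invented for illustration), same column names as activity.csv.
toy <- data.frame(
  steps    = c(NA, 10, 0, NA, 25),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02",
               "2012-10-02", "2012-10-03"),
  interval = c(0, 5, 0, 5, 0)
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(toy))
print(na_counts)

# complete.cases() is TRUE for rows with no NA in any column.
toy_complete <- toy[complete.cases(toy), ]
print(dim(toy_complete))
```

Checking the per-column counts first shows which column drives the row loss before any rows are discarded.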
## Mean total number of steps taken per day
Computing mean and median of total number of steps per day:
```{r computecenters}
dt <- data.table(df)
dailysteps <- dt[,sum(steps),by=date]
mn <- mean(dailysteps$V1)
md <- median(dailysteps$V1)
```
Excluding incomplete observations, the mean total number of steps per day is
`r mn` and the median is `r md`.
Histogram of the total number of steps taken each day:
```{r histdailysteps}
qplot(dailysteps$V1,binwidth=2000, ylab='Number of days')
```
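
The daily totals computed above with `data.table` can also be sketched in base R. The example below uses invented toy values, not the real data, and is only meant to show the grouping step.

```r
# Toy data (invented for illustration): two observations per day.
toy <- data.frame(
  steps = c(10, 20, 30, 5, 15, 100),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02",
            "2012-10-02", "2012-10-03", "2012-10-03")
)

# Sum the steps within each date, giving one total per day.
daily_totals <- tapply(toy$steps, toy$date, sum)
print(daily_totals)

# Mean and median are then taken over the per-day totals.
toy_mean <- mean(daily_totals)
toy_median <- median(daily_totals)
print(c(mean = toy_mean, median = toy_median))
```

The mean and median here describe the distribution of daily totals, not of individual 5-minute observations.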
## Average daily activity pattern
Time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the
average number of steps taken, averaged across all days (y-axis):
```{r timeseriestypicalday}
avgsteps <- dt[, mean(steps), by=interval]
plot(avgsteps$interval, avgsteps$V1, type='l',
xlab='Which 5-minute interval of a day (500 is 5:00 a.m.)',
ylab='Average steps taken')
```
Which 5-minute interval, on average across all the days in the dataset, contains
the maximum number of steps?
```{r computemaxinterval}
max(avgsteps$V1)
themax <- avgsteps[which(max(V1)==V1)]
print(themax)
```
The answer is interval `r themax[,interval]`, which agrees with the location of
the big spike on the time series plot above.
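
The lookup of the peak interval can be sketched on a toy table as follows. The values and the column name `avg_steps` are invented for illustration.

```r
# Toy table of per-interval averages (invented values).
toy_avg <- data.frame(
  interval  = c(0, 5, 10, 15),
  avg_steps = c(1.5, 40.2, 12.0, 40.1)
)

# which.max() returns the index of the first maximum.
peak <- toy_avg[which.max(toy_avg$avg_steps), ]
print(peak)

# which(x == max(x)) returns every index that ties for the maximum.
all_peaks <- toy_avg[which(toy_avg$avg_steps == max(toy_avg$avg_steps)), ]
print(nrow(all_peaks))
```

The two forms agree whenever the maximum is unique; they differ only if several intervals tie for the maximum.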
## Imputing missing values
This section imputes missing values by replacing values of NA with **interval means.**
The steps are:

1. Find the rows having missing values.
2. Find the means of intervals by looking at the rows where nothing was missing.
3. Make a new data frame by substituting the imputed mean values of steps into
   the rows that had missing values.
4. Row-bind (union) the good rows and the imputed rows together again.
5. Compute a new daily sum of steps on the re-unioned rows.
6. Show the histogram of the new daily sum of steps.
```{r impute}
df <- read.csv('activity.csv')
dt <- data.table(df)
ok <- complete.cases(dt); #print(head(ok))
okdt <- dt[ok==TRUE,]; #print(dim(okdt))
baddt <- dt[ok==FALSE,]; #print(dim(baddt))
avgsteps <- okdt[, mean(steps), by=interval]
sqlexpr <- paste('select avgsteps.V1 as steps, baddt.date, baddt.interval ',
'from avgsteps, baddt ',
'where avgsteps.interval = baddt.interval', sep='')
fixeddt <- sqldf(sqlexpr)
reunioned <- data.table(rbind(okdt, fixeddt))
dailysteps <- reunioned[, sum(steps), by=date] # recompute on all obs together.
mni <- mean(dailysteps$V1); #print(mn)
mdi <- median(dailysteps$V1); #print(md)
qplot(dailysteps$V1, binwidth=2000, ylab='Number of days')
```
When imputed values are included instead of skipping the cases with missing data,
the mean total number of steps per day is `r mni` and the median is `r mdi`.
Compared to the same computations made on complete cases only, the mean changed
by `r mni-mn` and the median by `r mdi-md`.
Expressed as a percentage, the impact of imputing missing data on the estimates
of the total daily number of steps is `r 100*(mni-mn)/mn`% for the mean and
`r 100*(mdi-md)/md`% for the median. The mode, the highest bar in the histogram,
grew substantially: its count rose from about 15 days to about 25 days after
imputation. The other bars' heights were unchanged. The imputation assigned
interval means where values had been missing, so days made up largely of imputed
values have totals close to the mean daily total, and a taller center bar makes
good sense on the histogram of the imputed data set.
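
Interval-mean imputation as described in this section can also be sketched without SQL. The example below is an alternative in base R using `ave()`, not the method used in the chunk above. The toy values and the column name `steps_imputed` are invented for illustration.

```r
# Toy data (invented for illustration): two intervals observed on three days.
toy <- data.frame(
  steps    = c(NA, 10, 4, 20, 8, NA),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02",
               "2012-10-02", "2012-10-03", "2012-10-03"),
  interval = c(0, 5, 0, 5, 0, 5)
)

# For every row, the mean of the non-missing steps in that row's interval.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Substitute the interval mean only where steps is missing.
toy$steps_imputed <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
print(toy)
```

This variant keeps the rows in their original order, whereas the join followed by a row-bind appends the imputed rows after the complete ones. The daily sums are the same either way.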
## Differences in activity patterns between weekdays and weekends
Below is a graphic that I am adding, above and beyond the project requirement.
It overlays the weekend and weekday curves. In my opinion, this graphic most
effectively highlights where the weekday and weekend curves differ from each
other. Weekends have the most steps in the afternoon, while weekdays have the
most steps in the morning.
```{r weekends}
iswe <- weekdays(as.Date(reunioned$date)) %in% c('Saturday', 'Sunday')
w <- ifelse(iswe, 'weekend', 'weekday')
reunioned$weekend <- as.factor(w);
dt <- data.table(reunioned)
avgsteps <- dt[, mean(steps), by=list(weekend,interval)]; #print (avgsteps)
qplot(interval, V1, color=weekend, data=avgsteps, geom=c('line', 'smooth'),
method='loess',
xlab='Which 5-minute interval of the day', ylab='Average number of steps')
```
Below is the project-required graphic. It puts the weekend and weekday curves
each into their own panel.
```{r weekendsrequired}
qplot(interval, V1, facets=weekend~., data=avgsteps, geom=c('line'),
xlab='Which 5-minute interval of the day', ylab='Average number of steps')
```
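
One caveat on the weekend flag: `weekdays()` returns day names in the current locale, so matching against 'Saturday' and 'Sunday' assumes an English locale. A minimal locale-independent sketch, using invented example dates, is:

```r
# Example dates (chosen for illustration): Friday through Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the day of week as a number: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(dates)$wday

# Label each date without relying on locale-specific day names.
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(data.frame(date = dates, day_type = day_type))
```

The numeric day of week does not depend on the language settings of the machine that knits the report.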