<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Capstone Project</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Capstone Project</h1>
<p>This webpage is the product of an RMarkdown document. Most of the R code used to produce it is viewable here, but if you would like to see the raw .Rmd file to see the use of inline code, RMarkdown options, etc., you can view or download the source <a href="https://raw.githubusercontent.com/data-lessons/gapminder-R/gh-pages/11-capstone_solutions.Rmd">here</a>.</p>
<h1 id="setup">Setup</h1>
<p>Set some global document properties and load the data and some useful packages.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
<span class="kw">library</span>(Lahman)
<span class="kw">data</span>(<span class="st">'Batting'</span>)
<span class="kw">data</span>(<span class="st">'Salaries'</span>)</code></pre></div>
<h4 id="getting-acquanited">Getting acquainted</h4>
<blockquote>
<p>Explore the two data.frames. Write a short summary. What time periods do they cover? How many players are in the dataset? What is the maximum recorded salary?</p>
</blockquote>
<p>We can calculate all of these statistics directly in the text of our writeup…</p>
<p>The batting data range from 1871 to 2015, while the salary data start in 1985.</p>
<!-- There are 223251 missing data points in the Batting dataset, which represents 10% of the cells in the table. -->
<p>There are 18659 players in the batting dataset, and 4963 in the salary dataset.</p>
<p>The maximum recorded salary is $33,000,000, earned by rodrial01 in 2009.</p>
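<p>As a rough sketch of how values like these can be computed, the base R calls below derive a year range, a player count, and a top salary. The two small data frames are made-up stand-ins (hypothetical values, not Lahman data) so that the snippet runs on its own; expressions of this kind can be placed in inline RMarkdown chunks.</p>

```r
# Made-up stand-ins for Batting and Salaries (hypothetical values)
toy_batting <- data.frame(
  playerID = c("a01", "a01", "b01", "c01"),
  yearID   = c(1990, 1991, 1991, 1993)
)
toy_salaries <- data.frame(
  playerID = c("a01", "b01", "c01"),
  yearID   = c(1991, 1991, 1993),
  salary   = c(500000, 750000, 1200000)
)

# Time period covered: first and last season present
year_range <- range(toy_batting$yearID)

# Number of distinct players
n_players <- length(unique(toy_batting$playerID))

# Row holding the maximum salary, so the player and year come along with it
top_paid <- toy_salaries[which.max(toy_salaries$salary), ]
```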
<h4 id="batting-averages">Batting averages</h4>
<p>To approximate a player’s batting average we divide the number of hits they got by their number of at bats. Then let’s glance at the top of our data.frame to make sure the numbers look reasonable.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">Batting <-<span class="st"> </span><span class="kw">mutate</span>(Batting, <span class="dt">BA_approx =</span> H /<span class="st"> </span>AB)
<span class="kw">head</span>(<span class="kw">select</span>(Batting, H, AB, BA_approx))</code></pre></div>
<pre><code>## H AB BA_approx
## 1 0 4 0.0000000
## 2 32 118 0.2711864
## 3 40 137 0.2919708
## 4 44 133 0.3308271
## 5 39 120 0.3250000
## 6 11 49 0.2244898</code></pre>
<p>Using the <code>battingStats</code> function that comes with the <code>Lahman</code> package, we can calculate players’ actual batting averages (accounting for things like at-bats where the batter got on base without getting a hit, for example).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">Batting <-<span class="st"> </span><span class="kw">battingStats</span>(Batting)</code></pre></div>
<p>Let’s take a look at how our approximation of batting average compares with the actual statistic. We’ll draw a 1:1 line for comparison.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(Batting, <span class="kw">aes</span>(<span class="dt">x =</span> BA, <span class="dt">y =</span> BA_approx)) +<span class="st"> </span>
<span class="st"> </span><span class="kw">geom_point</span>() +<span class="st"> </span>
<span class="st"> </span><span class="kw">geom_abline</span>(<span class="dt">slope =</span> <span class="dv">1</span>, <span class="dt">intercept =</span> <span class="dv">0</span>, <span class="dt">color =</span> <span class="st">'red'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/BA%20approximation%20vs%20actual-1.png" alt="plot of chunk BA approximation vs actual" />
<p class="caption">plot of chunk BA approximation vs actual</p>
</div>
<p>Wow, it is very close! Are they exactly the same? There are a few ways we could look at that. Let’s plot the distribution of ratios of the statistics. Where that ratio is one, they are the same.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(Batting, <span class="kw">aes</span>(<span class="dt">x =</span> BA /<span class="st"> </span>BA_approx)) +<span class="st"> </span>
<span class="st"> </span><span class="kw">geom_density</span>(<span class="dt">fill =</span> <span class="st">'lightblue'</span>) </code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/BA%20density-1.png" alt="plot of chunk BA density" />
<p class="caption">plot of chunk BA density</p>
</div>
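<p>Another way to check is to compare the two columns numerically. The sketch below uses three rows copied from the triple crown output further down this page and assumes, as those printed values suggest, that <code>BA</code> is the hits/at-bats ratio rounded to three decimals.</p>

```r
# Three rows copied from the triple crown output further down this page
H  <- c(182, 189, 205)
AB <- c(576, 579, 622)
BA <- c(0.316, 0.326, 0.330)

# Our approximation, unrounded
BA_approx <- H / AB

# TRUE if the two agree once the approximation is rounded to three decimals
same_after_rounding <- isTRUE(all.equal(round(BA_approx, 3), BA))

# Largest absolute difference; rounding alone keeps this at or below 0.0005
largest_gap <- max(abs(BA - BA_approx))
```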
<h4 id="batting-averages-over-time">Batting averages over time</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(Batting, <span class="kw">aes</span>(<span class="dt">x =</span> yearID, <span class="dt">y =</span> BA)) +
<span class="st"> </span><span class="kw">geom_point</span>() </code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/BA%20over%20time-1.png" alt="plot of chunk BA over time" />
<p class="caption">plot of chunk BA over time</p>
</div>
<p>What a mess! There is so much over-plotting we can’t see where there is more data versus less. Also, the fact that there are so many 0, 0.5, and 1.0 entries suggests there are many data points with a small number of at bats. Let’s filter to only those players who had more than 50 at bats, add some transparency to the points, and fit a linear trend line.</p>
<p>It looks like batting averages have stayed pretty steady over time.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">filter</span>(Batting, AB ><span class="st"> </span><span class="dv">50</span>) %>%
<span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> yearID, <span class="dt">y =</span> BA)) +
<span class="st"> </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> .<span class="dv">1</span>) +
<span class="st"> </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">'lm'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/filtered%20BA%20over%20time-1.png" alt="plot of chunk filtered BA over time" />
<p class="caption">plot of chunk filtered BA over time</p>
</div>
<h4 id="home-run-kings">Home run kings</h4>
<p>Here are the top-five career home run hitters. Calculating number of seasons with <code>n()</code> will get close, but if a player played for multiple teams within a year, they will have multiple entries for that year.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">group_by</span>(Batting, playerID) %>%
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">careerHR =</span> <span class="kw">sum</span>(HR),
<span class="dt">seasons =</span> <span class="kw">length</span>(<span class="kw">unique</span>(yearID))) %>%
<span class="st"> </span><span class="kw">arrange</span>(<span class="kw">desc</span>(careerHR)) %>%
<span class="st"> </span><span class="kw">head</span>(<span class="dt">n =</span> <span class="dv">5</span>)</code></pre></div>
<pre><code>## # A tibble: 5 × 3
## playerID careerHR seasons
## &lt;chr&gt; &lt;int&gt; &lt;int&gt;
## 1 bondsba01 762 22
## 2 aaronha01 755 23
## 3 ruthba01 714 22
## 4 mayswi01 660 22
## 5 rodrial01 654 20</code></pre>
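<p>To see why <code>n()</code> can overcount seasons, here is a small sketch on made-up rows (the player IDs and numbers are hypothetical). One player has two rows for 2001, as would happen after a mid-season trade; <code>n_distinct(yearID)</code> is the <code>dplyr</code> shorthand for <code>length(unique(yearID))</code>.</p>

```r
library(dplyr)

# Hypothetical player x01 was traded mid-season in 2001: two rows for that year
toy <- data.frame(
  playerID = c("x01", "x01", "x01", "y01"),
  yearID   = c(2000, 2001, 2001, 2000),
  HR       = c(10, 4, 6, 3)
)

career <- toy %>%
  group_by(playerID) %>%
  summarize(careerHR = sum(HR),
            rows     = n(),                 # counts stints, not seasons
            seasons  = n_distinct(yearID))  # counts distinct years
```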
<blockquote>
<p>What is the most home runs in a single season?</p>
</blockquote>
<p>If you are willing to ignore the fact that some players play for more than one team in a year (and so have multiple entries in one season), you can just sort the data.frame by HR (descending).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">arrange</span>(Batting, <span class="kw">desc</span>(HR)) %>%
<span class="st"> </span><span class="kw">select</span>(playerID, yearID, HR) %>%
<span class="st"> </span><span class="kw">head</span>(<span class="dt">n =</span> <span class="dv">10</span>)</code></pre></div>
<pre><code>## playerID yearID HR
## 1 bondsba01 2001 73
## 2 mcgwima01 1998 70
## 3 sosasa01 1998 66
## 4 mcgwima01 1999 65
## 5 sosasa01 2001 64
## 6 sosasa01 1999 63
## 7 marisro01 1961 61
## 8 ruthba01 1927 60
## 9 ruthba01 1921 59
## 10 foxxji01 1932 58</code></pre>
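<p>To avoid having to make that assumption, we need to group the rows for the same player in the same year and sum their home runs. Because <code>summarize</code> removes only the last grouping level, the result is still grouped by <code>playerID</code>. Older versions of <code>dplyr</code> would then arrange within each group, so we <code>ungroup</code> first; recent versions ignore grouping in <code>arrange</code> unless <code>.by_group = TRUE</code>, and the explicit <code>ungroup</code> does no harm. Here though, the assumption doesn’t do any harm: the top ten are unchanged.</p>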
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">group_by</span>(Batting, playerID, yearID) %>%
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">yearHR =</span> <span class="kw">sum</span>(HR)) %>%
<span class="st"> </span><span class="kw">ungroup</span>() %>%
<span class="st"> </span><span class="kw">arrange</span>(<span class="kw">desc</span>(yearHR)) %>%
<span class="st"> </span><span class="kw">head</span>(<span class="dt">n =</span> <span class="dv">10</span>)</code></pre></div>
<pre><code>## # A tibble: 10 × 3
## playerID yearID yearHR
## &lt;chr&gt; &lt;int&gt; &lt;int&gt;
## 1 bondsba01 2001 73
## 2 mcgwima01 1998 70
## 3 sosasa01 1998 66
## 4 mcgwima01 1999 65
## 5 sosasa01 2001 64
## 6 sosasa01 1999 63
## 7 marisro01 1961 61
## 8 ruthba01 1927 60
## 9 ruthba01 1921 59
## 10 foxxji01 1932 58</code></pre>
<h4 id="batting-salaries">Batting &amp; Salaries</h4>
<blockquote>
<p>We want to examine how batting ability relates to salaries earned. For only the entries where salary data is available, join the two data.frames.</p>
</blockquote>
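<p>To keep only the entries that have a salary, either use <code>left_join</code> and make Salaries the first table, or use <code>right_join</code> and make Batting the first table. Both key columns go in a single <code>by</code> vector; given as two separate strings, only <code>"playerID"</code> would be used as the key, because the next positional argument of the join functions is not a key.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">battingSalaries <-<span class="st"> </span><span class="kw">right_join</span>(Batting, Salaries, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">"playerID"</span>, <span class="st">"yearID"</span>))</code></pre></div>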
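<p>The choice of join keys matters a great deal for the size of the result. The sketch below uses made-up rows (hypothetical IDs and values) to compare a join on player and year with a join on player alone.</p>

```r
library(dplyr)

# Made-up stand-ins for Batting and Salaries (hypothetical values)
toy_batting <- data.frame(
  playerID = c("a01", "a01", "b01"),
  yearID   = c(2000, 2001, 2000),
  HR       = c(10, 20, 5)
)
toy_salaries <- data.frame(
  playerID = c("a01", "a01"),
  yearID   = c(2000, 2001),
  salary   = c(100, 200)
)

# Both keys: each salary row meets only the batting row for the same season
both_keys <- right_join(toy_batting, toy_salaries,
                        by = c("playerID", "yearID"))

# Player key only: every salary row pairs with every season of that player.
# suppressWarnings() hides the many-to-many notice that newer dplyr emits.
player_only <- suppressWarnings(
  right_join(toy_batting, toy_salaries, by = "playerID")
)
```

<p>With these toy rows, <code>both_keys</code> has one row per salary record, while <code>player_only</code> has one row per combination of a salary record and a batting season for the same player, with the year split into <code>yearID.x</code> and <code>yearID.y</code>.</p>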
<blockquote>
<p>The three components of the batting triple crown are batting average, runs batted in (RBI), and home runs. Plot salary against each of the three statistics. Which appears to have the strongest relationship with a player’s salary?</p>
</blockquote>
<p>There are two ways to do this; either make a separate plot for each of the three statistics (we’ll write a function to do that to avoid typing the whole thing three times), or tidy the data across statistics, making one column for the name of the statistic and another for its value, and then faceting the plot by statistic. Here’s how to do each:</p>
<h5 id="separate-plots">Separate plots</h5>
<p>Let’s write a function that takes the <code>battingSalaries</code> data.frame and a batting statistic and plots the statistic against salary earned, with some transparency to help with over-plotting, and add a linear trend line.</p>
<ul>
<li>There is one tricky new thing here. When a function receives a column name as a character string (e.g. <code>'RBI'</code>) and passes it on to <code>ggplot</code>, plain <code>aes</code> does not interpret that string as a column name, so you have to use <code>aes_string</code> instead. Inside <code>aes_string</code>, write variables that hold a column name without quotes, and write literal column names in quotes. The mechanism is beyond the scope of this workshop, but it has to do with “non-standard evaluation.” If you want to know more, search for that phrase along with ggplot. The same thing applies to using <code>dplyr</code> within functions: you can use <code>filter_</code>, <code>arrange_</code>, etc. (these underscore variants are deprecated in later <code>dplyr</code> releases).</li>
</ul>
<p>All three statistics appear to be associated with higher salaries.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">plotVsSalary <-<span class="st"> </span>function(statistic, <span class="dt">df =</span> battingSalaries) {
<span class="kw">ggplot</span>(df, <span class="kw">aes_string</span>(<span class="dt">x =</span> statistic, <span class="dt">y =</span> <span class="st">'salary'</span>)) +
<span class="st"> </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> .<span class="dv">1</span>) +
<span class="st"> </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">'lm'</span>)
}
<span class="kw">plotVsSalary</span>(<span class="dt">statistic =</span> <span class="st">'BA'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/batting%20stats%201-1.png" alt="plot of chunk batting stats 1" />
<p class="caption">plot of chunk batting stats 1</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plotVsSalary</span>(<span class="dt">statistic =</span> <span class="st">'RBI'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/batting%20stats%201-2.png" alt="plot of chunk batting stats 1" />
<p class="caption">plot of chunk batting stats 1</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plotVsSalary</span>(<span class="dt">statistic =</span> <span class="st">'HR'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/batting%20stats%201-3.png" alt="plot of chunk batting stats 1" />
<p class="caption">plot of chunk batting stats 1</p>
</div>
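<p>The helper above relies on <code>aes_string</code>, which <code>ggplot2</code> has soft-deprecated since version 3.0.0. A minimal sketch of the same idea using the <code>.data</code> pronoun follows; the function name <code>plot_vs_salary</code> and the toy data frame are hypothetical and only there to make the snippet self-contained.</p>

```r
library(ggplot2)

# Hypothetical variant of plotVsSalary: `statistic` is a string naming a column,
# and .data[[statistic]] looks that column up in the plot data.
plot_vs_salary <- function(statistic, df) {
  ggplot(df, aes(x = .data[[statistic]], y = salary)) +
    geom_point(alpha = 0.1) +
    geom_smooth(method = "lm")
}

# Made-up values, just enough rows to fit a line
toy <- data.frame(
  HR     = c(5, 20, 35, 40),
  salary = c(1e6, 3e6, 8e6, 1.2e7)
)

p <- plot_vs_salary("HR", toy)
```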
<h5 id="faceted-plots">Faceted plots</h5>
<p>Alternatively, you could use <code>tidyr</code>’s <code>gather</code> to put all the batting statistics in one column, and then facet by the batting statistic. In this case, use the <code>scales</code> argument to <code>facet_wrap</code> to avoid plotting batting averages (0 - 1) on the same scale as home runs (0 - 73), etc.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyr)
<span class="kw">gather</span>(battingSalaries, stat, value, BA, RBI, HR) %>%
<span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> value, <span class="dt">y =</span> salary)) +<span class="st"> </span>
<span class="st"> </span><span class="kw">geom_point</span>(<span class="dt">alpha =</span> .<span class="dv">1</span>) +
<span class="st"> </span><span class="kw">geom_smooth</span>(<span class="dt">method =</span> <span class="st">'lm'</span>) +
<span class="st"> </span><span class="kw">facet_wrap</span>(~<span class="st"> </span>stat, <span class="dt">scales =</span> <span class="st">'free_x'</span>)</code></pre></div>
<div class="figure">
<img src="fig/capstoneSolutions/batting%20stats%202-1.png" alt="plot of chunk batting stats 2" />
<p class="caption">plot of chunk batting stats 2</p>
</div>
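<p>In newer versions of <code>tidyr</code>, <code>gather</code> is superseded by <code>pivot_longer</code>. The sketch below shows the equivalent reshaping step on a made-up data frame (hypothetical values); the resulting <code>stat</code> and <code>value</code> columns can be plotted and faceted exactly as above.</p>

```r
library(tidyr)

# Made-up stand-in for battingSalaries (hypothetical values)
toy <- data.frame(
  playerID = c("a01", "b01"),
  salary   = c(1e6, 2e6),
  BA       = c(0.250, 0.300),
  RBI      = c(60, 95),
  HR       = c(12, 30)
)

# One row per player and statistic: names go to `stat`, numbers to `value`
long <- pivot_longer(toy, cols = c(BA, RBI, HR),
                     names_to = "stat", values_to = "value")
```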
<blockquote>
<p>Run a multiple linear regression of salary on the three batting statistics. Are the results of the model consistent with the conclusions from your plots?</p>
</blockquote>
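<p>No! After accounting for home runs, it looks like batting average and RBI are <em>negatively</em> associated with salary! Of course, there are many factors to consider before concluding that players should start getting fewer hits if they want to make more money (e.g. filtering players with very few at-bats or weighting data points by the number of at-bats, accounting for clustering at the team level, filtering pitchers who are paid for skills other than batting, etc.). Note also that the 231,466 residual degrees of freedom reported below far exceed the number of player-seasons with a recorded salary, which is what a join matched on <code>playerID</code> alone produces, so the coefficients should be read as illustrative.</p>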
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">lm</span>(salary ~<span class="st"> </span>HR +<span class="st"> </span>BA +<span class="st"> </span>RBI, <span class="dt">data =</span> battingSalaries) %>%
<span class="st"> </span><span class="kw">summary</span>()</code></pre></div>
<pre><code>##
## Call:
## lm(formula = salary ~ HR + BA + RBI, data = battingSalaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10039137 -1864384 -1344822 653037 31130406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2146809.8 15994.8 134.219 < 2e-16 ***
## HR 120590.2 2041.3 59.076 < 2e-16 ***
## BA -1329219.2 71245.6 -18.657 < 2e-16 ***
## RBI -3027.5 619.2 -4.889 1.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3684000 on 231466 degrees of freedom
## (62525 observations deleted due to missingness)
## Multiple R-squared: 0.07447, Adjusted R-squared: 0.07446
## F-statistic: 6208 on 3 and 231466 DF, p-value: < 2.2e-16</code></pre>
<h4 id="advanced-triple-crown-winners">Advanced: triple crown winners</h4>
<blockquote>
<p>To win the triple crown is to have the most home runs and RBI and the highest batting average in a league for a year. Since 1957, only batters with at least 502 at-bats are eligible for the highest batting average. There have been three triple crown winners since 1957 – can you identify them?</p>
</blockquote>
<p>This might be a good candidate for the type of problem where it’s useful to map what you want to do before you start writing code. Here is one way to attack this problem:</p>
<ol style="list-style-type: decimal">
<li>Filter to the eligible players and years of interest (<code>filter</code>)</li>
<li>Group by year and league and identify the maximal values for each of the three statistics in each of the groups (<code>group_by %>% summarize</code>)</li>
<li>Add columns for the maximal values to the original data.frame (<code>X_join</code>)</li>
<li>Filter to rows where the player’s value equals the maximal value for each of the three statistics (<code>filter</code>)</li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">best =<span class="st"> </span>
<span class="st"> </span><span class="kw">filter</span>(Batting, AB >=<span class="st"> </span><span class="dv">502</span>, yearID >=<span class="st"> </span><span class="dv">1957</span>) %>%
<span class="st"> </span><span class="kw">group_by</span>(yearID, lgID) %>%
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">topBA =</span> <span class="kw">max</span>(BA),
<span class="dt">topRBI =</span> <span class="kw">max</span>(RBI),
<span class="dt">topHR =</span> <span class="kw">max</span>(HR))
withBest =<span class="st"> </span><span class="kw">right_join</span>(Batting, best, <span class="dt">by =</span> <span class="kw">c</span>(<span class="st">"yearID"</span>, <span class="st">"lgID"</span>))
<span class="kw">filter</span>(withBest, BA ==<span class="st"> </span>topBA &<span class="st"> </span>RBI ==<span class="st"> </span>topRBI &<span class="st"> </span>HR ==<span class="st"> </span>topHR)</code></pre></div>
<pre><code>## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 robinfr02 1966 1 BAL AL 155 576 122 182 34 2 49 122 8 5
## 2 yastrca01 1967 1 BOS AL 161 579 112 189 31 4 44 121 10 8
## 3 cabremi01 2012 1 DET AL 161 622 109 205 40 0 44 139 4 1
## BB SO IBB HBP SH SF GIDP BA_approx BA PA TB SlugPct OBP OPS
## 1 87 90 11 10 0 7 24 0.3159722 0.316 680 367 0.637 0.410 1.047
## 2 91 69 11 4 1 5 5 0.3264249 0.326 680 360 0.622 0.418 1.040
## 3 66 98 17 3 0 6 28 0.3295820 0.330 697 377 0.606 0.393 0.999
## BABIP topBA topRBI topHR
## 1 0.300 0.316 122 49
## 2 0.308 0.326 121 44
## 3 0.331 0.330 139 44</code></pre>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>