-
Notifications
You must be signed in to change notification settings - Fork 0
/
Federal_Election_Commitee_Contributions_Analysis.Rmd
297 lines (275 loc) · 16.7 KB
/
Federal_Election_Commitee_Contributions_Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
title: "2016 Presidential Contributions"
author: "Zohaib Anwar"
date: "August 12, 2017"
output: html_document
---
#Introduction#
In this study I look at data from the 2016 Presidential election by candidate, total contributions, location of, and number of contributors. For each of these variables I try to parse the data and create some visualizations that sum up the data well. I end up creating a statististical summary with bar and line plots, overview of the data by plotting, on a world map, some of the aforementioned varibles, and a general box plot to differentiate between gender and party in the election.
```{r}
setwd("C:/USers/Zohaib/Desktop/Lectures/Udacity/R")
#Setting the correct working directory.
Contributions<-read.csv("ContributionsEdited.csv",header = TRUE,na.strings="",
encoding="UTF-8") #Reading in the data.
```
```{r,message=FALSE}
library(plyr) #Setting up the libraries for the following code.
library(ggplot2)
library(gridExtra)
library(maps)
library(ggmap)
library(devtools)
library(dplyr)
```
```{r}
for (i in c("cand_nm","election_tp")){print(sort(table(Contributions[i]),
decreasing=TRUE))}
# Gives table counts for each candidate and election.
Contributions$names<-lapply(as.character(Contributions$cand_nm),
function(x) strsplit(x,",")[[1]][1])
Contributions$names<-factor(Contributions$names,levels=sort(
as.vector(as.character(unique((Contributions$names))))))
#Creates and easier to use name variable, with just last names.
```
As expected, the most number of the contributions are going to the popular choice, Clinton, the fundraising upset Sanders, and election winner Trump. For the rest of the individuals, one can get a sense of how the election turned out just by looking at the differing number of contributions.
```{r, fig.width=15}
plotCont<-subset(Contributions,election_tp=="P2016"|election_tp=="G2016") #Creates data set to focus on 2016 election.
p1<-ggplot(aes(x=names,y=contb_receipt_amt,group=1),
data=plotCont)+
geom_bar(aes(fill='red'),stat="summary",fun.y=mean)+ #bar plot for mean of contbributions by candidate
facet_wrap(~election_tp,nrow = 2)+ #split into the primary and general elections.
geom_point(alpha=.5,size=.75,stat="summary",fun.y=median)+
geom_line(color='purple',stat="summary",fun.y=median)+ #line plot for median contributions.
labs(x="Candidate Names",y="Mean Contribution Amount ($)")+
theme(plot.title = element_text(size=22))+
guides(fill=FALSE)
p2<-ggplot(aes(x=names,y=contb_receipt_amt/1000000,group=1),
data=plotCont)+
ylab("Contribution Total ($ mil)")+ #line plot for sum of contributions by candidate split by primary and
ggtitle("Contribution by Candidate")+ #general elections.
geom_line(size=1.25,color='steelblue',stat="summary",fun.y=sum)+
theme(axis.title.x=element_blank())+
facet_wrap(~election_tp,nrow=2)+
guides(fill=FALSE)
grid.arrange(p2,p1)
```
From this one can notice something odd about the mean contributions as some of the individuals you would expect to have high mean and/or median contributions do not! Clinton, Sanders, and Trump all have low amounts in comparison.The three, on the other hand, do obviously have the highest total contributions, but one should find it very odd that some of the prominent candidates had low statistics, while someone like Jindal had or Lessig both had very high amounts. Then there are the negative means (most likely due to reimbursements), and contributions in the general election for people that were not even in the race such as Lessig who has a high mean amount of contributions then. Yet, when one looks at the count for Lessign he only has 1339 contributions so these numbers are a little less concerning.
```{r}
ggplot(aes(x=contbr_occupation,y=contb_receipt_amt),
data=plotCont)+
geom_point(stat="summary",fun.y=mean)
ggplot(aes(x=contbr_city,y=contb_receipt_amt), #Two plots trying to create scatter plots by city and occupation versus
data=plotCont)+ #receipt amounts, but this obviously does not seem like the best plots.
geom_point(stat="summary",fun.y=mean)
```
In an attempt to determine some information from occupation of the contributors and cities, I plot these variables against the mean contribution amount for them but these plots do not really say anything, and it is near impossible to do anything worth-while with them.
```{r,eval=FALSE}
Contributions$Gender<-NA
Contributions$Party<-NA
Males=c("Rubio","Santorum","Perry","Carson","Cruz","Paul","Sanders","Huckabee",
"Pataki","O'Malley","Graham","Bush","Trump","Jindal","Christie",
"Walker","Webb","Kasich","Gilmore","Lessig","Johnson","McMullin")
Females=c("Clinton","Fiorina","Stein")
Republicans=c("Rubio","Santorum","Perry","Carson","Cruz","Fiorina","Paul","Huckabee",
"Pataki","O'Malley","Graham","Bush","Trump","Jindal","Christie",
"Walker","Kasich","Gilmore")
Democrats=c("Clinton","Sanders","Webb","Lessig")
Others=c("Stein","Johnson","McMullin")
#Creating Gender and Party variables based off of a list of the names that
#fall in each category.
for (i in Contributions$names){
if (i %in% Males) {
Contributions$Gender="M"
}
else if (i %in% Females) {
Contributions$Gender="F"
}
else {
Contributions$Gender=NA}}
for (i in Contributions$names){
if (i %in% Republicans){
Contributions$Party="R"
}
else if (i %in% Democrats) {
Contributions$Party="D"
}
else if (i %in% Others){
Contributions$Party="O"
}
else {
Contributions$Party=NA
}}
```
```{r}
for (i in unique(Contributions$gender)){
print(c(i,quantile(x=Contributions[
Contributions$gender==i,]$contb_receipt_amt,probs = .85)))
} #calculates 85th percentile for contributions by gender.
for (i in unique(Contributions$party)){
print(c(i,quantile(Contributions[
Contributions$party==i,]$contb_receipt_amt,probs = .85)))
} #calculates 85th percentile for contributions by party.
```
This data is just to get a sense of the data for the following plots, which need scaling manipulations.
```{r}
ggplot(aes(x=gender,y=contb_receipt_amt),data=plotCont)+
geom_boxplot(aes(alpha=.1))+
coord_cartesian(ylim=c(quantile(Contributions[
Contributions$gender=="F",]$contb_receipt_amt,probs = .25),
quantile(Contributions[
Contributions$gender=="F",]$contb_receipt_amt,
probs = .95)))+
guides(fill=FALSE)
ggplot(aes(x=party,y=contb_receipt_amt),data=plotCont)+
geom_boxplot(aes(alpha=.1))+
coord_cartesian(ylim=c(quantile(
Contributions[Contributions$party=="R",]$contb_receipt_amt,probs=.25),
quantile(
Contributions[
Contributions$party=="R",]$contb_receipt_amt,
probs=.95)))+
guides(fill=FALSE) #plotting boxplots for both contributions by gender and party with the 25th and 95th percentiles
#used to give a better picture of the plots.
```
In order to view differences across party and gender I create boxplots for them, one of the best ways to visually check. Here differences between gender and party are shown in terms of contributions, although, oddly the "Other" party types have a higher median amount of contributions when compared to Republicans and Democrats, although both of the latter have very high outliers throughout the plots. This is probably due to the lower count of contributions that was donated to the Other parties.
```{r}
namelist<-unique(Contributions$names)
for (i in namelist){
assign(i,subset(plotCont,names==i)%>% #Assigns each candidate name to a dataframe of cities and counts.
group_by(names,contbr_city,contbr_st) %>% #Grouping by city and state to get counts for each for each candidate.
summarize(n=n()) %>%
ungroup() %>%
ungroup())
i<-arrange(get(i),desc(n))
print(head(i,n = 10)) #printing tables of the top ten cities in terms of count of contributions from there.
}
```
I feel this is an extremely informative presentation of the data, as one can determine the top cities where each candidate received funds from. This information is highly valuable.Looking at just Clinton, she got the most from New York, D.C., and Los Angeles, Trump from HOuston, Dallas and Nevada (southern states) and Sanders from New York, Seattle, and Los Angeles.
```{r,message=FALSE,warning=FALSE}
for (i in namelist){a<-get(i)
a$location<-paste(as.character(a$city),as.character(
a$contbr_city))
a$lat<- sapply(a$location, #Getting the longitude and latitude for each
function(x) #candidate.
if (!is.na(x)){
geocode(x,source ="dsk",
messaging = FALSE)$lat}
else {NA})
a$lon<- sapply(a$location,
function(x)
if (!is.na(x)){
geocode(x,source ="dsk",
messaging = FALSE)$lon}
else {NA})
assign(i,a) #Assigning the new information to the candidate dataframes.
remove(a)
}
```
```{r}
for (i in namelist){
map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90),
mar=c(0,0,0,0))
points(x=get(i)$lon,y=get(i)$lat,col="red")
title(main=paste(i," Contributions"))}
#Plotting each candidate dataframe by city.
```
This visualization of contributions by city looks fine, but I believe this can be done in a somewhat more aesthetically pleasing way. In these first two attempts all candidate maps are displayed, just to allow for any viewer to be able to view the information for any candidate, as well as notice some of the striking differences such as Rubio
getting quite a few donations from Europe.
```{r,warning=FALSE}
for (i in namelist){
print(ggplot()+borders("world",colour="gray50",fill="gray50")+
geom_point(data=get(i),mapping=aes(x=lon,y=lat),
col='blue')+
ggtitle(paste(i," Contributions"))
#using borders and ggplot to plot the data.
)
}
```
This looks better but at this point, the data can definitely be fine-tuned using other variables such as the sum of contributions.
```{r}
for (i in namelist){
a<-get(i)
a$sum_of_cont<-(subset(plotCont,names==i)%>%
group_by(names,contbr_city,contbr_st) %>% #Again grouping by city and state, but to get sum of contributions
summarize(sum=sum(contb_receipt_amt), #by city this time.
n=n()) %>%
ungroup() %>%
ungroup())$sum
assign(i,a)
rm(a)
}
```
```{r,warning=FALSE,fig.width=15}
for (i in c("Clinton","Sanders","Trump")){
print(ggplot(aes(x=lon,y=lat),data=get(i))+borders("world",colour="gray50",fill="gray50")+
geom_point(alpha=.75,aes(col=get(i)$n,size=get(i)$sum_of_cont/1000))+
scale_color_gradient2(low = "red", mid = "white", high = "blue")) #Redoing the maps with size and color affected
} #by sum and count.
```
These plots definitely look a lot better and contain a lot more information, and I restrict the maps to only the main candidates!
#Final Visualizations
The summary for each candidate by mean, median, and sum of contributions, faceted by whether it was the primary or general elections was definitely a useful plot-- reproduced below:
```{r, fig.width=15}
p1<-ggplot(aes(x=names,y=contb_receipt_amt,group=1),
data=plotCont)+
geom_bar(aes(fill='red'),stat="summary",fun.y=mean)+ #bar plot for mean of contbributions by candidate
facet_wrap(~election_tp,nrow = 2)+ #split into the primary and general elections.
geom_point(alpha=.5,size=.75,stat="summary",fun.y=median)+
geom_line(color='purple',stat="summary",fun.y=median,size=1.25)+ #line plot for median contributions.
labs(x="Candidate Names",y="Mean Contribution Amount ($)")+
theme(axis.text.x = element_text(size=10),
axis.text.y = element_text(size=15),
axis.title = element_text(size=15),
strip.text = element_text(size=15))+
guides(fill=FALSE)
p2<-ggplot(aes(x=names,y=contb_receipt_amt/1000000,group=1),
data=plotCont)+
ylab("Contribution Total ($ mil)")+ #line plot for sum of contributions by candidate split by primary and
ggtitle("Contribution by Candidate")+ #general elections.
geom_line(size=1.25,color='steelblue',stat="summary",fun.y=sum)+
theme(plot.title = element_text(hjust=.5,size=22),axis.title.x=element_blank(),
axis.text.x = element_text(size=10),
axis.text.y = element_text(size=13),
axis.title=element_text(size=15),
strip.text = element_text(size=15))+
facet_wrap(~election_tp,nrow=2)+
guides(fill=FALSE)
grid.arrange(p2,p1)
```
The next plot I chose, as it is the next level of abstraction, shows the differences across parties.
```{r,message=FALSE,warning=FALSE}
ggplot(aes(x=party,y=contb_receipt_amt),data=plotCont)+
geom_boxplot(aes(alpha=.1))+
xlab("Party")+ylab("Contribution Amount")+ggtitle("Contributions by Party")+
theme(plot.title = element_text(hjust=.5,size=22),
axis.title = element_text(size=15),
axis.text = element_text(size=13),
legend.position = "none")+
#coord_cartesian(ylim=c(quantile(Contributions[Contributions$party=="R",]$contb_receipt_amt,probs=.25),
# quantile(Contributions[Contributions$party=="R",]$contb_receipt_amt,probs=.95)))+
scale_y_log10()
#plotting boxplots for both contributions by gender and party with the 25th
#and 95th percentiles used to give a better picture of the plots.
```
Finally, the map visualization, for the main candidates is a worthwhile display of the data and the differences across these candidates:
```{r,warning=FALSE,fig.width=10}
#Redoing the maps with size and color affected by sum and count.
for (i in c("Clinton","Sanders","Trump")){
print(ggplot(aes(x=lon,y=lat),data=get(i))+borders("world",colour="gray50",
fill="gray50")+
geom_point(aes(col=get(i)$n,size=get(i)$sum_of_cont/1000))+
scale_color_gradient2(low="red",mid="orange",high="yellow")+
ggtitle(paste(i," Contributions"))+theme(axis.title =
element_text(size=15),
axis.text = element_text(size=15),
plot.title = element_text(hjust=.5,size=22),
legend.title = element_text(size=12)
)+labs(color="Count",size="Sum of Contributions($Th.)")
)
}
```
All three candidates seem to have similar locations contributing to them. Sanders garners donors from more locations than Clinton, and Trump garners even more locations than either!! But, both Sanders and Clinton seem to have areas with more-so higher contributions, though obviously less populated.
Results:
According to the visualizations, even though Sander's campaign was deemed very popular, and was self-financing without the helps of PACs, other candidates still did much better than him in terms of donations! Furthermore, even lesser known, and barely in the race candidates, like Lessig, received high mean and median contributions-- higer than Trump, Sanders or Clinton which suggests that these candidates (though probably did get just as high donations as the other candidates) received a majority of low amount donations. Finally, it seems the three main candidates got donations from across the globe, yet, again, it seems Trump, due to his world-wide fame garnered even more (count-wise), something not many people expected at all.
One struggle I had while creating this was figuring out to display the city data, especially because of the large number of observations. In terms of coding this was solved by parsing the data rather than iterating over it, and visually solved by mapping mean, count and sum values. On the other hand, I also struggled with the data for occupation, which I feel is really valuable data, but much more-so difficult to make a visualization for. A possible visualization could just be a table, much like the ones created for cities, that contains the top ten occupations by candidate!
For improvements and further research, I would suggest attemping to do this for local elections, and then determining whether contributions actually affect winning rates (for Which the data here suggests they do not).