VGChartzExploration.rmd
P4: VGChartz Data Exploration in R by Whitney King
========================================================
---
author: Whitney King
output: html_document
---
## Data Exploration in R
#### Dataset: VGChartz Top 10,000 Best-Selling Video Games Globally
Data obtained on 5/22/2017
## Introduction
This data exploration will take an in depth look at data scraped from the
VGChartz.com charts for regional and global video games sales (by millions
of units). This data was obtained on 5/22/2017 using a Python3 script, and
importing BeautifulSoup to parse out the HTML data.
After the dataset was scraped from the table on
[VGChartz website](http://www.vgchartz.com/gamedb/), it was then limited
to the top 10,000 rows, and formatted using a dataframe before being output
to CSV for use in R. I've opted to format and color all charts, since they
are easier for me to read that way.
The goal of this exploration will be to gain a general understanding of
which features of video games are correlated to success in different
regions around the globe and across time. It will also look at platforms,
and how they're related to the ebb and flow of video games.
### Motivating Questions
* Which region has the most game units shipped?
* Is the popularity of video games still on the rise?
* Which game platform has been the most successful?
### Data Overview
The dataset is structured as follows:
* **Rank** *(num)*;
Primary Key/Unique Identifier for each game in the list,
ranked best-selling to worst-selling
* **Name** *(factor)*;
Title of the video game
* **Platform** *(factor)*;
Console/Platform game was released on
* **Year** *(num)*;
Year game was released
* **Genre** *(factor)*;
Genre/Category of the game title
* **Publisher** *(factor)*;
Publisher of the video game
* **NA_Sales** *(num)*;
Sales in millions of units in North America
* **EU_Sales** *(num)*;
Sales in millions of units in Europe
* **JP_Sales** *(num)*;
Sales in millions of units in Japan
* **Other_Sales** *(num)*;
Sales in millions of units in other regions of the globe
* **Global_Sales** *(num)*;
Total sales in millions of units globally
Columns being generated are:
* **Decade** *(factor)*;
Decade the game was released
* **Franchise** *(factor)*;
Name of the franchise the game is from
* **Company_Name** *(factor)*;
Company that built the game console the game was published on
### Limitations
Since this data was collected from VGChartz, it is not an authoritative list
of all games ever released. Certain factors may prevent a game from landing on
the VGChartz charts, however most highly publicized titles are in the dataset.
Since this exploration is concerned with the top 10k bestselling games of all
time, that list is fairly comprehensive and a good source to investigate.
Additionally, the counts are labeled as sales figures (Global_Sales, NA_Sales,
etc.); however, this is a bit of a misnomer, since some games come free with
other purchases or are given away, yet still count towards a unit. For this
reason, the numbers will mostly be discussed as units shipped rather than
units sold, though it can be assumed most titles were sold rather than given
away.
```{r global_options, include=FALSE}
#suppress the warnings and other messages from showing in the knitted file.
knitr::opts_chunk$set(fig.width=8, fig.height=6, fig.path='Figs/',
echo=TRUE, warning=FALSE, message=FALSE)
```
```{r Load_Packages, echo=FALSE, message=FALSE, warning=FALSE}
# install packages
#install.packages("knitr")
#install.packages("rmarkdown")
#install.packages('dplyr', repos = "http://cran.us.r-project.org")
#install.packages('ggplot2', repos = "http://cran.us.r-project.org")
#install.packages('corrplot', repos = "http://cran.us.r-project.org")
#install.packages('ggcorrplot', repos = "http://cran.us.r-project.org")
#install.packages('PerformanceAnalytics',
# repos = "http://cran.us.r-project.org")
#install.packages('GGally', repos = "http://cran.us.r-project.org")
#install.packages('ggthemes', dependencies = TRUE,
# repos = "http://cran.us.r-project.org")
#install.packages('Hmisc', repos = "http://cran.us.r-project.org")
#install.packages('plotly', repos = "http://cran.us.r-project.org")
#install.packages('gridExtra', repos = "http://cran.us.r-project.org")
#install.packages('reshape2', repos = "http://cran.us.r-project.org")
#install.packages('alr3', repos = "http://cran.us.r-project.org")
#install.packages('tidyr', repos = "http://cran.us.r-project.org")
#install.packages('psych', repos = "http://cran.us.r-project.org")
library(knitr)
library(rmarkdown)
library(ggplot2)
#library(psych)
library(corrplot)
library(ggcorrplot)
library(GGally)
library(PerformanceAnalytics)
library(ggthemes)
library(Hmisc)
library(plotly)
library(dplyr)
library(gridExtra)
library(data.table)
library(reshape2)
library(alr3)
library(tidyr)
theme_set(theme_minimal(10))
```
```{r echo=FALSE, Load_the_Data}
# load the Data
vgdata <- read.csv('vgsales.csv')
#subset(vgdata, Genre == 'Puzzle') # test preview data
vgdata$Rank <- NULL #not needed
vgdata$Name <- as.factor(vgdata$Name)
vgdata$Platform <- as.factor(vgdata$Platform)
vgdata$Genre <- as.factor(vgdata$Genre)
vgdata$Publisher <- as.factor(vgdata$Publisher)
vgdata$Year <- as.numeric(as.character(vgdata$Year))
vgdata$NA_Sales <- as.numeric(as.character(vgdata$NA_Sales))
vgdata$EU_Sales <- as.numeric(as.character(vgdata$EU_Sales))
vgdata$JP_Sales <- as.numeric(as.character(vgdata$JP_Sales))
vgdata$Other_Sales <- as.numeric(as.character(vgdata$Other_Sales))
vgdata$Global_Sales <- as.numeric(as.character(vgdata$Global_Sales))
#Create Decade Column
vgdata$Decade <- vgdata$Year
vgdata$Decade[vgdata$Year < 1990] <- '80s'
vgdata$Decade[vgdata$Year >= 1990 & vgdata$Year < 2000] <- '90s'
vgdata$Decade[vgdata$Year >= 2000 & vgdata$Year < 2010] <- '00s'
vgdata$Decade[vgdata$Year >= 2010 & vgdata$Year < 2020] <- '10s'
vgdata$Decade[vgdata$Year >= 2020] <- '20s'
vgdata$Decade <- as.factor(vgdata$Decade)
#Create Franchise Column
vgdata$Franchise <- ifelse(grepl('Pokemon',
vgdata$Name, ignore.case = TRUE), 'Pokemon',
ifelse(grepl('LEGO',
vgdata$Name, ignore.case = TRUE), 'LEGO',
ifelse(grepl('FIFA',
vgdata$Name, ignore.case = TRUE), 'FIFA',
ifelse(grepl('Madden',
vgdata$Name, ignore.case = TRUE), 'Madden',
ifelse(grepl('Cars',
vgdata$Name, ignore.case = TRUE), 'Cars',
ifelse(grepl('Need for Speed',
vgdata$Name, ignore.case = TRUE), 'Need for Speed',
ifelse(grepl('Resident Evil',
vgdata$Name, ignore.case = TRUE), 'Resident Evil',
ifelse(grepl('Call of Duty',
vgdata$Name, ignore.case = TRUE), 'Call of Duty',
ifelse(grepl('Halo',
vgdata$Name, ignore.case = TRUE), 'Halo',
ifelse(grepl('Final Fantasy',
vgdata$Name, ignore.case = TRUE), 'Final Fantasy',
ifelse(grepl('Guitar Hero',
vgdata$Name, ignore.case = TRUE), 'Guitar Hero',
ifelse(grepl('Rock Band',
vgdata$Name, ignore.case = TRUE), 'Rock Band',
ifelse(grepl('Batman',
vgdata$Name, ignore.case = TRUE), 'Batman',
ifelse(grepl('Battlefield',
vgdata$Name, ignore.case = TRUE), 'Battlefield',
ifelse(grepl('BioShock',
vgdata$Name, ignore.case = TRUE), 'BioShock',
ifelse(grepl('Fallout',
vgdata$Name, ignore.case = TRUE), 'Fallout',
ifelse(grepl('Borderlands',
vgdata$Name, ignore.case = TRUE), 'Borderlands',
ifelse(grepl('Forza',
vgdata$Name, ignore.case = TRUE), 'Forza',
ifelse(grepl('Assassin\'s',
vgdata$Name, ignore.case = TRUE), 'Assassin\'s Creed',
ifelse(grepl('Castlevania',
vgdata$Name, ignore.case = TRUE), 'Castlevania',
ifelse(grepl('Skylanders',
vgdata$Name, ignore.case = TRUE), 'Skylanders',
ifelse(grepl('Disney Infinity',
vgdata$Name, ignore.case = TRUE), 'Disney Infinity',
ifelse(grepl('Donkey Kong',
vgdata$Name, ignore.case = TRUE), 'Donkey Kong',
ifelse(grepl('Dragon Ball',
vgdata$Name, ignore.case = TRUE), 'Dragon Ball',
ifelse(grepl('Dragon Quest',
vgdata$Name, ignore.case = TRUE), 'Dragon Quest',
ifelse(grepl('Dynasty Warriors',
vgdata$Name, ignore.case = TRUE), 'Dynasty Warriors',
ifelse(grepl('ESPN',
vgdata$Name, ignore.case = TRUE), 'ESPN Sports',
ifelse(grepl('Grand Theft Auto',
vgdata$Name, ignore.case = TRUE), 'Grand Theft Auto',
ifelse(grepl('007',
vgdata$Name, ignore.case = TRUE), 'James Bond',
ifelse(grepl('Mario',
vgdata$Name, ignore.case = TRUE), 'Mario Brothers',
ifelse(grepl('Marvel',
vgdata$Name, ignore.case = TRUE), 'Marvel',
ifelse(grepl('Mega Man',
vgdata$Name, ignore.case = TRUE), 'Mega Man',
ifelse(grepl('Metal Gear Solid',
vgdata$Name, ignore.case = TRUE), 'Metal Gear Solid',
ifelse(grepl('Prince of Persia',
vgdata$Name, ignore.case = TRUE), 'Prince of Persia',
ifelse(grepl('Sonic', vgdata$Name,
ignore.case = TRUE), 'Sonic',
ifelse(grepl('Star Wars',
vgdata$Name, ignore.case = TRUE), 'Star Wars',
ifelse(grepl('Tales of',
vgdata$Name, ignore.case = TRUE), 'Tales of',
ifelse(grepl('The Legend of Zelda',
vgdata$Name, ignore.case = TRUE), 'Zelda',
ifelse(grepl('Tetris', vgdata$Name,
ignore.case = TRUE), 'Tetris',
ifelse(grepl('The Sims', vgdata$Name,
ignore.case = TRUE), 'The Sims', 'Other'
))))))))))))))))))))))))))))))))))))))))
vgdata$Franchise <- as.factor(vgdata$Franchise)
#Create Console_Company Column
vgdata$Console_Company <- as.character('Other')
vgdata$Console_Company[vgdata$Platform %in% c('XOne', 'XB',
'X360')] <- 'Microsoft'
vgdata$Console_Company[vgdata$Platform %in% c('PS', 'PS2', 'PS3', 'PS4',
'PSP', 'PSV')] <- 'Sony'
vgdata$Console_Company[vgdata$Platform %in% c('3DS', 'DS', 'GB', 'GBA', 'GC',
'N64', 'NES', 'SNES', 'NS', 'Wii', 'WiiU')] <- 'Nintendo'
vgdata$Console_Company[vgdata$Platform %in% c('DC', 'GEN', 'SAT',
'SCD')] <- 'Sega'
vgdata$Console_Company[vgdata$Platform %in% c('PC')] <- 'PC'
vgdata$Console_Company[vgdata$Platform %in% c('2600')] <- 'Atari'
vgdata$Console_Company <- as.factor(vgdata$Console_Company)
#head(vgdata)
#tail(vgdata)
NA.Units <- vgdata$NA_Sales
EU.Units <- vgdata$EU_Sales
JP.Units <- vgdata$JP_Sales
Other.Units <- vgdata$Other_Sales
Global.Units <- vgdata$Global_Sales
units <- data.frame(Global.Units,
NA.Units,
EU.Units,
JP.Units,
Other.Units)
```
## Univariate Plots Section
It will be important to understand a little bit about the data in this dataset
prior to working with it.
```{r echo=FALSE, VGData_Summary}
summary(vgdata)
```
Descriptive Statistics for each column in the dataset, which shows some
interesting breakdowns of the numbers at a glance. The dataset is made up
of 13 columns, with 10,000 rows of data. Of the Top 10k games with the
most units shipped, the minimum was 120,000 units, and the maximum was
82.54 million units globally.
### Numeric Values
```{r echo=FALSE, GamesMadeByYear_Histogram}
h1 = ggplot(na.omit(vgdata), aes(Year)) +
geom_histogram(binwidth = 1,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .6) +
#geom_density(col = "#FF5733", # trend line
# aes(y = ..count..),
# alpha = 0,
# adjust = 2) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3) +
labs(x = 'Year',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
scale_x_continuous(breaks = seq(1980, 2020, by = 2))
h1
```
First worth noting is that this chart does not include the 157 rows where `NA`
was entered for the `Year`. It's interesting to see that the period from
2007 - 2011 was a peak time in games being released, with 2008 being the
highest year overall with 828 new titles.
This could be due to a wide range of factors, and doesn't necessarily translate
into high sales for all of the games that were released. The shape of the data
is single-modal, with a rightward skew, showing an overall rise in the number
of games released each year over time.
It's also worth noting that the data from 2017 forward is incomplete, as this
data pull was done in May 2017, so it may be wise for some explorations to only
look at full years counted (2016 or earlier). Next it will be interesting to
look at a few categorical breakdowns of the data.
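As a quick sketch of that restriction (assuming the `vgdata` frame from the
load chunk; `Full.Years` is just a hypothetical name for illustration), the
complete-year subset can be built with a simple filter:

```{r echo=FALSE, FullYears_Sketch}
# Keep only rows with a known release year, up through the last complete year
Full.Years <- subset(vgdata, !is.na(Year) & Year <= 2016)
nrow(Full.Years)
```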
```{r echo=FALSE, UnitsSold_Histogram}
h2 = ggplot(vgdata, aes(Global.Units)) +
geom_histogram(bins = 24,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .8) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#a65481", low = "#311926")
h2
```
Breaking the global sales figures out in a histogram with 24 bins showing the
number of games in each range of unit values yields a very different shape
than the histogram for the Year column.
For global sales, there is an extreme leftward skew with a very tall first bar
(8000 games that shipped 2 million copies or less), with an extremely long tail
getting smaller and smaller as it goes to the right (very few games sell tens
of millions of copies). This is interesting, but using a log scale on this plot
would be more informative about the distribution of games selling less than a
million copies.
```{r echo=FALSE, Log10UnitsSold_Histogram}
h3 = ggplot(vgdata, aes(Global.Units)) +
geom_histogram(bins = 24,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .8) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#a65481", low = "#311926") +
scale_x_log10(breaks = c(.12, .24, .48, .96, 1.92, 3.84, 7.68,
15.36, 30.72, 61.44, 122.88))
h3
```
This is a much more expected distribution of data, with most games selling
between 150k - 300k units. The distribution of global units shipped is single
modal with a leftward skew, indicating most of the top games have shipped a
couple hundred thousand units, with fewer and fewer games shipping millions
of units.
```{r echo=FALSE, Log10UnitsSoldRegions_Histogram}
h4 = ggplot(subset(vgdata, NA_Sales > 0), aes(NA_Sales)) +
geom_histogram(bins = 10,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .6) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#ff8e29", low = "#4c2a0c") +
scale_x_log10(breaks = c(0, .12, .24, .48, .96, 1.92, 3.84, 7.68,
15.36, 30.72, 61.44, 122.88))
h5 = ggplot(subset(vgdata, EU_Sales > 0), aes(EU_Sales)) +
geom_histogram(bins = 10,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .6) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#78c95f", low = "#182813") +
scale_x_log10(breaks = c(0, .12, .24, .48, .96, 1.92, 3.84, 7.68,
15.36, 30.72, 61.44, 122.88))
h6 = ggplot(subset(vgdata, JP_Sales > 0), aes(JP_Sales)) +
geom_histogram(bins = 10,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .6) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#267b8c", low = "#0f3138") +
scale_x_log10(breaks = c(0, .12, .24, .48, .96, 1.92, 3.84, 7.68,
15.36, 30.72, 61.44, 122.88))
h7 = ggplot(subset(vgdata, Other_Sales > 0), aes(Other_Sales)) +
geom_histogram(bins = 10,
aes(fill = ..count..),
col = 'darkgreen',
alpha = .6) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
aspect.ratio = 2 / 3,
legend.position = 'none') +
scale_fill_gradient("Count", high = "#854c85", low = "#271627") +
scale_x_log10(breaks = c(0, .12, .24, .48, .96, 1.92, 3.84, 7.68,
15.36, 30.72, 61.44, 122.88))
grid.arrange(h4, h5, h6, h7, ncol = 2)
```
When the histogram is broken out by region, log10 no longer works well due to
the large number of games in each region that didn't ship enough copies to
register in the millions (or any at all), so to understand the long tail data,
the dataset needed to be subset. There's a similar leftward skew in all regions,
however there is more variation in where games land on the left side (which
could be due to the large number of games that never shipped or didn't sell
well in some regions).
What this shows is the curve of how games overall tend to ship in each region.
It's apparent that globally it's very common for games to ship somewhere
between 120k - 240k copies when they've made it into the top 10k bestselling
games on VGChartz, however in the Other region, that many units would be
considered an even larger success. In NA and EU, blockbuster titles have
shipped tens of millions more units than what would be considered a
blockbuster in JP and other regions.
### Categorical Values
```{r echo=FALSE, Genre_Bar}
b1 = ggplot(vgdata, aes(x = reorder(Genre,
Genre,
function(x) + length(x)))) +
geom_bar(aes(fill = ..count..),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme(aspect.ratio = 2 / 3,
legend.position = 'none') +
labs(x = 'Genre',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
coord_flip()
b1
```
Somewhat unsurprisingly, Action is the most popular genre of games to release.
Sports is the next most popular genre, followed by Misc. Since 'Misc' isn't a
very descriptive field, it will be worth digging into game titles that fall into
this category to see if trends can be identified.
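One way to start that digging (a sketch, assuming the `vgdata` frame from the
load chunk; `misc.titles` is a hypothetical helper for illustration) is to
peek at the most frequently occurring Misc titles:

```{r echo=FALSE, Misc_Peek}
# Most frequently occurring game names within the Misc genre
misc.titles <- sort(table(droplevels(subset(vgdata, Genre == 'Misc'))$Name),
                    decreasing = TRUE)
head(misc.titles, 10)
```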
```{r echo=FALSE, Platform_Bar}
b2 = ggplot(vgdata, aes(x = reorder(Platform,
Platform,
function(x) + length(x)))) +
geom_bar(aes(fill = ..count..),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme(aspect.ratio = 3 / 2,
legend.position = 'none') +
labs(#title = 'Bar Plot: Video Games Released by Platform',
#subtitle = 'Worldwide Releases',
x = 'Platform',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
coord_flip()
b2
```
When the data is broken down by Platform, it starts becoming extremely varied.
Just looking at the above bar chart, we can see that the most popular system
for game titles tracked by VGChartz is the PlayStation 2 (PS2), which has
almost half again as many titles published on it as the next platform, Nintendo
DS (DS). If we were to look at the other categorical variables in univariate
plots, readability would be extremely compromised, so it will be more
interesting to look at Console_Company and Franchise to group things together
a little more.
```{r echo=FALSE, Company_Bar}
b3 = ggplot(vgdata, aes(x = Console_Company)) +
geom_bar(aes(fill = ..count..),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme(aspect.ratio = 2 / 3,
legend.position = 'none') +
labs(x = 'Platform Company',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c")
b3
```
When it comes to titles published on a console, looking at just this plot,
Sony is winning the console wars overall. However, there are other ways
to examine the data.
```{r echo=FALSE, Publisher_Bar}
# create new data frame for publisher count
pubs <- data.frame(table(vgdata$Publisher))
colnames(pubs)[colnames(pubs) == 'Var1'] <- 'Publisher'
colnames(pubs)[colnames(pubs) == 'Freq'] <- 'Count'
Top.Publishers <- subset(pubs, pubs$Count > 100)
summary(Top.Publishers)
b4 = ggplot(Top.Publishers,
aes(reorder(Publisher, Count), Count)) +
geom_bar(stat = 'identity',
aes(fill = Count),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme(aspect.ratio = 3 / 2,
legend.position = 'none') +
labs(#title = 'Bar Plot: Publishers with Greater Than 100 Game Titles',
#subtitle = 'Worldwide Releases',
       x = 'Publisher',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
coord_flip()
b4
```
This summary gives us a more detailed view into the counts of games published
for all publishers with more than 100 games on the market. These are the most
prolific, and also the most recognizable, publishers in the industry, however
publishing a lot of games is not an indication that those games shipped in
large quantities.
Somewhat interesting (yet unsurprising) to see is that Electronic Arts, then
Activision, then Nintendo are the top publishers by count. EA is notorious for
shipping yearly games for their sports franchises, and all three companies are
some of the largest and most well-known publishers.
```{r echo=FALSE, Name_Bar}
# create new data frame for name count
names <- data.frame(table(vgdata$Name))
colnames(names)[colnames(names) == 'Var1'] <- 'Name'
colnames(names)[colnames(names) == 'Freq'] <- 'Count'
Top.Names <- subset(names, names$Count >= 7)
summary(Top.Names)
b5 = ggplot(Top.Names,
aes(reorder(Name, Count), Count)) +
geom_bar(stat = 'identity',
aes(fill = Count),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme(aspect.ratio = 3 / 2,
legend.position = 'none') +
labs(#title = 'Bar Graph: Game Titles on the Most Platforms',
#subtitle = 'Worldwide Releases',
x = 'Game Title',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
coord_flip()
b5
```
At first glance, this chart is blocky and less interesting than the previous
ones, however when looking at the game titles themselves, there are some
immediate observations. First, it becomes clear that if a game title is showing
up more than once, it's been published on more than one platform.
This chart shows just how many of the most cross-published video games are LEGO
related. Beyond that, there are some other noticeable patterns in the game titles.
This is going to be a really interesting column to drill down into further,
so we'll look at Franchise data. This could also lead to some scenarios where
we might need to consider duplicate counting.
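One possible way to guard against that double counting (a sketch, assuming
`vgdata` as loaded above; `title.totals` is a hypothetical helper frame) is to
collapse multi-platform releases into one row per title before summing units:

```{r echo=FALSE, Dedupe_Titles}
# Sum global units across platforms so each title appears only once
title.totals <- aggregate(Global_Sales ~ Name, data = vgdata, FUN = sum)
title.totals <- title.totals[order(-title.totals$Global_Sales), ]
head(title.totals)
```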
It should be noted that Franchise data has been limited to popular and
frequently occurring titles.
```{r echo=FALSE, Franchise_Bar}
# create new data frame for franchise count
frans <- data.frame(table(vgdata$Franchise))
colnames(frans)[colnames(frans) == 'Var1'] <- 'Franchise'
colnames(frans)[colnames(frans) == 'Freq'] <- 'Count'
Top.Franchises <- subset(frans, Franchise != 'Other')
summary(Top.Franchises)
b6 = ggplot(Top.Franchises,
aes(reorder(Franchise, Count), Count)) +
geom_bar(stat = 'identity',
aes(fill = Count),
col = 'darkgreen',
width = .8,
alpha = .6) +
theme_set(theme_minimal(8)) +
theme(aspect.ratio = 3 / 2,
legend.position = 'none') +
labs(#title = 'Bar Graph: Game Titles on the Most Platforms',
#subtitle = 'Worldwide Releases',
x = 'Franchise',
y = 'Count') +
scale_fill_gradient("Count", high = "#78c95f", low = "#267b8c") +
coord_flip()
b6
```
It's interesting to see which franchises have released the most games, however
this data might be better viewed as a bivariate comparison of Franchise vs
units shipped instead of a count of titles shipped, since titles can occur on
more than one platform.
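That comparison can be sketched ahead of the bivariate section (assuming
`vgdata` as loaded above; `fran.units` is a hypothetical helper frame for
illustration):

```{r echo=FALSE, Franchise_Units_Sketch}
# Total global units shipped per franchise, excluding the catch-all 'Other'
fran.units <- aggregate(Global_Sales ~ Franchise,
                        data = subset(vgdata, Franchise != 'Other'),
                        FUN = sum)
head(fran.units[order(-fran.units$Global_Sales), ], 10)
```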
# Univariate Analysis
### What is the structure of your dataset?
```{r echo=FALSE, VGData_Structure}
str(vgdata)
```
The `str()` output shows a basic breakdown of how the data was ingested. The
Year column was originally ingested as a factor due to the 'N/A' values; since
it should be a continuous variable, it was converted to numeric, which will
need to be taken into consideration as the analysis progresses.
There are 10,001 rows broken out across 13 variables. When broken down by year,
the shape of the data is single-modal, with a rightward skew, showing an
overall rise in the number of games released each year over time. Of the Top
10k games with the most units shipped, the minimum was 120,000 units, and the
maximum was 82.54 million units globally.
### What is/are the main feature(s) of interest in your dataset?
VGChartz exists to track game sales in millions across global regions. The
data consists of categorical columns, and the year column (independent
variables), as well as numeric columns for sales in millions of units (Global
Sales; dependent variable). When properly rearranged, the dependent variable
can be drilled down by Region. All categorical columns contain data of interest
and could affect the number of unit shipped regionally or globally.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
```{r echo=FALSE, VGData_Correlation}
games.corr <- data.frame(NA.Units,
EU.Units,
JP.Units,
Other.Units,
Global.Units)
cor(games.corr, method = 'pearson')
```
Taking a look at the correlation between game sales across regions could help
predict sales in one region based on another region's sales. This is more
informative than just looking at global units shipped alone.
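As a minimal sketch of that idea (not part of the original analysis, and
assuming the `vgdata` frame from the load chunk), a simple linear model could
relate one region's units to another's:

```{r echo=FALSE, Region_LM_Sketch}
# Fit European units as a linear function of North American units
eu.fit <- lm(EU_Sales ~ NA_Sales, data = vgdata)
summary(eu.fit)$r.squared
```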
Additionally, the Name column is going to prove very interesting when looking
at total sales, and cross comparing other columns. Using regexes to find
titles that are part of popular series will be valuable in determining whether
a game is likely to sell well.
### Did you create any new variables from existing variables in the dataset?
I created three new columns of data organizing the information that was in the
dataset. One column for decade a game title was released (based on the Year
column), one column for Franchise (based on frequently occurring game titles and
popular franchises), and one column for Console_Company to better visualize the
competition between the big three console makers.
Additionally, I've created new summarized datasets to view the information in
different ways.
### Of the features you investigated, were there any unusual distributions?
There didn't appear to be any unusual distributions when broken out by year,
though it was interesting to see that games published per year peaked in 2008,
and hasn't continued to rise. This could be due to a lot of factors.
When examining the Global_Sales data in a histogram using log scale, there was a
single-modal, leftward skewed distribution, which fell within expectations that
most games don't ship more than a million units.
### Did you perform any operations on the data to tidy, adjust, or \
change the form of the data? If so, why did you do this?
This data was obtained on 5/22/2017 using a Python3 script, and importing
BeautifulSoup to parse out the HTML data. After the dataset was scraped from
the table on VGChartz website, it was then limited to the top 10,000 rows, and
formatted using a data frame before being output to CSV for use in R. Since the
games are ranked by units shipped, they can be ordered by Global_Sales, so Rank
data isn't really important and was dropped. I also generated new columns based
on data the existing dataset contained, since grouping the categorical data a
bit more would allow for answering a wider variety of questions.
Aside from this, the data was left in its original state, as it was downloaded
in a tidy format. When working with the data in R, numerical values needed to
be transformed into numeric, and categorical variables were transformed into
factors.
# Bivariate Plots Section
```{r echo=FALSE, Correlation_Matrix}
#Reference: https://stackoverflow.com/questions/16194212/how-to-suppress-warni
#ngs-globally-in-an-r-script
oldw <- getOption("warn")
options(warn = -1)
melt.corr <- melt(cor(games.corr))
#Reference: http://www.sthda.com/english/wiki/correlation-matrix-formatting-an
#d-visualization#at_pco=smlwn-1.0&at_si=5927942aff9eb0fb&at_ab=per-2&at_pos=0&
#at_tot=1
cp1 <- chart.Correlation(games.corr,
histogram = FALSE)
options(warn = oldw)
```
Taking a look at the correlation between video game units shipped across
regions could help predict sales in one region based on another region's
units. This correlation matrix compares the numeric values from the dataset
(aside from Year). In general, if a game sells well in North America, it tends
to do well in Europe and globally. It will be interesting to see how the
categorical values impact numerical ones in bivariate and multivariate
exploration. We can see that most regions have at least a moderate or strong
positive correlation with each other, with an exception for JP/Other, which
have a weak positive correlation.
```{r echo=FALSE, Summary_Correlation}
summary(games.corr)
```
```{r echo=FALSE, GlobalSales_ScatterPlot}
# Scatter plot with mean and quantile lines for global units shipped
pp1 = ggplot(na.omit(vgdata), aes(Year, Global_Sales)) +
geom_point(col = '#a65481',
alpha = .1,
position = 'jitter') +
geom_line(stat = 'summary',
fun.y = quantile,
fun.args = list(probs = 0.1),
linetype = 2,
col = 'darkred') +
geom_line(stat = 'summary',
fun.y = quantile,
fun.args = list(probs = 0.9),
linetype = 2,
col = 'blue') +
geom_line(stat = 'summary', fun.y = mean) +
theme(axis.text.x = element_text(angle = 90,
hjust = 1),
aspect.ratio = 2 / 3) +
labs(x = 'Year',
y = 'Global Units') +
coord_trans(y = 'log10') +
scale_x_continuous(breaks = seq(1980, 2020, by = 2)) +
scale_y_continuous(breaks = c(.12, .24, .48, .96, 1.92,
3.84, 7.68, 15.36, 30.72,
61.44, 122.88))
pp1
```
The jittered point plot showing the number of units shipped per game per year
reveals some really interesting patterns. This plot uses a log scale on the Y
axis to better visualize the distribution of the long-tailed global sales
data. The black line shows the mean, the dotted blue line the 90th
percentile, and the dotted red line the 10th percentile.
On the low end of the graph, for games that shipped between 120k and 180k
units, there is some very obvious horizontal striping. This occurs because
VGChartz tracks the data as a two-decimal value representing millions of
units. For games with only a couple hundred thousand units shipped, there are
only a handful of levels available at two decimal places.
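This granularity can be checked directly: between 0.12 and 0.18 million,
two-decimal rounding leaves only seven representable values, which is exactly
what the stripes show (a sketch, assuming `vgdata` is loaded as above):

```r
# Sketch (assumes vgdata from the chunks above): with units stored as
# millions rounded to two decimals, only a few levels exist in this band.
low <- subset(vgdata, Global_Sales >= 0.12 & Global_Sales <= 0.18)
sort(unique(low$Global_Sales))
# the only possible levels are seq(0.12, 0.18, by = 0.01) -- seven values
```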
We can see that in the early years of video games, not nearly as many titles
were made, and of those that were, most sold a lot of copies. This continued
into the early '90s, when many more games started saturating the market.
Since there is such a big difference in the number of games released over the
years, this would be interesting to visualize with averages.
Showing average global units shipped by year on an overlaid line plot gives
a much clearer picture of the observations from the point plot, so I added a
summary line by mean. There are two large spikes in 1985 and 1989 where the
average units shipped per title was over 4 million. This is exceptionally high
compared to the much more stable average over the last 20 years of about 750k -
1 million units per title through 2016. What if we look at these numbers split
out by region?
```{r echo=FALSE, RegionalSales_ScatterPlot}
oldw <- getOption("warn")
options(warn = -1)
theme_set(theme_minimal(10))
Full.Year <- subset(vgdata, Year <= 2016)
#Helper: jittered regional scatter with mean (solid), 10th percentile
#(dashed red) and 90th percentile (dashed blue) summary lines
regional_plot <- function(yvar, col, ymax, ybreaks) {
  ggplot(na.omit(Full.Year), aes_string('Year', yvar)) +
    geom_point(col = col,
               alpha = .07,
               position = 'jitter') +
    geom_line(stat = 'summary', fun.y = mean) +
    geom_line(stat = 'summary',
              fun.y = quantile,
              fun.args = list(probs = 0.1),
              linetype = 2,
              col = 'darkred') +
    geom_line(stat = 'summary',
              fun.y = quantile,
              fun.args = list(probs = 0.9),
              linetype = 2,
              col = 'blue') +
    theme(axis.text.x = element_text(angle = 90,
                                     hjust = 1,
                                     vjust = .5),
          aspect.ratio = 2 / 3) +
    coord_trans(y = 'sqrt', limy = c(0, ymax)) +
    scale_x_continuous(breaks = seq(1980, 2016, by = 4)) +
    scale_y_continuous(breaks = ybreaks)
}
pp2 <- regional_plot('NA_Sales', '#ff8e29', 11, c(0, .25, 1, 3, 6, 11))
pp3 <- regional_plot('EU_Sales', '#78c95f', 3, c(0, .25, 1, 3))
pp4 <- regional_plot('JP_Sales', '#267b8c', 6, c(0, .25, 1, 3, 6))
pp5 <- regional_plot('Other_Sales', '#854c85', 3, c(0, .25, 1, 3))
grid.arrange(pp2, pp3, pp4, pp5, ncol = 2)
options(warn = oldw)
```
This plot takes the global units scatter plot and cleans it up by breaking it
out regionally. The plots use a sqrt transform instead of log10 so we can
account for zero units shipped in a region. These visualizations give a lot
more information when broken out by region. Since the regions are stored as
separate columns, each region has been plotted on its own panel and arranged
on a grid, as they couldn't easily be faceted like values of a single
variable. This problem was remedied for other plots; however, this approach
ended up working here, so I kept it as it is.
The plot for NA (North America, not N/A) ticks up to 11 million units, while
the plots for EU and Other tick up to 3 million, and the JP plot ticks up to
6 million. For Japan, the lower numbers seem to make sense since it's only a
single country, while it's somewhat surprising that the places in Other (like
the rest of Asia, and South America) don't generate more shipped units. Since
these views are zoomed in, not all Y-axis outliers are visible, and only
values from complete years (2016 or earlier) are shown.
In each region, we can still clearly see bumps around 1985 and 1989, so the
great successes that shipped millions of units those years did so globally.
Games represented by a dot above the 90th percentile line have done
exceptionally well on the market in that region. Additionally, the solid
stripe on the bottom of each region indicates games that shipped zero in
that region, so perhaps they weren't released there, or didn't ship enough
copies to be counted as millions.
For Japan, it's interesting to see that average units shipped have dropped
and leveled out over time as the number of titles released increased (we see
this pattern in NA, too). It also looks like there are a lot of games that
didn't ship or do well there, as JP has the most defined bottom stripe.
Maybe it really is impressive if something is big in Japan! For the EU and
Other regions, there has been a gradual increase in sales as the years go on,
likely due in part to slower market saturation in those areas.
For the last 20 years, it's been common for games to ship around 250k or
more copies in NA and EU, while JP and Other each average about half of that.
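These ballpark figures can be checked numerically rather than read off the
plots (a sketch, assuming `vgdata` is loaded as above; the 1997 cutoff is an
illustrative choice for "the last 20 years" of the data):

```r
# Sketch (assumes vgdata from the chunks above): mean units shipped per
# title by region over roughly the last two decades of complete data.
recent <- subset(na.omit(vgdata), Year >= 1997 & Year <= 2016)
colMeans(recent[, c('NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales')])
```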
```{r echo=FALSE, Summary_Genre}
Summary.Genre = na.omit(vgdata) %>%
group_by(Genre) %>%
summarise(NA.Sum = sum(NA_Sales),
NA.Mean = mean(NA_Sales),
NA.Median = median(NA_Sales),