-
Notifications
You must be signed in to change notification settings - Fork 2
/
nflscrapr-dataviz-cookbook.rmd
2337 lines (1845 loc) · 79.4 KB
/
nflscrapr-dataviz-cookbook.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Tom's Cookbook for Better Viz"
date: "Last updated: `r format(Sys.Date())`"
output:
html_document:
toc: true
toc_depth: 4
toc_float: true
theme: cosmo
dpi: 500
fig_width: 10
fig_height: 8
---
```{r setup, include=FALSE, echo=TRUE, warning=FALSE}
library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm)
library(waffle)
library(rvest) # webscraping in R
library(glue) # awesome text creation
library(gganimate) # create gifs with ggplot
library(ggimage) # load team images in ggplot
pbp <- read_csv("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.csv.gz")
```
## How to improve your `nflfastR` graphics
This resource is modeled after the fantastic [BBC Graphics Cookbook](https://bbc.github.io/rcookbook/), which is also worth checking out. The `nflscrapR` team ([Maksim Horowitz](https://twitter.com/bklynmaks/), [Ron Yurko](https://twitter.com/Stat_Ron/), and [Sam Ventura](https://twitter.com/stat_sam/)) have compiled easy to access play-by-play stats opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). This work has been extended in [`nflfastR`](https://mrcaseb.github.io/nflfastR/) by [Sebastian Carl](https://twitter.com/mrcaseb) and [Ben Baldwin](https://twitter.com/benbbaldwin). [Ben Baldwin](https://twitter.com/benbbaldwin) has compiled a [quickstart guide](https://mrcaseb.github.io/nflfastR/articles/beginners_guide.html) to using this data. As such, this resource is not aimed at reproducing that tutorial, but giving you some quick guides for improving the graphics you create via `ggplot2`. It's easy to get started quickly exploring the data with `ggplot2` and hopefully this helps with your "publication" quality plots.
I am providing a lot of my own **opinion** on certain dataviz choices - everyone is allowed to make their own decisions with regards to colors, ink use, chart type - but I do hope that this resource opens your eyes to some of the art of dataviz now that you have made progress with the science.
The source code for this webpage is on [Github](https://github.com/jthomasmock/nfl_plotting_cookbook) if you want to take a look.
## Additional Resources
If you'd rather go deeper into a textbook and ignore specific applications related to `nflscrapR`, check out these amazing free online resources (some available in print as well):
| Title/Link | Author | Description |
| :--- | :--- |:---------------|
| [R for Data Science](https://r4ds.had.co.nz/) | Hadley Wickham, Garret Grolemund | A great overview of the `tidyverse`, covers everything from reading data in, data manipulation/summarization, data viz, and general programming in R |
| [SocViz](https://socviz.co/lookatdata.html#lookatdata) | Kieran Hiely | Covers exactly HOW to create a lot of different plot types in `R`/`ggplot2` |
| [Fundamentals of Data Viz](https://serialmentor.com/dataviz/) | Claus Wilke | Covers the WHY of Data Viz where all examples are in R, but no code examples in the book, but are available on his [GitHub](https://github.com/clauswilke/dataviz) |
| [BBPlot Cookbook](https://bbc.github.io/rcookbook/) | BBC Data Team | Intro primer to news-style graphics in `ggplot2` |
| [`ggplot2` cookbook](http://www.cookbook-r.com/Graphs/) | Winston Chang | Quick cookbook of `ggplot2` plots |
| [R Graph Gallery](https://www.r-graph-gallery.com/) | Yan Holtz | Cookbook examples of a majority of plot types. |
| [`ggplot2` Book ](https://ggplot2-book.org/) | Hadley Wickham, Danielle Navarro | This 3rd edition of the `ggplot2` book is currently under development, but also available freely online for the first time! A more technical book that should align well with either SocViz or Fundamentals of Data Viz |
## Useful code chunks
There are a couple features that we will use throughout these examples:
### `dplyr::if_else()`
This allows you to make a binary conversion.
For example `if_else(condition, true, false)`
* `mutate(success = if_else(epa > 0, 1, 0))`
* `mutate(color = if_else(posteam == "PIT", "yellow", "grey))`
### `dplyr::case_when()`
This allows you to essentially use many `if_else` statements at once
* The `~` indicates an assignment, where if the left side statement is evaluated as TRUE then the outcome is `~` (assigned) to the right side.
* The right side can be a number, text, etc
* The left side can be a simple or complex statement, but must evaluate as TRUE/FALSE (logical)
* The final `TRUE ~ NA_character_` is basically a "catch" - if none of the other cases are met, then it will default to NA
* In this case we use `NA_character_` from `dplyr`, but you could also have a situation where it could simply say "nope" or revert back to some other column
* If you want to have the right side (assignment) be a number, you'll need to use `NA_integer_`
* Lastly, a longer `case_when()` is presented shortly below
```{r}
pbp %>%
mutate(
stick_throw = case_when(
air_yards < ydstogo ~ "Short of Sticks",
air_yards == ydstogo ~ "At Stick",
air_yards > ydstogo ~ "Past Stick",
TRUE ~ NA_character_
)
) %>%
select(air_yards, ydstogo, stick_throw) %>%
filter(!is.na(air_yards))
```
### `scale_color_identity()`
This is useful in combination with the above example of assigning color in a plot, essentially it will take the "yellow" or "grey" argument automatically.
### `scale_color_manual()`
This allows you to specify colors of interest like `scale_color_manual(values = c("red", "black"))`
### `forcats::reorder()`
This allows you to reorder levels of a `ggplot` by another variable.
eg `reorder(posteam, epa)`
### `Helpers`
There are a few helpers used frequently throughout.
* `!` indicates not or negation, so `x != 5` means x not equal to 5.
* `!is.na(x)` indicates x is NOT NA
* `%in%` means in - so `x %in% c(2, 3, 4)` means x matches 2, 3 OR 4
* `dplyr::between(x, left, right)` - shortcut for `x >= left & x <= right`
* `hjust`/`vjust` - this is typically assigned 0 through 1, and adjusts either the horizontal or vertical alignment
### `ggplot2` specs
The [documentation for `ggplot2`](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html) cover in great detail MANY options for minor but important customizations. I'm not adding it directly here but adding as a resource. It is definitely worth parsing through, and some examples below:
* lines (size, color, type, join, end)
* points (size, color, fill, stroke)
* text (size, face)
* justification (hjust, vjust, nudge_x, nudge_y)
### `teamcolors` package
Gives you ALL the colors for NFL teams, although `nflfastR` also provides colors and logos via `nflfastR::teams_colors_logos`.
#### Using `teamcolors`
```{r}
filter(teamcolors, league == "nfl")
# or
nflfastR::teams_colors_logos
```
Please note that teams are listed by full name so to use them with the play-by-play data you will need to "join" the `teamcolors` and play-by-play datasets together.
The list of short teams named could be accomplished like so:
```{r}
left_join(pbp, nflfastR::teams_colors_logos, by = c("posteam" = "team_abbr"))
```
You could then use `dplyr::left_join()` to join the full full_team_names, colors, and team logos to the play-by-play data. Without getting into the weeds TOO much, a `left_join` basically finds cases where there is a matching row in the common column (posteam) for both dataframes, and then adds the additional columns from nfl_colors to the play-by-play data. Joins are a very important concept when trying to combine multiple datasets, and if you want to read more about the various types and their use cases check out the [`dplyr` joins docs](https://dplyr.tidyverse.org/reference/join.html).
Quick example below:
```{r, eval = FALSE}
# read in data
pbp <- read_csv("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.csv.gz")
```
```{r}
# left_join the data together
pbp_colors <- left_join(pbp, nflfastR::teams_colors_logos, by = c("posteam" = "team_abbr"))
pbp_colors %>%
# Excludes non-plays, eg end of quarter
filter(!is.na(posteam)) %>%
select(posteam, team_name, team_color, team_color2, team_logo_wikipedia) %>%
# Distinct grabs only the distinct/unique cases of column
distinct(posteam, .keep_all = TRUE)
```
So we can see that the posteam and team_abb are equivalent, where the full team name, colors, and logo are also added. I dropped the other 250+ columns for printing here, but they would be in the complete dataframe.
### `ggsave()`
If you are going to export your graphics, it's worth it to go through `ggsave()` rather than the `RStudio` export button.
The full [docs](https://ggplot2.tidyverse.org/reference/ggsave.html) have lots of great info but I'll summarize it here. The basic arguments in pseudocode are below.
```{r, eval = FALSE}
ggsave("plot_name.png", plot_object,
height = x, width = y, units = "in", dpi = "300")
```
A typical call of `ggsave` would look like the below.
```{r, eval = F}
ggsave("wr_epa.png", wr_epa_plot,
height = 6, width = 8, units = "in", dpi = "350")
```
Arguably, the most important part is the DPI call - if you save through the export button you will typically have a low DPI (72) that has jagged edges on lines (known as aliasing), as opposed to exporting with a higher DPI which will give a higher quality appearance.
You will likely spend some time perfecting the print size of your plots, but if you use your own theme with text sized appropriately you can typically set a specific DPI and work from there.
### Changing fonts
Changing fonts for graphics in R can be easy if you use a package like `extrafont` or `showtext`. You can then change font family in your theme calls or as part of your personal theme.
`extrafont` has an [example](https://cran.r-project.org/web/packages/extrafont/README.html) walking through it's use.
`showtext` has an [example](https://cran.rstudio.com/web/packages/showtext/vignettes/introduction.html) walking through it's use.
## Prep
### Load all the libraries you need
There are a few packages I will use in this guide, most of them related to data viz.
```{r eval=FALSE}
library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(extrafont) # for extra fonts
```
### Read in the pbp data
This is taken almost verbatim from Ben's Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.
```{r, message=FALSE, warning=FALSE, eval = FALSE}
pbp <- read_csv("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.csv.gz")
```
## Our first data summary
This is also credited to Ben:
"Let's look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:"
```{r}
kc <- pbp %>%
filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
group_by(posteam) %>%
summarize(mean_pass = mean(pass),
plays = n()) %>%
arrange(mean_pass)
kc
```
"Kansas City led the league in passing rate 2019. Fun! Let's see what that looks like:"
```{r}
ggplot(kc, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam))
```
Now this is a useful plot, but as Ben said:
"This image is kind of a mess -- we still need a title, axis labels, etc -- but gets the point across. We'll get to that other stuff later."
Let's get to that stuff now!
### Themes
`ggplot2` out of the box comes with a bunch of themes, things like `theme_bw()`, `theme_minimal()`, `theme_classic()`, and the default `theme_grey()`.
Let's see what they look like with the same plot as above.
`theme_bw()`
- Notice that we now have grey gridlines, a black border, and a white background.
```{r, echo = FALSE}
ggplot(kc, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_bw()
```
`theme_minimal()`
- Notice that we still have grey gridlines, a white background, but now no black border.
```{r, echo = FALSE}
ggplot(kc, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_minimal()
```
`theme_classic()`
- Notice that we now have NO gridlines, a half black border, and the same white background.
```{r, echo = FALSE}
ggplot(kc, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_classic()
```
But as with almost everything in `R`, there are more packages that add more functionality! In this case, there are entire packages dedicated to themes in `ggplot2` and you have the ability to build your own themes!
### More themes
```{r, eval = FALSE}
library(ggthemes)
library(bbplot)
```
The [`ggthemes` package](https://github.com/jrnold/ggthemes) gives you a wide assortment of additional themes as seen [here](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/). Most importantly it also gives you ideas about customizations to your personal theme. If you parse through the source code, you can create your own theme and utilize across your visualizations.
`theme_fivethirtyeight()`
- The difference from `theme_minimal()` is ironically, *minimal* but the main difference is heavier grey gridlines, and a subtle grey background - which aligns with the `FiveThirtyEight` style.
```{r, echo = FALSE}
ggplot(kc, aes(x = reorder(posteam, -mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_fivethirtyeight()
```
Again, the exciting part about `ggthemes` in my mind is the concept of creating your **own** theme. In fact, the code for this theme is pretty simple!
```{r, eval = FALSE}
theme_fivethirtyeight <- function(base_size = 12, base_family = "sans") {
colors <- deframe(ggthemes::ggthemes_data[["fivethirtyeight"]])
(theme_foundation(base_size = base_size, base_family = base_family)
+ theme(
line = element_line(colour = "black"),
rect = element_rect(
fill = colors["Light Gray"],
linetype = 0, colour = NA
),
text = element_text(colour = colors["Dark Gray"]),
axis.title = element_blank(),
axis.text = element_text(),
axis.ticks = element_blank(),
axis.line = element_blank(),
legend.background = element_rect(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
panel.grid = element_line(colour = NULL),
panel.grid.major =
element_line(colour = colors["Medium Gray"]),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0, size = rel(1.5), face = "bold"),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
strip.background = element_rect()
))
}
```
### Edited Theme
I personally edited this so that it didn't remove axis titles, and to have a white background instead of gray, which you can see below.
```{r}
theme_538 <- function(base_size = 12, font = "Lato") {
# Text setting
txt <- element_text(size = base_size + 2, colour = "black", face = "plain")
bold_txt <- element_text(
size = base_size + 2, colour = "black",
family = "Montserrat", face = "bold"
)
large_txt <- element_text(size = base_size + 4, color = "black", face = "bold")
theme_minimal(base_size = base_size, base_family = font) +
theme(
# Legend Settings
legend.key = element_blank(),
legend.background = element_blank(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
# Backgrounds
strip.background = element_blank(),
strip.text = large_txt,
plot.background = element_blank(),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
# Axis & Titles
text = txt,
axis.text = txt,
axis.ticks = element_blank(),
axis.line = element_blank(),
axis.title = bold_txt,
plot.title = large_txt,
# Panel
panel.grid = element_line(colour = NULL),
panel.grid.major = element_line(colour = "#D2D2D2"),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_blank()
)
}
```
Now let's see what the edited theme looks like in action!
```{r, fig.height = 10, fig.width=16, dpi = 500}
ggplot(kc, aes(x = reorder(posteam, -mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_538()
```
Regardless - the idea here is that you can:
* Use a built in theme (theme_bw, theme_minimal, etc)
* Use a pre-built theme (bbplot, ggthemes, etc)
* Or build your own theme!
All are valid, but you don't necessarily have to actually manually code the theme element changes to each and every plot. You can at the least write your own theme as a function and use it. Alternatively, you can write your own package (easier than it sounds!) and source that.
If you would like to read more about customizing your OWN theme - check out the great resource by Simon Jackson at his [blog](https://drsimonj.svbtle.com/creating-corporate-colour-palettes-for-ggplot2).
## Line Charts
Basic line chart = `ggplot() + geom_line()`
```{r}
# Prepare data
wr_duel <- pbp %>%
filter(receiver %in% c("C.Godwin", "M.Evans")) %>%
group_by(week, receiver) %>%
summarize(mean_epa = mean(epa, na.rm = TRUE))
ggplot(
wr_duel,
aes(x = week, y = mean_epa, color = receiver)
) +
geom_line(size = 1)
```
### Let's improve this a bit.
```{r}
wr_duel_plot <- ggplot(
wr_duel,
aes(x = week, y = mean_epa, color = receiver)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "\nGame Date",
y = "EPA (Average)",
title = "Quick comparison of Godwin vs Evans across the 2019 season",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
wr_duel_plot
```
### Change the colors
But we can still improve this a lot - it feels a bit crowded, plus the red/blue colro scheme doesn't align with the team's color or anything else. We can add colored text via the [`ggtext` package](https://github.com/clauswilke/ggtext), or we can manually change the colors. Also note that you can grab the team's colors via `teamcolors` package.
```{r}
tb_colors <- nflfastR::teams_colors_logos %>%
filter(team_abbr == "TB")
tb_colors
tb_primary <- pull(tb_colors, team_color)
tb_secondary <- pull(tb_colors, team_color4)
```
### Asign the colors
```{r}
wr_duel_plot <- wr_duel %>%
ggplot(
aes(x = week, y = mean_epa,
color = if_else(receiver == "C.Godwin", tb_primary, tb_secondary)
)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "grey") +
labs(x = "",
y = "EPA (Average)",
title = glue::glue("Quick comparison of <span style='color:{tb_primary}'>**Godwin**</span> vs <span style='color:{tb_secondary}'>**Evans**</span> across the 2019 season"),
caption = "Data: @nflfastR") +
scale_color_identity() +
scale_x_continuous(breaks = seq(1, 17, 1)) +
scale_y_continuous(breaks = seq(-1, 2, 0.5)) +
theme(plot.title = element_markdown())
wr_duel_plot
```
### Add the legend back
Alternatively, if you didn't want to drop a legend, you could approach it this way.
```{r}
wr_duel_plot <- ggplot(
wr_duel,
aes(
x = week, y = mean_epa,
color = receiver
)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "",
y = "EPA (Average)",
title = "Quick comparison of Godwin vs Evans across the 2019 season",
caption = "Data: @nflfastR | Plot: @thomas_mock"
) +
scale_color_manual(values = c(tb_primary, tb_secondary)) +
scale_x_continuous(breaks = seq(1, 17, 1)) +
scale_y_continuous(breaks = seq(-1, 2, 0.5)) +
theme(
legend.title = element_blank(),
legend.position = "top"
)
wr_duel_plot
```
### Or try direct labeling!
```{r}
wr_duel_plot +
theme(legend.position = "none") +
geom_text(data = filter(wr_duel, week == 2),
aes(x = week, y = mean_epa, label = receiver),
hjust = 0, nudge_x = 0.2, size = 4, fontface = "bold"
) +
geom_point(data = filter(wr_duel, week == 2),
size = 3
)
```
## Bar Charts
Everyone's favorite - bar charts! But always remember that bar charts can limit information - we'll look at distribution plots of various types later, but for now back to the bar.
Basic forms:
* `ggplot(aes(x = category, y = value)) + geom_col()`
* `ggplot(aes(x = category, y = value)) + geom_bar(stat = "identity")`
Column defaults to plotting identity, essentially the single number is read as the max value. `geom_bar()` on the other hand has a bit more flexibility if you want to build stacked or segmented bar charts.
```{r}
rb_trio <- pbp %>%
filter(
posteam == "BAL",
receiver %in% c("M.Ingram", "M.Ingram II", "G.Edwards", "J.Hill") |
rusher %in% c("M.Ingram", "M.Ingram II", "G.Edwards", "J.Hill"),
play_type %in% c("run", "pass")
) %>%
mutate(
# Assign a single player name for filtering regardless of play_type
player = if_else(is.na(receiver), rusher, receiver),
player = if_else(str_detect(player, "Ingram"), "M.Ingram", player),
player = factor(player, levels = c("M.Ingram", "G.Edwards", "J.Hill")),
# Add nice labels to play_type
play_type = factor(play_type, labels = c("Reception", "Rush"))
) %>%
group_by(player, play_type) %>%
summarize(
n = n(),
mean_yards = sum(yards_gained, na.rm = TRUE) / n,
mean_success = sum(success, na.rm = TRUE) / n
)
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards)) +
geom_col(aes(fill = play_type), position = "dodge")
rb_trio_plot
```
Something to notice above - we have created a "grouped" bar chart, where the bars are grouped by player and color is assigned to play type. We can split this out into facets as an alternative representation.
### Facets
```{r}
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards, fill = player, position = "dodge", group = play_type)) +
geom_col() +
facet_grid(~play_type)
rb_trio_plot
```
Now we are adding color by player and separating into small multiples or facets that represent the play type. Any categorical variable could be used in this fashion - you could essentially build the plot 1x and then facet by a factor to generate N versions of that graph all plotted together.
### Let's raise the bar
```{r}
bal_colors <- nflfastR::teams_colors_logos %>%
filter(team_abbr == "BAL")
rb_trio_plot +
geom_hline(yintercept = 0.03, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(bal_colors$team_color, bal_colors$team_color2, bal_colors$team_color3)) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Ingram was the most efficient receiver, while Edwards was more efficient rushing",
subtitle = "Hill was a distant third in both categories",
caption = "Data: @nflfastR | Plot: @thomas_mock"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 1),
panel.ontop = TRUE,
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 8, 1)
)
```
### Or keep it more traditional
```{r}
rb_trio_plot +
geom_hline(yintercept = 0, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(bal_colors$team_color, bal_colors$team_color2, bal_colors$team_color3)) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Ingram was the most efficient receiver, while Edwards was more efficient rushing",
subtitle = "Hill was a distant third in both categories",
caption = "Data: @nflfastR | Plot: @thomas_mock"
) +
theme(
panel.grid.major.x = element_blank(),
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 8, 1)
)
```
### Flip the bar
```{r}
epa_play <- pbp %>%
filter(pass == 1, !is.na(posteam)) %>%
group_by(posteam) %>%
summarize(
n = n(),
epa_per_db = sum(epa, na.rm = TRUE) / n,
success_rate = sum(epa, na.rm = TRUE) / n
)
epa_play %>%
ggplot(aes(x = posteam, y = epa_per_db)) +
geom_col()
```
This could be a useful summary, but there's a few issues.
* Teams arranged by alphabetical name, which is not that useful
* x-axis is hard to read (worse so if you had full team names)
So let's try rotating the bar plot.
```{r}
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db), )) +
geom_col()
```
Note that this works perfectly in `ggplot2` version 3.2 or greater! Instead of just swapping the x and y axes, we could also have used `coord_flip()` - this will rotate the plot as well.
```{r}
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "green", "red"))) +
coord_flip() +
scale_fill_identity()
```
Now this is more readable, clearly arranged by the strong passing vs weak passing teams, but still could be improved. Namely, red/green is not ideal for [color-blindness](https://www.visualisingdata.com/2015/11/colour-swatch-alternatives-to-green-and-red/), and the default red/green are pretty abrasively bright! However, we can still improve the grid lines (don't need horizontal), add some better labels, and finish out the plot.
```{r}
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db))) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
scale_fill_identity() +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
```
### Bar plot alternatives
There are some alternative reproducible methods for various bar plots on one of my [other guides](https://gist.github.com/jthomasmock/2db9db2c534a48af9e2330758be90b8b).
#### How about a lollipop?
Basic form:
* `ggplot(aes(x = category, y = value)) + geom_col(width = 0.2) + geom_point()`
```{r}
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db))) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
width = 0.2
) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 5
) +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
```
#### Or a direct labeled bar
```{r}
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db))) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
hjust = if_else(epa_per_db > 0, -0.1, 1.1)
),
fontface = "bold"
) +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
```
#### Or dropping the bar completely
```{r}
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db))) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 3
) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")
),
nudge_x = if_else(epa_play$epa_per_db > 0, 0.02, -0.02),
fontface = "bold"
) +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
```
In this case, the Y-axis is essentially rank - you could also revert back to just doing this as team logos or adding another variable on the y-axis. This plot is ink efficient, but also has a LOT of unused white space as a result. As such, I don't think it is a "great" plot.
### Team Logo Instead
You can use the `ggimage` package to plot team logos. Note that for many plots you'll want to explicitly use an aspect ratio to prevent the logos from getting skewed. Note that I'm defining the `asp_ratio` first, and then using it in both `geom_image()` and in `theme()`.
```
geom_image(aes(image = team_logo_wikipedia), size = 0.035, by = "width", asp = asp_ratio)
```
```
# Inverted asp_ratio used in theme
# this may be necessary on windows
aspect.ratio = 1/asp_ratio
```
```{r}
library(ggimage)
asp_ratio <- 1.618
epa_logos <- epa_play %>%
left_join(nflfastR::teams_colors_logos, by = c("posteam" = "team_abbr")) %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db))) +
geom_image(aes(image = team_logo_wikipedia), size = 0.035, by = "width", asp = asp_ratio) +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank(),
aspect.ratio = 1/asp_ratio
) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflfastR | Plot: @thomas_mock"
)
epa_logos
```
When you go to save, make sure to also use the aspect ratio here.
```{r}
asp_ratio <- 1.618
width <- 10
ggsave("logo_plot_asp.png", epa_logos, width = width, height = width/asp_ratio, dpi = "retina")
# load the saved image back into our RMarkdown doc
knitr::include_graphics("logo_plot_asp.png")
```
## Scatter plots
Back to stealing from Ben - who has done a great job generating interesting scatter plots.
Step 2 generates our summary dataframe with a few plays of interest. ALWAYS remember to add an `ungroup()` as otherwise the grouped assignment lives on in the dataset.
```{r}
qbs <- pbp %>%
filter(
play_type %in% c("pass", "run"),
penalty == 0,
!is.na(epa)
) %>%
group_by(name, posteam) %>%
summarize(
n_dropbacks = sum(pass),
n_rush = sum(rush),
n_plays = n(),
epa_per_play = sum(epa) / n_plays,
success_per_play = sum(success) / n_plays
) %>%
filter(n_dropbacks >= 100) %>%
ungroup() # always ungroup if you no longer need the grouping effect
```
### Basic Scatterplot
Basic form:
* `ggplot(aes(x = value, y = other_value)) + geom_point()`
```{r}
qb_success_rate <- qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
geom_point() +
labs(x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2019, min 100 pass attempts, includes all QB's rush and pass plays") +
theme_bw() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12))
qb_success_rate
```
This is a nice plot, minorly scaled back from Ben's example code. There is a clear linear relationship between succcess rate (EPA > 0) and EPA per Play, which makes sense.
### Add reference lines
We could add back in a few of Ben's code examples to improve it.
```{r}
qb_success_rate +
geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
geom_vline(xintercept = mean(qbs$success_per_play), color = "red", linetype = "dashed")
```
This adds lines at the averages for each axis to help with comparison.
We could also accomplish this with the code below. In the below example, it is initially more verbose but also gives you a saved data point to work with, and could be useful if for example you wanted to do a `group_by` summary or a `filter`, basically anything beyond just a pure mean.
```{r}
qb_epa_per_play <- qbs %>%
summarize(mean = mean(epa_per_play)) %>%
pull(mean)
qb_success_per_play <- qbs %>%
summarize(mean = mean(success_per_play)) %>%
pull(mean)
qb_success_rate +
geom_hline(yintercept = qb_epa_per_play, color = "red", linetype = "dashed") +
geom_vline(xintercept = qb_success_per_play, color = "red", linetype = "dashed")
```
### Add linear trendline
We could also add a linear trendline to this plot. Either method shown below is valid, but stat_smooth allows for some additional customization.
```{r}
qb_success_rate +
stat_smooth(method = "lm", geom = "line", alpha = 0.5, se = FALSE, color = "red", size = 1)
```
```{r}
qb_success_rate +
geom_smooth(method = "lm", se = FALSE, color = "red")
```
### More than 2 Variables
Now Ben has 2x variables assigned as aesthetics in this plot, success rate as X, EPA/play as Y.
He also added a 3rd variable (size) as an aesthetic. Importantly, because we are putting size and color INSIDE `aes()` we get to use traditional `tidyverse` evaluation, so we can reference columns directly, like you see with n_plays and posteam.
```{r}
qbs %>%
ggplot(
aes(x = success_per_play, y = epa_per_play)
) +
# Notice that color/size inside aes()
geom_point(
aes(
color = if_else(posteam == "NO", "gold", "black"),
size = n_plays / 60
),
alpha = 0.50
) +
# we need this to assign red/black to the actual color
scale_color_identity() +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2019, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)