library(tidyverse)
library(tidymodels)

Unit 9 (Advanced Models: Decision Trees, Bagging Trees, and Random Forest)
Assignment overview
This is the key for the unit 9 assignment, which predicts overall ratings of soccer players from a variety of characteristics about the player (e.g., age, nationality) and their soccer-playing skills (e.g., stats on balance, dribbling, etc.).
Setup
Load in your tidy packages
Set up your conflicts policy
options(conflicts.policy = "depends.ok")
conflictRules("Matrix", mask.ok = c("expand", "pack", "unpack"))Load additional packages
library(ranger)
library(xfun, include.only = "cache_rds")
library(future)

Source function scripts (John’s or your own)
source("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")Specify global settings
Since we are going to use cache_rds(), we also include rerun_setting <- FALSE in this chunk.
theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max=Inf)
rerun_setting <- FALSE

Set up parallel processing
Here we demonstrate initializing parallel processing for the entire script. Alternatively, you can initiate parallel processing only around the chunks of code you know you want to run in parallel, as we have demonstrated in lab; a sketch of that chunk-local approach follows the plan() call below.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
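A minimal sketch of that chunk-local alternative (same future calls, just scoped to the expensive code):

# turn on parallel workers right before the expensive tuning code...
plan(multisession, workers = parallel::detectCores(logical = FALSE))
# ... run the resampling/tuning chunk here ...
# ...then return to sequential processing immediately afterward
plan(sequential)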
Create file path object

path_data <- "application_assignments/unit_09"

Prep data
First we prepare the data to be used to fit random forest models predicting overall player ratings from all predictor variables.
Read in fifa.csv
data_all <- read_csv(here::here(path_data, "fifa.csv"), col_types = cols()) |>
glimpse()

Rows: 2,000
Columns: 18
$ name <chr> "j_george", "j_abu_hanna", "n_stefanelli", "b…
$ nationality <chr> "germany", "germany", "argentina", "iceland",…
$ age <dbl> 25, 20, 24, 27, 29, 28, 17, 36, 30, 19, 24, 2…
$ weight <dbl> 165, 176, 157, 190, 161, 207, 181, 179, 161, …
$ dribbling <dbl> 73, 46, 70, 73, 37, 15, 29, 64, 66, 62, 79, 6…
$ interceptions <dbl> 21, 58, 39, 42, 68, 17, 43, 63, 65, 34, 68, 7…
$ heading_accuracy <dbl> 59, 60, 55, 68, 76, 13, 53, 74, 61, 40, 52, 6…
$ reactions <dbl> 67, 57, 60, 74, 65, 54, 43, 63, 57, 56, 74, 7…
$ balance <dbl> 85, 66, 92, 76, 62, 41, 60, 61, 69, 73, 91, 7…
$ jumping <dbl> 61, 77, 91, 78, 75, 61, 51, 88, 72, 51, 40, 7…
$ stamina <dbl> 83, 60, 73, 90, 64, 39, 46, 64, 77, 58, 80, 7…
$ strength <dbl> 72, 77, 55, 85, 76, 80, 54, 88, 64, 52, 69, 7…
$ composure <dbl> 59, 48, 56, 76, 58, 41, 46, 78, 61, 60, 78, 6…
$ preferred_foot <chr> "right", "left", "right", "right", "right", "…
$ international_reputation <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, …
$ work_rate <chr> "high_medium", "medium_medium", "medium_high"…
$ position <chr> "am", "df", "st", "st", "df", "gk", "df", "df…
$ overall <dbl> 69, 63, 67, 73, 70, 61, 50, 71, 64, 59, 75, 7…
# light cleaning EDA
data_all <- data_all |>
janitor::clean_names("snake") |>
mutate(across(where(is.character), tidy_responses)) |>  # tidy character responses before converting them to factors
mutate(across(where(is.character), factor)) |>
glimpse()

Rows: 2,000
Columns: 18
$ name <fct> j_george, j_abu_hanna, n_stefanelli, b_sigur_…
$ nationality <fct> germany, germany, argentina, iceland, colombi…
$ age <dbl> 25, 20, 24, 27, 29, 28, 17, 36, 30, 19, 24, 2…
$ weight <dbl> 165, 176, 157, 190, 161, 207, 181, 179, 161, …
$ dribbling <dbl> 73, 46, 70, 73, 37, 15, 29, 64, 66, 62, 79, 6…
$ interceptions <dbl> 21, 58, 39, 42, 68, 17, 43, 63, 65, 34, 68, 7…
$ heading_accuracy <dbl> 59, 60, 55, 68, 76, 13, 53, 74, 61, 40, 52, 6…
$ reactions <dbl> 67, 57, 60, 74, 65, 54, 43, 63, 57, 56, 74, 7…
$ balance <dbl> 85, 66, 92, 76, 62, 41, 60, 61, 69, 73, 91, 7…
$ jumping <dbl> 61, 77, 91, 78, 75, 61, 51, 88, 72, 51, 40, 7…
$ stamina <dbl> 83, 60, 73, 90, 64, 39, 46, 64, 77, 58, 80, 7…
$ strength <dbl> 72, 77, 55, 85, 76, 80, 54, 88, 64, 52, 69, 7…
$ composure <dbl> 59, 48, 56, 76, 58, 41, 46, 78, 61, 60, 78, 6…
$ preferred_foot <fct> right, left, right, right, right, right, righ…
$ international_reputation <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, …
$ work_rate <fct> high_medium, medium_medium, medium_high, high…
$ position <fct> am, df, st, st, df, gk, df, df, mf, mf, mf, m…
$ overall <dbl> 69, 63, 67, 73, 70, 61, 50, 71, 64, 59, 75, 7…
# no missing data
data_all |>
skim_some()

| Name | data_all |
| Number of rows | 2000 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| factor | 5 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 0 | 1 | FALSE | 1984 | a_d: 2, a_m: 2, j_c: 2, j_h: 2 |
| nationality | 0 | 1 | FALSE | 109 | eng: 179, ger: 139, spa: 104, arg: 101 |
| preferred_foot | 0 | 1 | FALSE | 2 | rig: 1539, lef: 461 |
| work_rate | 0 | 1 | FALSE | 9 | med: 1069, hig: 357, med: 170, hig: 134 |
| position | 0 | 1 | FALSE | 6 | df: 657, mf: 476, st: 314, gk: 223 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | p0 | p100 |
|---|---|---|---|---|
| age | 0 | 1 | 16 | 40 |
| weight | 0 | 1 | 117 | 243 |
| dribbling | 0 | 1 | 5 | 95 |
| interceptions | 0 | 1 | 5 | 92 |
| heading_accuracy | 0 | 1 | 8 | 92 |
| reactions | 0 | 1 | 30 | 90 |
| balance | 0 | 1 | 19 | 96 |
| jumping | 0 | 1 | 29 | 94 |
| stamina | 0 | 1 | 13 | 94 |
| strength | 0 | 1 | 26 | 97 |
| composure | 0 | 1 | 12 | 93 |
| international_reputation | 0 | 1 | 1 | 4 |
| overall | 0 | 1 | 46 | 91 |
Initial train/test split
set.seed(123456)
splits_test <- data_all |>
initial_split(prop = 2/3, strata = "overall")
data_trn <- splits_test |>
analysis()
data_test <- splits_test |>
assessment()

Recipe
Decision trees and random forests have low feature engineering requirements (not to say that some feature engineering can’t help your model!). Since we have no missing data, the recipes for our decision tree and random forest models can be the same. If we had missing data, we would need to impute or use another missing-data strategy before the recipe could be used with random forest.
The only required step is to make sure that the name variable (the players’ names) is not used as a predictor in your models. You could also remove name during cleaning if you won’t use it at all.
step_other() is not required, but remember that while you don’t NEED to do anything with your factors, performance may still be improved by collapsing levels, etc. In this case, step_other() on nationality can lower the variance of the solution because it collapses a variable with a large number of levels.
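For illustration, if this dataset did have missing values, the recipe could handle them with imputation steps before the random forest is fit. A minimal sketch (hypothetical; our data are complete, so these steps are not needed here):

rec_impute <- recipe(overall ~ ., data = data_trn) |>
step_rm(name) |>
step_impute_median(all_numeric_predictors()) |>  # fill numeric missingness with training-set medians
step_impute_mode(all_nominal_predictors()) |>    # fill factor missingness with the most common level
step_other(nationality, threshold = .05)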
rec <- recipe(overall ~ ., data = data_trn) |>
step_rm(name) |>
step_other(nationality, threshold = .05)
rec_prep <- rec |>
prep(data_trn)
feat_trn <- rec_prep |>
bake(NULL)
feat_trn |> skim_some()

| Name | feat_trn |
| Number of rows | 1331 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| nationality | 0 | 1 | FALSE | 5 | oth: 969, eng: 126, ger: 91, spa: 73 |
| preferred_foot | 0 | 1 | FALSE | 2 | rig: 1026, lef: 305 |
| work_rate | 0 | 1 | FALSE | 9 | med: 713, hig: 226, med: 121, hig: 88 |
| position | 0 | 1 | FALSE | 6 | df: 446, mf: 308, st: 210, gk: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | p0 | p100 |
|---|---|---|---|---|
| age | 0 | 1 | 16 | 40 |
| weight | 0 | 1 | 117 | 243 |
| dribbling | 0 | 1 | 5 | 95 |
| interceptions | 0 | 1 | 5 | 88 |
| heading_accuracy | 0 | 1 | 8 | 92 |
| reactions | 0 | 1 | 30 | 90 |
| balance | 0 | 1 | 20 | 96 |
| jumping | 0 | 1 | 29 | 94 |
| stamina | 0 | 1 | 13 | 93 |
| strength | 0 | 1 | 26 | 97 |
| composure | 0 | 1 | 12 | 91 |
| international_reputation | 0 | 1 | 1 | 4 |
| overall | 0 | 1 | 46 | 91 |
Decision tree
Let’s start out by fitting a standard decision tree.
First, we create bootstrap splits. In practice you would want to consider more bootstraps, but to lower computational costs we use 100 here.
set.seed(123456)
splits_boot <- data_trn |>
bootstraps(times = 100, strata = "overall")
splits_boot

# Bootstrap sampling using stratification
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [1331/488]> Bootstrap001
2 <split [1331/482]> Bootstrap002
3 <split [1331/505]> Bootstrap003
4 <split [1331/498]> Bootstrap004
5 <split [1331/503]> Bootstrap005
6 <split [1331/473]> Bootstrap006
7 <split [1331/505]> Bootstrap007
8 <split [1331/453]> Bootstrap008
9 <split [1331/485]> Bootstrap009
10 <split [1331/474]> Bootstrap010
11 <split [1331/492]> Bootstrap011
12 <split [1331/483]> Bootstrap012
13 <split [1331/498]> Bootstrap013
14 <split [1331/500]> Bootstrap014
15 <split [1331/476]> Bootstrap015
16 <split [1331/490]> Bootstrap016
17 <split [1331/490]> Bootstrap017
18 <split [1331/477]> Bootstrap018
19 <split [1331/512]> Bootstrap019
20 <split [1331/489]> Bootstrap020
21 <split [1331/497]> Bootstrap021
22 <split [1331/478]> Bootstrap022
23 <split [1331/502]> Bootstrap023
24 <split [1331/489]> Bootstrap024
25 <split [1331/480]> Bootstrap025
26 <split [1331/482]> Bootstrap026
27 <split [1331/460]> Bootstrap027
28 <split [1331/496]> Bootstrap028
29 <split [1331/491]> Bootstrap029
30 <split [1331/501]> Bootstrap030
31 <split [1331/496]> Bootstrap031
32 <split [1331/486]> Bootstrap032
33 <split [1331/493]> Bootstrap033
34 <split [1331/499]> Bootstrap034
35 <split [1331/479]> Bootstrap035
36 <split [1331/499]> Bootstrap036
37 <split [1331/501]> Bootstrap037
38 <split [1331/492]> Bootstrap038
39 <split [1331/503]> Bootstrap039
40 <split [1331/507]> Bootstrap040
41 <split [1331/488]> Bootstrap041
42 <split [1331/491]> Bootstrap042
43 <split [1331/500]> Bootstrap043
44 <split [1331/499]> Bootstrap044
45 <split [1331/504]> Bootstrap045
46 <split [1331/503]> Bootstrap046
47 <split [1331/498]> Bootstrap047
48 <split [1331/491]> Bootstrap048
49 <split [1331/486]> Bootstrap049
50 <split [1331/506]> Bootstrap050
51 <split [1331/491]> Bootstrap051
52 <split [1331/488]> Bootstrap052
53 <split [1331/494]> Bootstrap053
54 <split [1331/503]> Bootstrap054
55 <split [1331/477]> Bootstrap055
56 <split [1331/490]> Bootstrap056
57 <split [1331/493]> Bootstrap057
58 <split [1331/473]> Bootstrap058
59 <split [1331/482]> Bootstrap059
60 <split [1331/474]> Bootstrap060
61 <split [1331/496]> Bootstrap061
62 <split [1331/499]> Bootstrap062
63 <split [1331/486]> Bootstrap063
64 <split [1331/463]> Bootstrap064
65 <split [1331/471]> Bootstrap065
66 <split [1331/497]> Bootstrap066
67 <split [1331/501]> Bootstrap067
68 <split [1331/477]> Bootstrap068
69 <split [1331/501]> Bootstrap069
70 <split [1331/499]> Bootstrap070
71 <split [1331/486]> Bootstrap071
72 <split [1331/491]> Bootstrap072
73 <split [1331/493]> Bootstrap073
74 <split [1331/486]> Bootstrap074
75 <split [1331/457]> Bootstrap075
76 <split [1331/481]> Bootstrap076
77 <split [1331/471]> Bootstrap077
78 <split [1331/472]> Bootstrap078
79 <split [1331/472]> Bootstrap079
80 <split [1331/497]> Bootstrap080
81 <split [1331/486]> Bootstrap081
82 <split [1331/488]> Bootstrap082
83 <split [1331/498]> Bootstrap083
84 <split [1331/505]> Bootstrap084
85 <split [1331/490]> Bootstrap085
86 <split [1331/482]> Bootstrap086
87 <split [1331/479]> Bootstrap087
88 <split [1331/481]> Bootstrap088
89 <split [1331/473]> Bootstrap089
90 <split [1331/488]> Bootstrap090
91 <split [1331/483]> Bootstrap091
92 <split [1331/487]> Bootstrap092
93 <split [1331/483]> Bootstrap093
94 <split [1331/496]> Bootstrap094
95 <split [1331/506]> Bootstrap095
96 <split [1331/498]> Bootstrap096
97 <split [1331/486]> Bootstrap097
98 <split [1331/494]> Bootstrap098
99 <split [1331/467]> Bootstrap099
100 <split [1331/481]> Bootstrap100
Then, we set up our grid to search for hyperparameters.
Note: You would normally want to tune tree_depth as well, but we have you fix it at 10 to lower computational costs for this assignment.
- A shallow tree_depth can lead to a biased model!
- min_n is the minimum number of observations a node must contain for a split to be attempted (decision tree definition).
- cost_complexity prunes the tree; the larger the value, the more pruning.
- With levels = 4 we get 16 combinations of values (4 x 4); it would be 64 (4 x 4 x 4) if we also tuned tree_depth (see the sketch after the grid output below).
grid_tree <- grid_regular(cost_complexity(), min_n(), levels = 4)
grid_tree

# A tibble: 16 × 2
cost_complexity min_n
<dbl> <int>
1 0.0000000001 2
2 0.0000001 2
3 0.0001 2
4 0.1 2
5 0.0000000001 14
6 0.0000001 14
7 0.0001 14
8 0.1 14
9 0.0000000001 27
10 0.0000001 27
11 0.0001 27
12 0.1 27
13 0.0000000001 40
14 0.0000001 40
15 0.0001 40
16 0.1 40
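As noted above, if we had also tuned tree_depth, the grid would grow to 4 x 4 x 4 = 64 candidates. A minimal sketch of that fuller grid (not used in this key), assuming the dials default range for tree_depth():

grid_tree_full <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
nrow(grid_tree_full)  # 64 candidate combinations; tree_depth = tune() would then replace tree_depth = 10 in the model spec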
Now let’s fit our decision tree.
fits_tree <-
cache_rds(
expr = {
decision_tree(
cost_complexity = tune(),
tree_depth = 10,
min_n = tune()) |>
set_engine("rpart") |>
set_mode("regression") |>
tune_grid(preprocessor = rec,
resamples = splits_boot,
grid = grid_tree,
metrics = metric_set(rmse))
},
rerun = rerun_setting,
dir = "cache/",
file = "fits_tree")We can use autoplot() to view performance by hyperparameter values.
autoplot(fits_tree)

There was no large difference between min_n = 27 and min_n = 40, so there would be little further benefit to increasing the minimum node size beyond 40; this suggests we considered a reasonable range of values. With respect to cost complexity, the best value is already very near zero (so essentially no penalty), and we can’t go any lower than that; we are doing almost no pruning at all.
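If a best value had instead sat at an edge we could extend (for example, the upper end of cost_complexity), the search range can be widened directly in dials. A minimal sketch, assuming the default log10 parameterization of cost_complexity():

# hypothetical wider grid: cost_complexity() defaults to range = c(-10, -1) on the log10 scale,
# so raising the upper bound to 0 allows penalties up to 1 (much heavier pruning)
grid_tree_wide <- grid_regular(cost_complexity(range = c(-10, 0)), min_n(), levels = 4)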
Let’s take a look at our best performances.
show_best(fits_tree)

Warning in show_best(fits_tree): No value of `metric` was given; "rmse" will be used.
# A tibble: 5 × 8
cost_complexity min_n .metric .estimator mean n std_err .config
<dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0000000001 40 rmse standard 3.34 100 0.0140 pre0_mod04_post0
2 0.0000001 40 rmse standard 3.34 100 0.0140 pre0_mod08_post0
3 0.0001 40 rmse standard 3.34 100 0.0140 pre0_mod12_post0
4 0.0001 27 rmse standard 3.35 100 0.0137 pre0_mod11_post0
5 0.0000000001 27 rmse standard 3.35 100 0.0137 pre0_mod03_post0
Now, we’ll train our best model on the training set.
best_decision_tree <-
decision_tree(cost_complexity = select_best(fits_tree)$cost_complexity,
tree_depth = 10,
min_n = select_best(fits_tree)$min_n) |>
set_engine("rpart", model = TRUE) |>
set_mode("regression") |>
fit(overall ~ ., data = feat_trn)

Warning in select_best(fits_tree): No value of `metric` was given; "rmse" will be used.
No value of `metric` was given; "rmse" will be used.
Let’s plot our decision tree and take a look at it!
best_decision_tree$fit |> rpart.plot::rpart.plot()

best_decision_tree$fit |> rpart.plot::rpart.plot(fallen.leaves = FALSE)

Random Forest
Now let’s see what bagging and decorrelating the base learner trees can do for our model’s performance. Since bagging performs bootstrapping internally and we are using bootstrapping for selection of tuning parameters, we have a good old fashioned double bootstrap!
Let’s make bootstrap splits for selection.
set.seed(123456)
splits_boot <- data_trn |>
bootstraps(times = 100, strata = "overall")

Now, let’s make a tuning grid and fit the random forest.
Note: You would likely want to consider more than 100 bootstraps and more than 95 trees; these were limited for computational purposes in this assignment.
- mtry is the number of predictors randomly sampled as split candidates at each split; it should not exceed the total number of predictors (see the sketch after the grid output below).
# mtry should go up to number of predictors
grid_rf <- expand_grid(mtry = c(2, 5, 10, 16),
min_n = c(1, 2, 5, 10))
grid_rf

# A tibble: 16 × 2
mtry min_n
<dbl> <dbl>
1 2 1
2 2 2
3 2 5
4 2 10
5 5 1
6 5 2
7 5 5
8 5 10
9 10 1
10 10 2
11 10 5
12 10 10
13 16 1
14 16 2
15 16 5
16 16 10
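As referenced above, one quick check on the upper end of the mtry grid is to count the predictors that remain after the recipe is applied. A minimal sketch using the baked training features:

p <- feat_trn |> select(-overall) |> ncol()  # 16 predictors once name is removed
p
# common heuristics put mtry near sqrt(p) or p/3; our grid brackets those values and runs up to p itself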
# Fit random forest
fits_rf <-
cache_rds(
expr = {
rand_forest(
trees = 95,
mtry = tune(),
min_n = tune()) |>
set_engine("ranger",
respect.unordered.factors = "order",
oob.error = FALSE,
seed = 102030) |> # set a seed so ranger's internal bootstrap samples (bagging) are reproducible
set_mode("regression") |>
tune_grid(preprocessor = rec,
resamples = splits_boot,
grid = grid_rf,
metrics = metric_set(rmse))
},
rerun = rerun_setting,
dir = "cache/",
file = "fits_rf")
collect_metrics(fits_rf, summarize = TRUE)

# A tibble: 16 × 8
mtry min_n .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2 1 rmse standard 2.70 100 0.0110 pre0_mod01_post0
2 2 2 rmse standard 2.71 100 0.0111 pre0_mod02_post0
3 2 5 rmse standard 2.72 100 0.0110 pre0_mod03_post0
4 2 10 rmse standard 2.73 100 0.0114 pre0_mod04_post0
5 5 1 rmse standard 2.54 100 0.0101 pre0_mod05_post0
6 5 2 rmse standard 2.54 100 0.0100 pre0_mod06_post0
7 5 5 rmse standard 2.54 100 0.0101 pre0_mod07_post0
8 5 10 rmse standard 2.55 100 0.0101 pre0_mod08_post0
9 10 1 rmse standard 2.58 100 0.00958 pre0_mod09_post0
10 10 2 rmse standard 2.58 100 0.00943 pre0_mod10_post0
11 10 5 rmse standard 2.58 100 0.00940 pre0_mod11_post0
12 10 10 rmse standard 2.59 100 0.00943 pre0_mod12_post0
13 16 1 rmse standard 2.63 100 0.00951 pre0_mod13_post0
14 16 2 rmse standard 2.63 100 0.00953 pre0_mod14_post0
15 16 5 rmse standard 2.63 100 0.00955 pre0_mod15_post0
16 16 10 rmse standard 2.63 100 0.00942 pre0_mod16_post0
Now we’ll look at our hyperparameters. We use these plots to confirm that we selected good combinations of hyperparameters to tune and that the best hyperparameters fall inside the range of values considered (or at a natural boundary, e.g., mtry cannot exceed the number of predictors). We can see a nice dip in our plot, which suggests that we’ve picked a good enough range!
autoplot(fits_rf)

Let’s look at our best tuning values and RMSE.
show_best(fits_rf)

Warning in show_best(fits_rf): No value of `metric` was given; "rmse" will be used.
# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 1 rmse standard 2.54 100 0.0101 pre0_mod05_post0
2 5 2 rmse standard 2.54 100 0.0100 pre0_mod06_post0
3 5 5 rmse standard 2.54 100 0.0101 pre0_mod07_post0
4 5 10 rmse standard 2.55 100 0.0101 pre0_mod08_post0
5 10 2 rmse standard 2.58 100 0.00943 pre0_mod10_post0
And finally, we’ll fit our best model on the full training data.
fit_rf_best <-
rand_forest(trees = 95,  # use the same number of trees as during tuning
mtry = select_best(fits_rf)$mtry,
min_n = select_best(fits_rf)$min_n) |>
set_engine("ranger",
respect.unordered.factors = "order",
oob.error = FALSE,
seed = 102030) |>
set_mode("regression") |>
fit(overall ~ ., data = feat_trn)

Warning in select_best(fits_rf): No value of `metric` was given; "rmse" will be used.
No value of `metric` was given; "rmse" will be used.
Evaluate best model in test
Based on the bootstrapped training rmse, we can see that the bagging and decorrelation provided by the random forest helped reduce model rmse (about 2.54 vs. 3.34 for the single decision tree). So, we’ll evaluate our best random forest model in test!
feat_test <- rec_prep |>
bake(data_test)
rmse_vec(truth = feat_test$overall,
estimate = predict(fit_rf_best, feat_test)$.pred)

[1] 2.509799
Nice – test rmse is not far from training rmse, suggesting our model is not overfit!!!
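If you want more than rmse on the held-out data, the same predictions can be passed to other yardstick metrics. A minimal sketch:

preds_test <- predict(fit_rf_best, feat_test)$.pred  # held-out predictions from the best random forest
mae_vec(truth = feat_test$overall, estimate = preds_test)
rsq_vec(truth = feat_test$overall, estimate = preds_test)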
Here is a visualization of how well the model is doing! Not bad at all. :^)
plot_truth(truth = feat_test$overall,
estimate = predict(fit_rf_best, feat_test)$.pred)

End parallel processing
Let’s go back to sequential processing now that we’ve finished our assignment!
plan(sequential)