library(tidyverse)
library(tidymodels)
Homework Unit 5: Resampling
Introduction
In this assignment, you will select among model configurations using resampling methods. This includes training models with 2 different feature sets (specified in your recipe) and tuning hyperparameters (k in KNN).
Set up
Load required packages
Handle conflicts
options(conflicts.policy = "depends.ok")
Load in additional packages (only what you need)
library(future)
library(xfun, include.only = "cache_rds")
Source function scripts (John’s or your own)
source("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
Specify other global settings
Since we are going to use cache_rds(), we are also going to include rerun_setting <- FALSE in this chunk
theme_set(theme_bw())
options(tibble.width = Inf, dplyr.print_max=Inf)
rerun_setting <- FALSE
Paths
path_data <- "application_assignments/unit_05"
Read in data
Read in the smoking_ema.csv data file.
Note you can set column types (col_types = cols()) or set show_col_types = FALSE if you don’t want the read_csv() output which is redundant with some of the glimpse() output!
data_all <- read_csv(here::here(path_data, "smoking_ema.csv"),
show_col_types = FALSE) |>
glimpse()
Rows: 5,876
Columns: 13
$ subid <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ craving <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, …
$ anxious <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ sad <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0, 3.0,…
$ stressful_event <dbl> 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ positive_event <dbl> 0, 0, 2, 1, 0, 1, 4, 0, 0, 1, 3, 0, 0, 2, 0, 1, 0, 1, …
$ txt <chr> "nrt", "nrt", "nrt", "nrt", "nrt", "nrt", "nrt", "nrt"…
$ prior_cigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lag <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…
Let's skim the data to check mins, maxes, and missing values.
data_all |>
skim_some()
| Name | data_all |
| Number of rows | 5876 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| txt | 0 | 1 | 3 | 7 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | p0 | p100 |
|---|---|---|---|---|
| subid | 0 | 1 | 11011.00 | 22149.00 |
| cigs | 0 | 1 | 0.00 | 1.00 |
| craving | 0 | 1 | 0.00 | 4.00 |
| anxious | 0 | 1 | 0.00 | 4.00 |
| irritable | 0 | 1 | 0.00 | 4.00 |
| sad | 0 | 1 | 0.00 | 4.00 |
| happy | 0 | 1 | 0.00 | 4.00 |
| stressful_event | 0 | 1 | 0.00 | 4.00 |
| positive_event | 0 | 1 | 0.00 | 4.00 |
| prior_cigs | 0 | 1 | 0.00 | 1.00 |
| lag | 0 | 1 | 0.09 | 108.00 |
| time_since_quit | 0 | 1 | 0.10 | 30.03 |
Next, we are going to use the data dictionary to class our variables.
We are doing this on the whole dataset prior to splitting it, so there is no need for a function (i.e., we are only using this code once!).
valence_labels <- c("not at all", "mildly", "moderately", "very", "extremely")
event_labels <- c("no", "mild", "moderate", "very", "extreme")
data_all <- data_all |>
mutate(cigs = factor(cigs, levels = 0:1, labels = c("no", "yes")),
across(c(craving, anxious, irritable, sad, happy),
~ factor(.x, levels = 0:4, labels = valence_labels)),
across(contains("_event"),
~ factor(.x, levels = 0:4, labels = event_labels)),
txt = factor(txt, levels = c("placebo", "nrt")),
prior_cigs = factor(prior_cigs, levels = 0:1, labels = c("no", "yes"))) |>
glimpse()
Rows: 5,876
Columns: 13
$ subid <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ craving <fct> moderately, not at all, not at all, mildly, not at all…
$ anxious <fct> not at all, moderately, not at all, not at all, not at…
$ irritable <fct> mildly, not at all, not at all, not at all, very, mild…
$ sad <fct> not at all, not at all, not at all, not at all, not at…
$ happy <fct> not at all, not at all, very, moderately, moderately, …
$ stressful_event <fct> no, moderate, no, no, no, moderate, no, no, no, no, no…
$ positive_event <fct> no, no, moderate, mild, no, mild, extreme, no, no, mil…
$ txt <fct> nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt,…
$ prior_cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ lag <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…
Let's skim again to check our new variables.
data_all |>
skim_some()
| Name | data_all |
| Number of rows | 5876 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cigs | 0 | 1.00 | FALSE | 2 | no: 4798, yes: 1078 |
| craving | 71 | 0.99 | FALSE | 5 | mil: 1766, not: 1669, mod: 1344, ver: 594 |
| anxious | 56 | 0.99 | FALSE | 5 | not: 2913, mil: 1829, mod: 788, ver: 186 |
| irritable | 61 | 0.99 | FALSE | 5 | not: 3131, mil: 1679, mod: 649, ver: 234 |
| sad | 35 | 0.99 | FALSE | 5 | not: 4176, mil: 1081, mod: 380, ver: 130 |
| happy | 79 | 0.99 | FALSE | 5 | mil: 1792, mod: 1779, not: 1203, ver: 827 |
| stressful_event | 0 | 1.00 | FALSE | 5 | no: 3776, mil: 1057, mod: 620, ver: 264 |
| positive_event | 0 | 1.00 | FALSE | 5 | no: 3108, mil: 1228, mod: 1006, ver: 401 |
| txt | 0 | 1.00 | FALSE | 2 | nrt: 2983, pla: 2893 |
| prior_cigs | 0 | 1.00 | FALSE | 2 | no: 4783, yes: 1093 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | p0 | p100 |
|---|---|---|---|---|
| subid | 0 | 1 | 11011.00 | 22149.00 |
| lag | 0 | 1 | 0.09 | 108.00 |
| time_since_quit | 0 | 1 | 0.10 | 30.03 |
Hmm, now we have missing data. Let's investigate.
If we re-read the raw data, we can look at the different values for our factor variables.
data_all <- read_csv(here::here(path_data, "smoking_ema.csv"),
show_col_types = FALSE)
data_all |>
select(craving:positive_event) |>
map(janitor::tabyl)
$craving
.x[[i]] n percent
0.0000000 1669 0.2840367597
0.5000000 16 0.0027229408
0.6666667 1 0.0001701838
0.7142857 1 0.0001701838
0.7500000 1 0.0001701838
1.0000000 1766 0.3005445882
1.1666667 1 0.0001701838
1.2500000 1 0.0001701838
1.3000000 1 0.0001701838
1.3333333 1 0.0001701838
1.4000000 1 0.0001701838
1.5000000 21 0.0035738598
1.6000000 1 0.0001701838
1.6666667 4 0.0006807352
1.7500000 1 0.0001701838
1.8333333 1 0.0001701838
2.0000000 1344 0.2287270252
2.3333333 1 0.0001701838
2.5000000 12 0.0020422056
2.6666667 1 0.0001701838
3.0000000 594 0.1010891763
3.3333333 2 0.0003403676
3.5000000 3 0.0005105514
4.0000000 432 0.0735194010
$anxious
.x[[i]] n percent
0.0000000 2913 0.4957454050
0.3333333 4 0.0006807352
0.5000000 16 0.0027229408
0.7500000 1 0.0001701838
1.0000000 1829 0.3112661675
1.2000000 1 0.0001701838
1.3333333 2 0.0003403676
1.4000000 1 0.0001701838
1.5000000 22 0.0037440436
1.8333333 1 0.0001701838
2.0000000 788 0.1341048332
2.3333333 1 0.0001701838
2.5000000 4 0.0006807352
3.0000000 186 0.0316541865
3.3333333 1 0.0001701838
3.5000000 2 0.0003403676
4.0000000 104 0.0176991150
$irritable
.x[[i]] n percent
0.0000000 3131 0.5328454731
0.2500000 1 0.0001701838
0.3333333 3 0.0005105514
0.5000000 24 0.0040844112
0.6666667 2 0.0003403676
0.8333333 1 0.0001701838
0.9000000 1 0.0001701838
1.0000000 1679 0.2857385977
1.3333333 1 0.0001701838
1.5000000 15 0.0025527570
1.6000000 1 0.0001701838
1.7500000 1 0.0001701838
2.0000000 649 0.1104492852
2.5000000 7 0.0011912866
2.6666667 1 0.0001701838
3.0000000 234 0.0398230088
3.3333333 1 0.0001701838
3.5000000 2 0.0003403676
4.0000000 122 0.0207624234
$sad
.x[[i]] n percent
0.0000000 4176 0.7106875425
0.1666667 1 0.0001701838
0.5000000 19 0.0032334922
0.6666667 1 0.0001701838
1.0000000 1081 0.1839686862
1.5000000 7 0.0011912866
1.6666667 2 0.0003403676
2.0000000 380 0.0646698434
2.4000000 1 0.0001701838
2.5000000 2 0.0003403676
3.0000000 130 0.0221238938
3.3333333 1 0.0001701838
3.5000000 1 0.0001701838
4.0000000 74 0.0125936011
$happy
.x[[i]] n percent
0.0000000 1203 0.2047311096
0.3333333 1 0.0001701838
0.5000000 18 0.0030633084
0.6666667 2 0.0003403676
0.7500000 1 0.0001701838
1.0000000 1792 0.3049693669
1.3333333 3 0.0005105514
1.4000000 1 0.0001701838
1.5000000 25 0.0042545950
1.6666667 3 0.0005105514
2.0000000 1779 0.3027569775
2.3333333 2 0.0003403676
2.5000000 20 0.0034036760
2.7500000 1 0.0001701838
3.0000000 827 0.1407420014
3.3333333 2 0.0003403676
4.0000000 196 0.0333560245
$stressful_event
.x[[i]] n percent
0 3776 0.64261402
1 1057 0.17988428
2 620 0.10551396
3 264 0.04492852
4 159 0.02705922
$positive_event
.x[[i]] n percent
0 3108 0.52893125
1 1228 0.20898570
2 1006 0.17120490
3 401 0.06824370
4 133 0.02263445
So now we see some of our ordinal variables have levels not in the data dictionary (i.e., they are not 0, 1, 2, 3, or 4).
What should we do? Are these errors that need to be corrected? Well, that's up to you, the data scientist, to figure out! If you had access to the researchers who collected these data, you might start by discussing this with them. Otherwise, you may need to decide on your own how to proceed.
Some possibilities include, but are not limited to:
- Treat the real (non-integer) numbers as errors, set them to NA, and add a step in your recipe to impute them.
- Keep the variables in their raw numeric form.
- Round the variables to the nearest whole number and then convert to factor.
For example, you could use the following code:
data_round <- data_all |>
mutate(across(craving:happy, round))
data_round |>
select(craving:happy) |>
map(janitor::tabyl)
$craving
.x[[i]] n percent
0 1685 0.28675970
1 1774 0.30190606
2 1385 0.23570456
3 597 0.10159973
4 435 0.07402995
$anxious
.x[[i]] n percent
0 2933 0.49914908
1 1834 0.31211709
2 816 0.13886998
3 187 0.03182437
4 106 0.01803948
$irritable
.x[[i]] n percent
0 3159 0.53761062
1 1684 0.28658952
2 673 0.11453370
3 236 0.04016338
4 124 0.02110279
$sad
.x[[i]] n percent
0 4196 0.71409122
1 1082 0.18413887
2 392 0.06671205
3 131 0.02229408
4 75 0.01276378
$happy
.x[[i]] n percent
0 1222 0.20796460
1 1799 0.30616065
2 1829 0.31126617
3 830 0.14125255
4 196 0.03335602
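For completeness, the first option (treating the non-integer values as errors to be imputed) could be sketched as below. This is a hypothetical sketch, not the approach we take here, and step_impute_median() is just one of several imputation steps you could choose.

```r
# Hypothetical sketch of option 1: set non-integer responses to NA so they
# can be imputed later inside the recipe.
data_na <- data_all |>
  mutate(across(craving:happy,
                ~ if_else(.x %in% 0:4, .x, NA_real_)))

# Then, in the recipe, an imputation step such as:
#   step_impute_median(craving, anxious, irritable, sad, happy)
# would fill in the missing values using statistics from the training data.
```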
There might be no right answer in cases like this, and you have to make your best guess. Luckily, we know the people who collected these data. It turns out these scores were averaged across different survey responses, which introduced the non-integer numbers. Since we know these values are valid, we are going to proceed by treating them as numeric variables in our models. We are still going to keep stressful_event and positive_event as ordinal variables since they have no non-integer values, and we will class our other character variables as before.
data_all <- data_all |>
mutate(cigs = factor(cigs, levels = 0:1, labels = c("no", "yes")),
across(contains("_event"),
~ factor(.x, levels = 0:4, labels = event_labels)),
txt = factor(txt, levels = c("placebo", "nrt")),
prior_cigs = factor(prior_cigs, levels = 0:1, labels = c("no", "yes"))) |>
glimpse()
Rows: 5,876
Columns: 13
$ subid <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ craving <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, …
$ anxious <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ sad <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0, 3.0,…
$ stressful_event <fct> no, moderate, no, no, no, moderate, no, no, no, no, no…
$ positive_event <fct> no, no, moderate, mild, no, mild, extreme, no, no, mil…
$ txt <fct> nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt,…
$ prior_cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ lag <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…
Let's skim the data one more time to be sure everything looks good.
data_all |>
skim_some()
| Name | data_all |
| Number of rows | 5876 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 5 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cigs | 0 | 1 | FALSE | 2 | no: 4798, yes: 1078 |
| stressful_event | 0 | 1 | FALSE | 5 | no: 3776, mil: 1057, mod: 620, ver: 264 |
| positive_event | 0 | 1 | FALSE | 5 | no: 3108, mil: 1228, mod: 1006, ver: 401 |
| txt | 0 | 1 | FALSE | 2 | nrt: 2983, pla: 2893 |
| prior_cigs | 0 | 1 | FALSE | 2 | no: 4783, yes: 1093 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | p0 | p100 |
|---|---|---|---|---|
| subid | 0 | 1 | 11011.00 | 22149.00 |
| craving | 0 | 1 | 0.00 | 4.00 |
| anxious | 0 | 1 | 0.00 | 4.00 |
| irritable | 0 | 1 | 0.00 | 4.00 |
| sad | 0 | 1 | 0.00 | 4.00 |
| happy | 0 | 1 | 0.00 | 4.00 |
| lag | 0 | 1 | 0.09 | 108.00 |
| time_since_quit | 0 | 1 | 0.10 | 30.03 |
Part 1: Selecting among model configurations with k-fold cross-validation
Split data
Split your data into 10 folds, stratified on cigs. Don’t forget to set a seed!
set.seed(021802)
splits_kfold <- data_all |>
vfold_cv(v = 10, repeats = 1, strata = "cigs")
splits_kfold
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [5288/588]> Fold01
2 <split [5288/588]> Fold02
3 <split [5288/588]> Fold03
4 <split [5288/588]> Fold04
5 <split [5288/588]> Fold05
6 <split [5288/588]> Fold06
7 <split [5288/588]> Fold07
8 <split [5288/588]> Fold08
9 <split [5290/586]> Fold09
10 <split [5290/586]> Fold10
Feature Set 1
Build a recipe with the following specifications:
- Use cigs as the outcome variable regressed on craving.
- Set up the outcome as a factor with the positive class (“yes”) as the second level.
- Use craving as a feature by simply transforming this ordinal predictor to numeric based on the order of its scores.
Note we already set up our outcome variable as a factor with the positive class as the second level when we classed our data, so we aren't repeating that here. Also, we have decided to keep craving as a numeric variable, so we don't have to do any manipulation here either.
rec_kfold_1 <- recipe(cigs ~ craving, data = data_all)
As a double check, you can prep and bake your recipe on the full data set to see if the variables look correct.
rec_prep <- rec_kfold_1 |>
prep(data_all)
feat_all <- rec_prep |>
bake(data_all)
feat_all |>
glimpse()
Rows: 5,876
Columns: 2
$ craving <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, 2, 1, 0,…
$ cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
Fit the model with k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_1 as your recipe, and accuracy as your metric.
We are going to also use cache_rds as we fit our models. Hint: wrap your fit_resamples() call inside cache_rds(expr = { ... }, dir = "cache/", file = "meaningful_name", rerun = rerun_setting). See the web book for an example.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
fits_kfold_1 <- cache_rds(
expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_kfold_1,
resamples = splits_kfold,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "fits_kfold_1",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates. Use the collect_metrics() function to make a table of the held-out performance estimates from the 10 folds.
metrics_kfold_1 <- collect_metrics(fits_kfold_1, summarize = FALSE)
metrics_kfold_1
# A tibble: 10 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold01 accuracy binary 0.816 pre0_mod0_post0
2 Fold02 accuracy binary 0.816 pre0_mod0_post0
3 Fold03 accuracy binary 0.816 pre0_mod0_post0
4 Fold04 accuracy binary 0.816 pre0_mod0_post0
5 Fold05 accuracy binary 0.816 pre0_mod0_post0
6 Fold06 accuracy binary 0.816 pre0_mod0_post0
7 Fold07 accuracy binary 0.816 pre0_mod0_post0
8 Fold08 accuracy binary 0.816 pre0_mod0_post0
9 Fold09 accuracy binary 0.817 pre0_mod0_post0
10 Fold10 accuracy binary 0.817 pre0_mod0_post0
Plot a histogram of the performance estimates. Hint: you can use plot_hist() from fun_plots.R.
metrics_kfold_1 |> plot_hist(".estimate")
Print the average performance over folds with the summarize = TRUE argument.
collect_metrics(fits_kfold_1, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.817 10 0.000144 pre0_mod0_post0
Feature Set 2
Build a second recipe with the following specifications:
- Use cigs as the outcome variable regressed on all variables.
- Set up the outcome as a factor with the positive class (“yes”) as the second level.
- Set up other variables as factors where needed.
- Apply dummy coding.
- Create an interaction between stressful_event and lag.
- Remove the subid variable with step_rm().
Note we already classed our character variables as factors when we read the data in, so we don't need to do this again.
rec_kfold_2 <- recipe(cigs ~ ., data = data_all) |>
step_mutate_at(c(all_factor_predictors(), -txt, -prior_cigs), fn = as.numeric) |>
step_dummy(txt, prior_cigs) |>
step_interact(~ stressful_event:lag) |>
step_rm(subid)
Again, check that your recipe is creating features the way you expect.
rec_prep <- rec_kfold_2 |>
prep(data_all)
feat_all <- rec_prep |>
bake(data_all)
feat_all |>
glimpse()
Rows: 5,876
Columns: 13
$ craving <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, …
$ anxious <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0…
$ sad <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0…
$ stressful_event <dbl> 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ positive_event <dbl> 1, 1, 3, 2, 1, 2, 5, 1, 1, 2, 4, 1, 1, 3, 1, 2, …
$ lag <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.15222…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.66…
$ cigs <fct> no, no, no, no, no, no, no, no, no, no, no, no, …
$ txt_nrt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ prior_cigs_yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stressful_event_x_lag <dbl> 6.069444, 9.919167, 4.544444, 11.491111, 6.15222…
Fit the model with k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_2 as your recipe, and accuracy as your metric.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
fits_kfold_2 <- cache_rds(
expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_kfold_2,
resamples = splits_kfold,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "fits_kfold_2",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates. Use the collect_metrics() function to make a table of the held-out performance estimates.
metrics_kfold_2 <- collect_metrics(fits_kfold_2, summarize = FALSE)
metrics_kfold_2
# A tibble: 10 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold01 accuracy binary 0.861 pre0_mod0_post0
2 Fold02 accuracy binary 0.866 pre0_mod0_post0
3 Fold03 accuracy binary 0.876 pre0_mod0_post0
4 Fold04 accuracy binary 0.850 pre0_mod0_post0
5 Fold05 accuracy binary 0.838 pre0_mod0_post0
6 Fold06 accuracy binary 0.866 pre0_mod0_post0
7 Fold07 accuracy binary 0.888 pre0_mod0_post0
8 Fold08 accuracy binary 0.854 pre0_mod0_post0
9 Fold09 accuracy binary 0.863 pre0_mod0_post0
10 Fold10 accuracy binary 0.848 pre0_mod0_post0
Plot a histogram of the performance estimates.
metrics_kfold_2 |> plot_hist(".estimate")
Print the average performance over folds with the summarize = TRUE argument.
collect_metrics(fits_kfold_2, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.861 10 0.00451 pre0_mod0_post0
Selecting the best model configuration
Looking back at your two average performance estimates (one from each feature set), which model configuration would you select and why?
Looking back at our two average performance estimates (one from each feature set), we would select model configuration 2. It has substantially higher average accuracy across held-out folds (0.861 vs. 0.817), and although its standard error is larger than configuration 1's, it is still quite small.
Part 2: Tuning hyperparameters with bootstrap resampling
Split data
Split your data using bootstrap. Stratify on cigs and use 100 bootstraps. Don’t forget to set a seed!
set.seed(021802)
splits_boot <- data_all |>
bootstraps(times = 100, strata = "cigs")
splits_boot |> print(n = 5)
# Bootstrap sampling using stratification
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [5876/2129]> Bootstrap001
2 <split [5876/2158]> Bootstrap002
3 <split [5876/2176]> Bootstrap003
4 <split [5876/2219]> Bootstrap004
5 <split [5876/2165]> Bootstrap005
# ℹ 95 more rows
Set up hyperparameter grid
Create a tibble with all values of the hyperparameter (k) you will consider. Remember that this dataset has a lot of observations in it (~6000), so starting with some higher k-values than we’re used to testing may be helpful.
hyper_grid <- expand.grid(neighbors = seq(5, 205, by = 20))
Feature Set 1
Build a recipe with the following specifications:
- Use cigs as the outcome variable regressed on craving.
- Set up the outcome as a factor with the positive class (“yes”) as the second level.
- Use the same numeric transformation of craving as before
This is the same recipe as earlier so you can just re-assign it here (e.g., rec_boot_1 <- rec_kfold_1). Also remember even though this is going to be used in a KNN model, we don’t have to scale or range correct if there is only one predictor!
rec_boot_1 <- rec_kfold_1
Tune the model with bootstrapping for cross-validation. Use KNN classification as your statistical algorithm, rec_boot_1 as your recipe, hyper_grid as your tuning grid, and accuracy as your metric. Hint: use tune_grid() instead of fit_resamples(), and specify neighbors = tune() in your algorithm. Don’t forget to pass grid = hyper_grid.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
tune_boot_1 <- cache_rds(
expr = {
nearest_neighbor(neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification") |>
tune_grid(preprocessor = rec_boot_1,
resamples = splits_boot,
grid = hyper_grid,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "tune_boot_1",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates across the OOB held-out sets. Remember that you will now see one estimate per hyperparameter value, averaged across bootstraps.
collect_metrics(tune_boot_1, summarize = TRUE) |>
print_kbl()
| neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 5 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod01_post0 |
| 25 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod02_post0 |
| 45 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod03_post0 |
| 65 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod04_post0 |
| 85 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod05_post0 |
| 105 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod06_post0 |
| 125 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod07_post0 |
| 145 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod08_post0 |
| 165 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod09_post0 |
| 185 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod10_post0 |
| 205 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod11_post0 |
Plot the average performance by hyperparameter value. Hint: you can use plot_hyperparameters() from fun_ml.R, specifying hp1 and metric arguments.
plot_hyperparameters(tune_boot_1, hp1 = "neighbors", metric = "accuracy")
Comment on what you see in the hyperparameter plot. Does performance change across values of k? What might this tell you about the model?
Performance is essentially flat across values of k (every mean accuracy rounds to 0.82). With only a single predictor, the model has very little flexibility, so the choice of k makes little difference here.
Print the performance of your best model configuration with the show_best() function.
show_best(tune_boot_1, n = 1)
Warning in show_best(tune_boot_1, n = 1): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 65 accuracy binary 0.817 100 0.000433 pre0_mod04_post0
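As an aside, the warning above appears because show_best() was not told which metric to use; naming the metric explicitly avoids it:

```r
# Specifying the metric explicitly suppresses the warning from show_best()
show_best(tune_boot_1, metric = "accuracy", n = 1)
```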
Feature Set 2
Build a second recipe with the following specifications:
- Use cigs as the outcome variable regressed on all variables.
- Set up the outcome as a factor with the positive class (“yes”) as the second level.
- Set up other variables as factors where needed.
- Apply dummy coding.
- Create an interaction between stressful_event and lag.
- Remove the subid variable with step_rm().
- Take appropriate scaling steps so the features meet KNN's requirements. Hint: you can build on rec_kfold_2 and add step_range(all_numeric_predictors()) for range correction.
rec_boot_2 <- rec_kfold_2 |>
step_range(all_numeric_predictors())
Tune the model with bootstrapping for cross-validation. Use KNN classification as your statistical algorithm, rec_boot_2 as your recipe, hyper_grid as your tuning grid, and accuracy as your metric.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
tune_boot_2 <- cache_rds(
expr = {
nearest_neighbor(neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification") |>
tune_grid(preprocessor = rec_boot_2,
resamples = splits_boot,
grid = hyper_grid,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "tune_boot_2",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates. Remember that you will now see one estimate per hyperparameter value, averaged across bootstraps.
collect_metrics(tune_boot_2, summarize = TRUE) |>
print_kbl()
| neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 5 | accuracy | binary | 0.82 | 100 | 0 | pre0_mod01_post0 |
| 25 | accuracy | binary | 0.85 | 100 | 0 | pre0_mod02_post0 |
| 45 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod03_post0 |
| 65 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod04_post0 |
| 85 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod05_post0 |
| 105 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod06_post0 |
| 125 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod07_post0 |
| 145 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod08_post0 |
| 165 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod09_post0 |
| 185 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod10_post0 |
| 205 | accuracy | binary | 0.86 | 100 | 0 | pre0_mod11_post0 |
Plot the average performance by hyperparameter value.
plot_hyperparameters(tune_boot_2, hp1 = "neighbors", metric = "accuracy")
Comment on what you see in the hyperparameter plot. Does the plot suggest you’ve considered a sufficient range of k-values?
Here we can see a peak and then a slow, continuous decrease in performance. That peak makes us confident that we've considered a sufficient range of k-values.
Print the performance of your best model configuration with the show_best() function.
show_best(tune_boot_2, n = 1)
Warning in show_best(tune_boot_2, n = 1): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 105 accuracy binary 0.858 100 0.000672 pre0_mod06_post0
Part 3: Selecting among model configurations with grouped k-fold cross-validation
In the final part of this assignment, you will use grouped k-fold cross-validation to match the data structure.
Split data
Split your data into 10 folds using the group_vfold_cv() function. Use the grouping argument (group = "subid").
For this example, we asked you to not stratify on cigs. As a note, if you ever want to stratify on your outcome (e.g., cigs) within a grouping variable (e.g., subid), you need to create a new variable capturing the distribution of the outcome within each subject (in this case, the mean number of positive cases of cigs within each subject). This allows the splitting to attempt to assign different subid groups to analysis and assessment sets within each fold in combinations that will yield roughly the same proportion of positive cases on the outcome across all folds.
Try implementing this approach below. You will need to:
- Create a new strata variable that captures the proportion of positive cases of cigs within each subid (hint: group_by(subid), then mutate() with mean(as.numeric(cigs)), then ungroup())
- Use group_vfold_cv() with group = "subid", v = 10, repeats = 1, and strata = "strata"
Don’t forget to set a seed!
data_all <- data_all |>
group_by(subid) |>
mutate(strata = mean(as.numeric(cigs))) |> # creates new strata variable
ungroup()
set.seed(021802)
splits_group_kfold <- group_vfold_cv(data_all,
group = "subid",
v = 10,
repeats = 1,
strata = "strata")
Feature Set 1
Fit the first set of models with grouped k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_1 as your recipe (doesn’t need to change from above!), and accuracy as your metric.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
fits_group_kfold_1 <- cache_rds(
expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_kfold_1,
resamples = splits_group_kfold,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "fits_group_kfold_1",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates on held-out samples.
metrics_group_kfold_1 <- collect_metrics(fits_group_kfold_1,
summarize = FALSE)
metrics_group_kfold_1 |>
print_kbl()
| id | .metric | .estimator | .estimate | .config |
|---|---|---|---|---|
| Resample01 | accuracy | binary | 0.92 | pre0_mod0_post0 |
| Resample02 | accuracy | binary | 0.80 | pre0_mod0_post0 |
| Resample03 | accuracy | binary | 0.78 | pre0_mod0_post0 |
| Resample04 | accuracy | binary | 0.82 | pre0_mod0_post0 |
| Resample05 | accuracy | binary | 0.70 | pre0_mod0_post0 |
| Resample06 | accuracy | binary | 0.84 | pre0_mod0_post0 |
| Resample07 | accuracy | binary | 0.83 | pre0_mod0_post0 |
| Resample08 | accuracy | binary | 0.95 | pre0_mod0_post0 |
| Resample09 | accuracy | binary | 0.76 | pre0_mod0_post0 |
| Resample10 | accuracy | binary | 0.76 | pre0_mod0_post0 |
Plot a histogram of the performance estimates.
metrics_group_kfold_1 |>
plot_hist(".estimate")
Print the average performance over folds with the summarize = TRUE argument.
collect_metrics(fits_group_kfold_1, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.816 10 0.0234 pre0_mod0_post0
Feature Set 2
Fit the second model with grouped k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_2 as your recipe (doesn’t need to change from above!), and accuracy as your metric.
plan(multisession, workers = parallel::detectCores(logical = FALSE))
fits_group_kfold_2 <- cache_rds(
expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_kfold_2,
resamples = splits_group_kfold,
metrics = metric_set(accuracy))
},
dir = "cache/",
file = "fits_group_kfold_2",
rerun = rerun_setting)
plan(sequential)
Examine performance estimates on held-out samples.
metrics_group_kfold_2 <- collect_metrics(fits_group_kfold_2,
summarize = FALSE)
metrics_group_kfold_2 |>
print_kbl()
| id | .metric | .estimator | .estimate | .config |
|---|---|---|---|---|
| Resample01 | accuracy | binary | 0.90 | pre0_mod0_post0 |
| Resample02 | accuracy | binary | 0.87 | pre0_mod0_post0 |
| Resample03 | accuracy | binary | 0.83 | pre0_mod0_post0 |
| Resample04 | accuracy | binary | 0.87 | pre0_mod0_post0 |
| Resample05 | accuracy | binary | 0.79 | pre0_mod0_post0 |
| Resample06 | accuracy | binary | 0.83 | pre0_mod0_post0 |
| Resample07 | accuracy | binary | 0.86 | pre0_mod0_post0 |
| Resample08 | accuracy | binary | 0.94 | pre0_mod0_post0 |
| Resample09 | accuracy | binary | 0.81 | pre0_mod0_post0 |
| Resample10 | accuracy | binary | 0.86 | pre0_mod0_post0 |
Plot a histogram of the performance estimates.
metrics_group_kfold_2 |>
plot_hist(".estimate")
Print the average performance over folds with the summarize = TRUE argument.
collect_metrics(fits_group_kfold_2, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.858 10 0.0141 pre0_mod0_post0
Selecting the best model configuration
Looking back at your two average performance estimates (one from each feature set), which model configuration would you select and why?
We would select model configuration 2 because it has the higher average performance across folds, as well as a lower SE!
Remember, once your best model is chosen and evaluated, you would then want to refit it once on all the data before deploying it for use!
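As a sketch of that final step (assuming feature set 2 with logistic regression is the chosen configuration, per the grouped k-fold results above), the refit might look like:

```r
# Hypothetical final refit of the chosen configuration on all the data.
# We drop the helper strata column, which was created only for splitting.
data_final <- data_all |>
  select(-any_of("strata"))

feat_final <- rec_kfold_2 |>
  prep(data_final) |>
  bake(data_final)

fit_final <- logistic_reg() |>
  set_engine("glm") |>
  fit(cigs ~ ., data = feat_final)
```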
Save, Render, Upload
Save
Save this .qmd file with your last name at the end (e.g., hw_unit_05_resampling_punturieri.qmd). Make sure you changed “Your Name” at the top of the file to be your own name.
Render
Render the file to html and upload !ONLY! your rendered html file to Canvas.