Homework Unit 5: Resampling

Author

HW5 Key

Published

February 23, 2026

Introduction

In this assignment, you will select among model configurations using resampling methods. This includes training models with 2 different feature sets (specified in your recipe) and tuning hyperparameters (k in KNN).


Set up

Load required packages

library(tidyverse) 
library(tidymodels)

Handle conflicts

options(conflicts.policy = "depends.ok")

Load in additional packages (only what you need)

library(future)
library(xfun, include.only = "cache_rds")

Source function scripts (John’s or your own)

source("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")

Specify other global settings

Since we are going to use cache_rds(), we also include rerun_setting <- FALSE in this chunk.

theme_set(theme_bw())
options(tibble.width = Inf, dplyr.print_max=Inf)
rerun_setting <- FALSE

Paths

path_data <- "application_assignments/unit_05"

Read in data

Read in the smoking_ema.csv data file.
Note: you can set column types explicitly (col_types = cols()) or set show_col_types = FALSE if you don't want the read_csv() message, which is redundant with some of the glimpse() output!
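As a sketch of the col_types alternative, you could declare the types up front (the specification below is an assumption based on the data dictionary: txt is character and everything else is a double):

```r
library(tidyverse)

# Sketch: silence the read_csv() message by declaring column types explicitly.
# Assumes txt is the only character column; all others default to double.
data_all <- read_csv(here::here(path_data, "smoking_ema.csv"),
                     col_types = cols(txt = col_character(),
                                      .default = col_double()))
```

Either approach works; the point is just to keep your rendered output clean.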

data_all <- read_csv(here::here(path_data, "smoking_ema.csv"),
                     show_col_types = FALSE) |> 
  glimpse()
Rows: 5,876
Columns: 13
$ subid           <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ craving         <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, …
$ anxious         <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable       <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ sad             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy           <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0, 3.0,…
$ stressful_event <dbl> 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ positive_event  <dbl> 0, 0, 2, 1, 0, 1, 4, 0, 0, 1, 3, 0, 0, 2, 0, 1, 0, 1, …
$ txt             <chr> "nrt", "nrt", "nrt", "nrt", "nrt", "nrt", "nrt", "nrt"…
$ prior_cigs      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lag             <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…

Let's skim the data to check minimums, maximums, and missing values.

data_all |> 
  skim_some()
Data summary
Name data_all
Number of rows 5876
Number of columns 13
_______________________
Column type frequency:
character 1
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
txt 0 1 3 7 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
subid 0 1 11011.00 22149.00
cigs 0 1 0.00 1.00
craving 0 1 0.00 4.00
anxious 0 1 0.00 4.00
irritable 0 1 0.00 4.00
sad 0 1 0.00 4.00
happy 0 1 0.00 4.00
stressful_event 0 1 0.00 4.00
positive_event 0 1 0.00 4.00
prior_cigs 0 1 0.00 1.00
lag 0 1 0.09 108.00
time_since_quit 0 1 0.10 30.03

Next we are going to use the data dictionary to class our variables.
We are doing this on the whole dataset prior to splitting it, so there is no need to write a function (i.e., we are only using this code once!).

valence_labels <- c("not at all", "mildly", "moderately", "very", "extremely")
event_labels <- c("no", "mild", "moderate", "very", "extreme")

data_all <- data_all |> 
  mutate(cigs = factor(cigs, levels = 0:1, labels = c("no", "yes")),
         across(c(craving, anxious, irritable, sad, happy), 
                ~ factor(.x, levels = 0:4, labels = valence_labels)),
         across(contains("_event"),
                ~ factor(.x, levels = 0:4, labels = event_labels)),
         txt = factor(txt, levels = c("placebo", "nrt")),
         prior_cigs = factor(prior_cigs, levels = 0:1, labels = c("no", "yes"))) |> 
  glimpse()
Rows: 5,876
Columns: 13
$ subid           <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs            <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ craving         <fct> moderately, not at all, not at all, mildly, not at all…
$ anxious         <fct> not at all, moderately, not at all, not at all, not at…
$ irritable       <fct> mildly, not at all, not at all, not at all, very, mild…
$ sad             <fct> not at all, not at all, not at all, not at all, not at…
$ happy           <fct> not at all, not at all, very, moderately, moderately, …
$ stressful_event <fct> no, moderate, no, no, no, moderate, no, no, no, no, no…
$ positive_event  <fct> no, no, moderate, mild, no, mild, extreme, no, no, mil…
$ txt             <fct> nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt,…
$ prior_cigs      <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ lag             <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…

Let's skim again to check our new variables.

data_all |> 
  skim_some()
Data summary
Name data_all
Number of rows 5876
Number of columns 13
_______________________
Column type frequency:
factor 10
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cigs 0 1.00 FALSE 2 no: 4798, yes: 1078
craving 71 0.99 FALSE 5 mil: 1766, not: 1669, mod: 1344, ver: 594
anxious 56 0.99 FALSE 5 not: 2913, mil: 1829, mod: 788, ver: 186
irritable 61 0.99 FALSE 5 not: 3131, mil: 1679, mod: 649, ver: 234
sad 35 0.99 FALSE 5 not: 4176, mil: 1081, mod: 380, ver: 130
happy 79 0.99 FALSE 5 mil: 1792, mod: 1779, not: 1203, ver: 827
stressful_event 0 1.00 FALSE 5 no: 3776, mil: 1057, mod: 620, ver: 264
positive_event 0 1.00 FALSE 5 no: 3108, mil: 1228, mod: 1006, ver: 401
txt 0 1.00 FALSE 2 nrt: 2983, pla: 2893
prior_cigs 0 1.00 FALSE 2 no: 4783, yes: 1093

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
subid 0 1 11011.00 22149.00
lag 0 1 0.09 108.00
time_since_quit 0 1 0.10 30.03

Hmm, now we have missing data. Let's investigate.

If we re-read in the raw data, we can look at the different values of our factor variables.

data_all <- read_csv(here::here(path_data, "smoking_ema.csv"),
                     show_col_types = FALSE) 

data_all |>
  select(craving:positive_event) |> 
  map(janitor::tabyl)
$craving
   .x[[i]]    n      percent
 0.0000000 1669 0.2840367597
 0.5000000   16 0.0027229408
 0.6666667    1 0.0001701838
 0.7142857    1 0.0001701838
 0.7500000    1 0.0001701838
 1.0000000 1766 0.3005445882
 1.1666667    1 0.0001701838
 1.2500000    1 0.0001701838
 1.3000000    1 0.0001701838
 1.3333333    1 0.0001701838
 1.4000000    1 0.0001701838
 1.5000000   21 0.0035738598
 1.6000000    1 0.0001701838
 1.6666667    4 0.0006807352
 1.7500000    1 0.0001701838
 1.8333333    1 0.0001701838
 2.0000000 1344 0.2287270252
 2.3333333    1 0.0001701838
 2.5000000   12 0.0020422056
 2.6666667    1 0.0001701838
 3.0000000  594 0.1010891763
 3.3333333    2 0.0003403676
 3.5000000    3 0.0005105514
 4.0000000  432 0.0735194010

$anxious
   .x[[i]]    n      percent
 0.0000000 2913 0.4957454050
 0.3333333    4 0.0006807352
 0.5000000   16 0.0027229408
 0.7500000    1 0.0001701838
 1.0000000 1829 0.3112661675
 1.2000000    1 0.0001701838
 1.3333333    2 0.0003403676
 1.4000000    1 0.0001701838
 1.5000000   22 0.0037440436
 1.8333333    1 0.0001701838
 2.0000000  788 0.1341048332
 2.3333333    1 0.0001701838
 2.5000000    4 0.0006807352
 3.0000000  186 0.0316541865
 3.3333333    1 0.0001701838
 3.5000000    2 0.0003403676
 4.0000000  104 0.0176991150

$irritable
   .x[[i]]    n      percent
 0.0000000 3131 0.5328454731
 0.2500000    1 0.0001701838
 0.3333333    3 0.0005105514
 0.5000000   24 0.0040844112
 0.6666667    2 0.0003403676
 0.8333333    1 0.0001701838
 0.9000000    1 0.0001701838
 1.0000000 1679 0.2857385977
 1.3333333    1 0.0001701838
 1.5000000   15 0.0025527570
 1.6000000    1 0.0001701838
 1.7500000    1 0.0001701838
 2.0000000  649 0.1104492852
 2.5000000    7 0.0011912866
 2.6666667    1 0.0001701838
 3.0000000  234 0.0398230088
 3.3333333    1 0.0001701838
 3.5000000    2 0.0003403676
 4.0000000  122 0.0207624234

$sad
   .x[[i]]    n      percent
 0.0000000 4176 0.7106875425
 0.1666667    1 0.0001701838
 0.5000000   19 0.0032334922
 0.6666667    1 0.0001701838
 1.0000000 1081 0.1839686862
 1.5000000    7 0.0011912866
 1.6666667    2 0.0003403676
 2.0000000  380 0.0646698434
 2.4000000    1 0.0001701838
 2.5000000    2 0.0003403676
 3.0000000  130 0.0221238938
 3.3333333    1 0.0001701838
 3.5000000    1 0.0001701838
 4.0000000   74 0.0125936011

$happy
   .x[[i]]    n      percent
 0.0000000 1203 0.2047311096
 0.3333333    1 0.0001701838
 0.5000000   18 0.0030633084
 0.6666667    2 0.0003403676
 0.7500000    1 0.0001701838
 1.0000000 1792 0.3049693669
 1.3333333    3 0.0005105514
 1.4000000    1 0.0001701838
 1.5000000   25 0.0042545950
 1.6666667    3 0.0005105514
 2.0000000 1779 0.3027569775
 2.3333333    2 0.0003403676
 2.5000000   20 0.0034036760
 2.7500000    1 0.0001701838
 3.0000000  827 0.1407420014
 3.3333333    2 0.0003403676
 4.0000000  196 0.0333560245

$stressful_event
 .x[[i]]    n    percent
       0 3776 0.64261402
       1 1057 0.17988428
       2  620 0.10551396
       3  264 0.04492852
       4  159 0.02705922

$positive_event
 .x[[i]]    n    percent
       0 3108 0.52893125
       1 1228 0.20898570
       2 1006 0.17120490
       3  401 0.06824370
       4  133 0.02263445

So now we see that some of our ordinal variables have values that are not in the data dictionary (i.e., they are not 0, 1, 2, 3, or 4).

What should we do? Are these errors that need to be corrected? Well, that's up to you, the data scientist, to figure out! If you had access to the researchers who collected these data, you might start by discussing it with them. Otherwise, you may need to decide on your own how to proceed.

Some possibilities include, but are not limited to:

  • Treat the real (non-integer) numbers as errors, set them to NA, and add a step in your recipe to impute them.
  • Keep the variables in their raw numeric form.
  • Round the variables to the nearest whole number and then convert to factor.
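As a sketch of the first option (treating non-integer values as errors), you might set them to NA and then impute later in your recipe; step_impute_median() is one reasonable choice, though the specific imputation step is an assumption, not a requirement:

```r
# Sketch of option 1: flag non-integer scores as missing (NA)
data_na <- data_all |>
  mutate(across(craving:happy, ~ if_else(.x == round(.x), .x, NA_real_)))

# Then, in your recipe, impute the missing values, e.g.:
# step_impute_median(all_numeric_predictors())
```

Remember that imputation belongs in the recipe (not the raw data) so it is re-estimated inside each resample.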

For example, you could round with the following code:

data_round <- data_all |> 
  mutate(across(craving:happy, round))

data_round |>
  select(craving:happy) |> 
  map(janitor::tabyl)
$craving
 .x[[i]]    n    percent
       0 1685 0.28675970
       1 1774 0.30190606
       2 1385 0.23570456
       3  597 0.10159973
       4  435 0.07402995

$anxious
 .x[[i]]    n    percent
       0 2933 0.49914908
       1 1834 0.31211709
       2  816 0.13886998
       3  187 0.03182437
       4  106 0.01803948

$irritable
 .x[[i]]    n    percent
       0 3159 0.53761062
       1 1684 0.28658952
       2  673 0.11453370
       3  236 0.04016338
       4  124 0.02110279

$sad
 .x[[i]]    n    percent
       0 4196 0.71409122
       1 1082 0.18413887
       2  392 0.06671205
       3  131 0.02229408
       4   75 0.01276378

$happy
 .x[[i]]    n    percent
       0 1222 0.20796460
       1 1799 0.30616065
       2 1829 0.31126617
       3  830 0.14125255
       4  196 0.03335602

There might be no single right answer in cases like this, and you have to make your best judgment call. Luckily, we know the people who collected these data. It turns out these scores were averaged across multiple survey responses, which introduced the non-integer values. Since we know these values are valid, we are going to proceed by treating them as numeric variables in our models. We will still keep stressful_event and positive_event as ordinal factor variables, since they have no non-integer values, and we will class our other character variables as before.

data_all <- data_all |> 
  mutate(cigs = factor(cigs, levels = 0:1, labels = c("no", "yes")),
         across(contains("_event"),
                ~ factor(.x, levels = 0:4, labels = event_labels)),
         txt = factor(txt, levels = c("placebo", "nrt")),
         prior_cigs = factor(prior_cigs, levels = 0:1, labels = c("no", "yes"))) |> 
  glimpse()
Rows: 5,876
Columns: 13
$ subid           <dbl> 11011, 11011, 11011, 11011, 11011, 11011, 11011, 11011…
$ cigs            <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ craving         <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, …
$ anxious         <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable       <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ sad             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy           <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0, 3.0,…
$ stressful_event <fct> no, moderate, no, no, no, moderate, no, no, no, no, no…
$ positive_event  <fct> no, no, moderate, mild, no, mild, extreme, no, no, mil…
$ txt             <fct> nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt, nrt,…
$ prior_cigs      <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no…
$ lag             <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.152222, 3.4…
$ time_since_quit <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.6630208,…

Let's skim the data one more time to be sure everything looks good.

data_all |> 
  skim_some()
Data summary
Name data_all
Number of rows 5876
Number of columns 13
_______________________
Column type frequency:
factor 5
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cigs 0 1 FALSE 2 no: 4798, yes: 1078
stressful_event 0 1 FALSE 5 no: 3776, mil: 1057, mod: 620, ver: 264
positive_event 0 1 FALSE 5 no: 3108, mil: 1228, mod: 1006, ver: 401
txt 0 1 FALSE 2 nrt: 2983, pla: 2893
prior_cigs 0 1 FALSE 2 no: 4783, yes: 1093

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
subid 0 1 11011.00 22149.00
craving 0 1 0.00 4.00
anxious 0 1 0.00 4.00
irritable 0 1 0.00 4.00
sad 0 1 0.00 4.00
happy 0 1 0.00 4.00
lag 0 1 0.09 108.00
time_since_quit 0 1 0.10 30.03

Part 1: Selecting among model configurations with k-fold cross-validation

Split data

Split your data into 10 folds, stratified on cigs. Don’t forget to set a seed!

set.seed(021802)

splits_kfold <- data_all |> 
  vfold_cv(v = 10, repeats = 1, strata = "cigs")

splits_kfold
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [5288/588]> Fold01
 2 <split [5288/588]> Fold02
 3 <split [5288/588]> Fold03
 4 <split [5288/588]> Fold04
 5 <split [5288/588]> Fold05
 6 <split [5288/588]> Fold06
 7 <split [5288/588]> Fold07
 8 <split [5288/588]> Fold08
 9 <split [5290/586]> Fold09
10 <split [5290/586]> Fold10

Feature Set 1

Build a recipe with the following specifications:

  • Use cigs as the outcome variable regressed on craving.
  • Set up the outcome as a factor with the positive class (“yes”) as the second level.
  • Use craving as a feature by simply transforming this ordinal predictor to numeric based on the order of its scores.

Note that we already set up our outcome variable as a factor with the positive class as the second level when we classed our data, so we aren't repeating that here. Also, because we decided to keep craving as a numeric variable, we don't need any additional manipulations either.

rec_kfold_1 <- recipe(cigs ~ craving, data = data_all) 

As a double-check, you can prep and bake your recipe on the full data set to see if the variables look correct.

rec_prep <- rec_kfold_1 |> 
  prep(data_all)

feat_all <- rec_prep |> 
  bake(data_all) 

feat_all |> 
  glimpse()
Rows: 5,876
Columns: 2
$ craving <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 4, 2, 1, 0,…
$ cigs    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…

Fit the model with k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_1 as your recipe, and accuracy as your metric.

We are also going to use cache_rds() as we fit our models. Hint: wrap your fit_resamples() call inside cache_rds(expr = { ... }, dir = "cache/", file = "meaningful_name", rerun = rerun_setting). See the web book for an example.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

fits_kfold_1 <- cache_rds(
  expr = {
    logistic_reg() |> 
      set_engine("glm") |> 
      fit_resamples(preprocessor = rec_kfold_1,
                  resamples = splits_kfold,
                  metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "fits_kfold_1",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates. Use the collect_metrics() function to make a table of the held-out performance estimates from the 10 folds.

metrics_kfold_1 <- collect_metrics(fits_kfold_1, summarize = FALSE)

metrics_kfold_1
# A tibble: 10 × 5
   id     .metric  .estimator .estimate .config        
   <chr>  <chr>    <chr>          <dbl> <chr>          
 1 Fold01 accuracy binary         0.816 pre0_mod0_post0
 2 Fold02 accuracy binary         0.816 pre0_mod0_post0
 3 Fold03 accuracy binary         0.816 pre0_mod0_post0
 4 Fold04 accuracy binary         0.816 pre0_mod0_post0
 5 Fold05 accuracy binary         0.816 pre0_mod0_post0
 6 Fold06 accuracy binary         0.816 pre0_mod0_post0
 7 Fold07 accuracy binary         0.816 pre0_mod0_post0
 8 Fold08 accuracy binary         0.816 pre0_mod0_post0
 9 Fold09 accuracy binary         0.817 pre0_mod0_post0
10 Fold10 accuracy binary         0.817 pre0_mod0_post0

Plot a histogram of the performance estimates. Hint: you can use plot_hist() from fun_plots.R.

metrics_kfold_1 |> plot_hist(".estimate")

Print the average performance over folds with the summarize = TRUE argument.

collect_metrics(fits_kfold_1, summarize = TRUE)
# A tibble: 1 × 6
  .metric  .estimator  mean     n  std_err .config        
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy binary     0.817    10 0.000144 pre0_mod0_post0

Feature Set 2

Build a second recipe with the following specifications:

  • Use cigs as the outcome variable regressed on all variables.
  • Set up the outcome as a factor with the positive class (“yes”) as the second level.
  • Set up other variables as factors where needed.
  • Apply dummy coding.
  • Create an interaction between stressful_event and lag.
  • Remove the subid variable with step_rm()

Note that we already classed our character variables as factors when we read in the data, so we don't need to do this again.

rec_kfold_2 <- recipe(cigs ~ ., data = data_all) |> 
  step_mutate_at(c(all_factor_predictors(), -txt, -prior_cigs), fn = as.numeric) |> 
  step_dummy(txt, prior_cigs) |> 
  step_interact(~ stressful_event:lag) |> 
  step_rm(subid)

Again, check that your recipe is creating features as you expect.

rec_prep <- rec_kfold_2 |> 
  prep(data_all)

feat_all <- rec_prep |> 
  bake(data_all) 

feat_all |> 
  glimpse()
Rows: 5,876
Columns: 13
$ craving               <dbl> 2, 0, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 0, …
$ anxious               <dbl> 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ irritable             <dbl> 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0…
$ sad                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happy                 <dbl> 0.0, 0.0, 3.0, 2.0, 2.0, 1.0, 4.0, 3.0, 3.0, 2.0…
$ stressful_event       <dbl> 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ positive_event        <dbl> 1, 1, 3, 2, 1, 2, 5, 1, 1, 2, 4, 1, 1, 3, 1, 2, …
$ lag                   <dbl> 6.069444, 3.306389, 4.544444, 11.491111, 6.15222…
$ time_since_quit       <dbl> 0.6007639, 0.7385301, 0.9278819, 1.4066782, 1.66…
$ cigs                  <fct> no, no, no, no, no, no, no, no, no, no, no, no, …
$ txt_nrt               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ prior_cigs_yes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stressful_event_x_lag <dbl> 6.069444, 9.919167, 4.544444, 11.491111, 6.15222…

Fit the model with k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_2 as your recipe, and accuracy as your metric.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

fits_kfold_2 <- cache_rds(
  expr = {
    logistic_reg() |> 
      set_engine("glm") |> 
      fit_resamples(preprocessor = rec_kfold_2,
                  resamples = splits_kfold,
                  metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "fits_kfold_2",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates. Use the collect_metrics() function to make a table of the held-out performance estimates.

metrics_kfold_2 <- collect_metrics(fits_kfold_2, summarize = FALSE)
metrics_kfold_2
# A tibble: 10 × 5
   id     .metric  .estimator .estimate .config        
   <chr>  <chr>    <chr>          <dbl> <chr>          
 1 Fold01 accuracy binary         0.861 pre0_mod0_post0
 2 Fold02 accuracy binary         0.866 pre0_mod0_post0
 3 Fold03 accuracy binary         0.876 pre0_mod0_post0
 4 Fold04 accuracy binary         0.850 pre0_mod0_post0
 5 Fold05 accuracy binary         0.838 pre0_mod0_post0
 6 Fold06 accuracy binary         0.866 pre0_mod0_post0
 7 Fold07 accuracy binary         0.888 pre0_mod0_post0
 8 Fold08 accuracy binary         0.854 pre0_mod0_post0
 9 Fold09 accuracy binary         0.863 pre0_mod0_post0
10 Fold10 accuracy binary         0.848 pre0_mod0_post0

Plot a histogram of the performance estimates.

metrics_kfold_2 |> plot_hist(".estimate")

Print the average performance over folds with the summarize = TRUE argument.

collect_metrics(fits_kfold_2, summarize = TRUE)
# A tibble: 1 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.861    10 0.00451 pre0_mod0_post0

Selecting the best model configuration

Looking back at your two average performance estimates (one from each feature set), which model configuration would you select and why?

Looking back at our two average performance estimates (one from each feature set), we would select model configuration 2. It has higher average performance across held-out folds (0.861 vs. 0.817), and although its standard error is larger than configuration 1's, it is still quite low. The accuracy gain suggests configuration 2 has meaningfully lower bias.


Part 2: Tuning hyperparameters with bootstrap resampling

Split data

Split your data using bootstrap resampling. Stratify on cigs and use 100 bootstraps. Don't forget to set a seed!

set.seed(021802)

splits_boot <- data_all |> 
  bootstraps(times = 100, strata = "cigs")

splits_boot |> print(n=5)
# Bootstrap sampling using stratification 
# A tibble: 100 × 2
  splits              id          
  <list>              <chr>       
1 <split [5876/2129]> Bootstrap001
2 <split [5876/2158]> Bootstrap002
3 <split [5876/2176]> Bootstrap003
4 <split [5876/2219]> Bootstrap004
5 <split [5876/2165]> Bootstrap005
# ℹ 95 more rows

Set up hyperparameter grid

Create a tibble with all values of the hyperparameter (k) you will consider. Remember that this dataset has a lot of observations in it (~6000), so starting with some higher k-values than we’re used to testing may be helpful.

hyper_grid <- expand.grid(neighbors = seq(5, 205, by = 20))

Feature Set 1

Build a recipe with the following specifications:

  • Use cigs as the outcome variable regressed on craving.
  • Set up the outcome as a factor with the positive class (“yes”) as the second level.
  • Use the same numeric transformation of craving as before

This is the same recipe as earlier, so you can just re-assign it here (e.g., rec_boot_1 <- rec_kfold_1). Also remember that even though this recipe will be used in a KNN model, we don't have to scale or range correct when there is only one predictor, because rescaling a single feature does not change the ordering of distances between observations!

rec_boot_1 <- rec_kfold_1 

Tune the model with bootstrapping for cross-validation. Use KNN classification as your statistical algorithm, rec_boot_1 as your recipe, hyper_grid as your tuning grid, and accuracy as your metric. Hint: use tune_grid() instead of fit_resamples(), and specify neighbors = tune() in your algorithm. Don’t forget to pass grid = hyper_grid.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

tune_boot_1 <- cache_rds(
  expr = {
    nearest_neighbor(neighbors = tune()) |> 
      set_engine("kknn") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec_boot_1,
              resamples = splits_boot,
              grid = hyper_grid,
              metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "tune_boot_1",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates across the OOB held-out sets. Remember that you will now see one estimate per hyperparameter value, averaged across bootstraps.

collect_metrics(tune_boot_1, summarize = TRUE) |>  
  print_kbl()
neighbors .metric .estimator mean n std_err .config
5 accuracy binary 0.82 100 0 pre0_mod01_post0
25 accuracy binary 0.82 100 0 pre0_mod02_post0
45 accuracy binary 0.82 100 0 pre0_mod03_post0
65 accuracy binary 0.82 100 0 pre0_mod04_post0
85 accuracy binary 0.82 100 0 pre0_mod05_post0
105 accuracy binary 0.82 100 0 pre0_mod06_post0
125 accuracy binary 0.82 100 0 pre0_mod07_post0
145 accuracy binary 0.82 100 0 pre0_mod08_post0
165 accuracy binary 0.82 100 0 pre0_mod09_post0
185 accuracy binary 0.82 100 0 pre0_mod10_post0
205 accuracy binary 0.82 100 0 pre0_mod11_post0

Plot the average performance by hyperparameter value. Hint: you can use plot_hyperparameters() from fun_ml.R, specifying hp1 and metric arguments.

plot_hyperparameters(tune_boot_1, hp1 = "neighbors", metric = "accuracy")

Comment on what you see in the hyperparameter plot. Does performance change across values of k? What might this tell you about the model?

Accuracy is essentially flat (~0.82) across all values of k, which matches the base rate of the negative class (4798/5876 ≈ 0.82). This suggests that craving alone carries little predictive signal; the model is largely predicting the majority class regardless of k.

Print the performance of your best model configuration with the show_best() function.

show_best(tune_boot_1, n = 1)
Warning in show_best(tune_boot_1, n = 1): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 7
  neighbors .metric  .estimator  mean     n  std_err .config         
      <dbl> <chr>    <chr>      <dbl> <int>    <dbl> <chr>           
1        65 accuracy binary     0.817   100 0.000433 pre0_mod04_post0

Feature Set 2

Build a second recipe with the following specifications:

  • Use cigs as the outcome variable regressed on all variables.
  • Set up the outcome as a factor with the positive class (“yes”) as the second level.
  • Set up other variables as factors where needed.
  • Apply dummy coding.
  • Create an interaction between stressful_event and lag.
  • Remove the subid variable with step_rm()
  • Take appropriate scaling steps so the features meet KNN's requirements. Hint: you can build on rec_kfold_2 and add step_range(all_numeric_predictors()) for range correction.

rec_boot_2 <- rec_kfold_2 |> 
  step_range(all_numeric_predictors())

Tune the model with bootstrapping for cross-validation. Use KNN classification as your statistical algorithm, rec_boot_2 as your recipe, hyper_grid as your tuning grid, and accuracy as your metric.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

tune_boot_2 <- cache_rds(
  expr = {
    nearest_neighbor(neighbors = tune()) |> 
      set_engine("kknn") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec_boot_2,
              resamples = splits_boot,
              grid = hyper_grid,
              metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "tune_boot_2",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates. Remember that you will now see one estimate per hyperparameter value, averaged across bootstraps.

collect_metrics(tune_boot_2, summarize = TRUE) |>  
  print_kbl()
neighbors .metric .estimator mean n std_err .config
5 accuracy binary 0.82 100 0 pre0_mod01_post0
25 accuracy binary 0.85 100 0 pre0_mod02_post0
45 accuracy binary 0.86 100 0 pre0_mod03_post0
65 accuracy binary 0.86 100 0 pre0_mod04_post0
85 accuracy binary 0.86 100 0 pre0_mod05_post0
105 accuracy binary 0.86 100 0 pre0_mod06_post0
125 accuracy binary 0.86 100 0 pre0_mod07_post0
145 accuracy binary 0.86 100 0 pre0_mod08_post0
165 accuracy binary 0.86 100 0 pre0_mod09_post0
185 accuracy binary 0.86 100 0 pre0_mod10_post0
205 accuracy binary 0.86 100 0 pre0_mod11_post0

Plot the average performance by hyperparameter value.

plot_hyperparameters(tune_boot_2, hp1 = "neighbors", metric = "accuracy")

Comment on what you see in the hyperparameter plot. Does the plot suggest you’ve considered a sufficient range of k-values?

Here we can see a peak and then slow, continuous decrease in performance. That peak makes us feel confident that we’ve considered a sufficient range of k-values.

Print the performance of your best model configuration with the show_best() function.

show_best(tune_boot_2, n = 1)
Warning in show_best(tune_boot_2, n = 1): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 7
  neighbors .metric  .estimator  mean     n  std_err .config         
      <dbl> <chr>    <chr>      <dbl> <int>    <dbl> <chr>           
1       105 accuracy binary     0.858   100 0.000672 pre0_mod06_post0

Part 3: Selecting among model configurations with grouped k-fold cross-validation

In the final part of this assignment, you will use grouped k-fold cross-validation to account for the grouped structure of the data (repeated observations within each subid).

Split data

Split your data into 10 folds using the group_vfold_cv() function. Use the grouping argument (group = "subid").

For this example, we asked you not to stratify on cigs. As a note, if you ever want to stratify on your outcome (e.g., cigs) within a grouping variable (e.g., subid), you need to create a new variable capturing the distribution of the outcome within each subject (in this case, the proportion of positive cases of cigs within each subject). This allows the splitting to assign subid groups to analysis and assessment sets in combinations that yield roughly the same proportion of positive cases on the outcome across all folds.

Try implementing this approach below. You will need to:

  1. Create a new strata variable that captures the proportion of positive cases of cigs within each subid (hint: group_by(subid), then mutate() with mean(as.numeric(cigs)), then ungroup())
  2. Use group_vfold_cv() with group = "subid", v = 10, repeats = 1, and strata = "strata"

Don’t forget to set a seed!

data_all <- data_all |> 
  group_by(subid) |> 
  mutate(strata = mean(as.numeric(cigs))) |> # creates new strata variable
  ungroup()

set.seed(021802)

splits_group_kfold <- group_vfold_cv(data_all,
                                     group = "subid",
                                     v = 10,
                                     repeats = 1,
                                     strata = "strata")

Feature Set 1

Fit the first set of models with grouped k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_1 as your recipe (doesn’t need to change from above!), and accuracy as your metric.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

fits_group_kfold_1 <- cache_rds(
  expr = {
    logistic_reg() |> 
      set_engine("glm") |> 
      fit_resamples(preprocessor = rec_kfold_1,
                  resamples = splits_group_kfold,
                  metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "fits_group_kfold_1",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates on held out samples.

metrics_group_kfold_1 <- collect_metrics(fits_group_kfold_1,
                                         summarize = FALSE)
metrics_group_kfold_1 |> 
  print_kbl()
id .metric .estimator .estimate .config
Resample01 accuracy binary 0.92 pre0_mod0_post0
Resample02 accuracy binary 0.80 pre0_mod0_post0
Resample03 accuracy binary 0.78 pre0_mod0_post0
Resample04 accuracy binary 0.82 pre0_mod0_post0
Resample05 accuracy binary 0.70 pre0_mod0_post0
Resample06 accuracy binary 0.84 pre0_mod0_post0
Resample07 accuracy binary 0.83 pre0_mod0_post0
Resample08 accuracy binary 0.95 pre0_mod0_post0
Resample09 accuracy binary 0.76 pre0_mod0_post0
Resample10 accuracy binary 0.76 pre0_mod0_post0

Plot a histogram of the performance estimates.

metrics_group_kfold_1 |> 
  plot_hist(".estimate")

Print the average performance over folds with the summarize = TRUE argument.

collect_metrics(fits_group_kfold_1, summarize = TRUE)
# A tibble: 1 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.816    10  0.0234 pre0_mod0_post0

Feature Set 2

Fit the second model with grouped k-fold cross-validation. Use logistic regression as your statistical algorithm, rec_kfold_2 as your recipe (doesn’t need to change from above!), and accuracy as your metric.

plan(multisession, workers = parallel::detectCores(logical = FALSE))

fits_group_kfold_2 <- cache_rds(
  expr = {
    logistic_reg() |> 
      set_engine("glm") |> 
      fit_resamples(preprocessor = rec_kfold_2,
                  resamples = splits_group_kfold,
                  metrics = metric_set(accuracy))
  }, 
  dir = "cache/",
  file = "fits_group_kfold_2",
  rerun = rerun_setting)

plan(sequential)

Examine performance estimates on held out samples.

metrics_group_kfold_2 <- collect_metrics(fits_group_kfold_2,
                                         summarize = FALSE)
metrics_group_kfold_2 |> 
  print_kbl()
id .metric .estimator .estimate .config
Resample01 accuracy binary 0.90 pre0_mod0_post0
Resample02 accuracy binary 0.87 pre0_mod0_post0
Resample03 accuracy binary 0.83 pre0_mod0_post0
Resample04 accuracy binary 0.87 pre0_mod0_post0
Resample05 accuracy binary 0.79 pre0_mod0_post0
Resample06 accuracy binary 0.83 pre0_mod0_post0
Resample07 accuracy binary 0.86 pre0_mod0_post0
Resample08 accuracy binary 0.94 pre0_mod0_post0
Resample09 accuracy binary 0.81 pre0_mod0_post0
Resample10 accuracy binary 0.86 pre0_mod0_post0

Plot a histogram of the performance estimates.

metrics_group_kfold_2 |> 
  plot_hist(".estimate")

Print the average performance over folds with the summarize = TRUE argument.

collect_metrics(fits_group_kfold_2, summarize = TRUE)
# A tibble: 1 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.858    10  0.0141 pre0_mod0_post0

Selecting the best model configuration

Looking back at your two average performance estimates (one from each feature set), which model configuration would you select and why?

We would select model configuration 2 because it has the higher average performance across folds, as well as a lower SE!

Remember, once your best model is chosen and evaluated you would then want to refit it once on all the data before deploying it for use!
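As a sketch only (assuming, per the selection above, the logistic regression with rec_kfold_2 was chosen), the final refit on all the data might look like this, following the same prep/bake pattern used earlier:

```r
# Sketch: refit the selected configuration (feature set 2, logistic regression)
# on ALL the data before deployment.
# Note: drop the strata helper column created in Part 3 first.
feat_final <- rec_kfold_2 |> 
  prep(data_all |> select(-strata)) |> 
  bake(data_all |> select(-strata))

fit_final <- logistic_reg() |> 
  set_engine("glm") |> 
  fit(cigs ~ ., data = feat_final)
```

From here, fit_final could be used with predict() on new EMA observations processed through the same prepped recipe.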


Save, Render, Upload

Save

Save this .qmd file with your last name at the end (e.g., hw_unit_05_resampling_punturieri.qmd). Make sure you changed “Your Name” at the top of the file to be your own name.

Render

Render the file to html and upload !ONLY! your rendered html file to Canvas.