Homework Unit 4 (Classification): Fit KNN Models

Author

TA Key

Published

February 17, 2026

Introduction

This file serves as the answer key for the KNN portion of the Unit_04 homework. Unit 4 Exploratory Introduction to Classification Models on the course website contains all materials required for this assignment.

In this assignment, we demonstrate how to fit multiple configurations of KNN and RDA classification models using data from the Titanic data set. We evaluate these configurations in our validation set and select the top-performing model out of all the KNN and RDA models we fit. We then use this best model to generate predictions for our held-out test set, which we use only ONCE, to evaluate our final best model.


Set up

Load tidy packages

library(tidyverse) 
library(tidymodels)

Handle conflicts

options(conflicts.policy = "depends.ok")

Source function scripts (John’s or your own)

source("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
source("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")

Specify other global settings

theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max = Inf)

Paths

path_data <- "application_assignments/unit_04"

Load data

Read in the data files for the cleaned training, validation, and test data (generated in hw_unit_04_eda_cleaning.qmd). Use here::here() and a relative path for your data. Make sure your iaml project is open!

data_trn <- read_csv(here::here(path_data, "titanic_train_cln.csv"),
                     col_types = cols()) |> 
  glimpse()
Rows: 736
Columns: 12
$ passenger_id   <dbl> 3, 4, 11, 16, 17, 26, 39, 40, 41, 46, 53, 54, 59, 61, 6…
$ survived       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ pclass         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <chr> "Allison, Miss. Helen Loraine", "Allison, Mr. Hudson Jo…
$ sex            <chr> "female", "male", "male", "male", "male", "male", "male…
$ age            <dbl> 2, 30, 47, NA, 24, 25, 41, 48, NA, 45, 28, 17, 49, 36, …
$ n_sib_spouse   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
$ n_parent_child <dbl> 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ ticket         <chr> "113781", "113781", "PC 17757", "PC 17318", "PC 17558",…
$ fare           <dbl> 151.5500, 151.5500, 227.5250, 25.9250, 247.5208, 26.000…
$ cabin          <chr> "C22 C26", "C22 C26", "C62 C64", NA, "B58 B60", NA, "A2…
$ embarked       <chr> "s", "s", "c", "s", "c", "c", "s", "c", "c", "s", "s", …
data_val <- read_csv(here::here(path_data, "titanic_val_cln.csv"),
                     col_types = cols()) |> 
  glimpse()
Rows: 246
Columns: 12
$ passenger_id   <dbl> 2, 12, 18, 20, 21, 31, 42, 51, 52, 84, 85, 97, 101, 103…
$ survived       <chr> "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "…
$ pclass         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <chr> "Allison, Master. Hudson Trevor", "Astor, Mrs. John Jac…
$ sex            <chr> "male", "female", "female", "male", "male", "male", "fe…
$ age            <dbl> 0.9167, 18.0000, 50.0000, 36.0000, 37.0000, 45.0000, 44…
$ n_sib_spouse   <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 3, 0…
$ n_parent_child <dbl> 2, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0…
$ ticket         <chr> "113781", "PC 17757", "PC 17558", "13050", "11751", "11…
$ fare           <dbl> 151.5500, 227.5250, 247.5208, 75.2417, 52.5542, 35.5000…
$ cabin          <chr> "C22 C26", "C62 C64", "B58 B60", "C6", "D35", "T", "B4"…
$ embarked       <chr> "s", "c", "c", "c", "s", "s", "c", "c", "s", "s", "c", …
data_test <- read_csv(here::here(path_data, "titanic_test_cln.csv"),
                      col_types = cols()) |> 
  glimpse()
Rows: 327
Columns: 12
$ passenger_id   <dbl> 1, 5, 8, 9, 10, 15, 24, 25, 29, 33, 35, 45, 47, 49, 50,…
$ survived       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pclass         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <chr> "Allen, Miss. Elisabeth Walton", "Allison, Mrs. Hudson …
$ sex            <chr> "female", "female", "male", "female", "male", "male", "…
$ age            <dbl> 29, 25, 39, 53, 71, 80, 42, 29, 35, 30, 42, 41, NA, 53,…
$ n_sib_spouse   <dbl> 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0…
$ n_parent_child <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0…
$ ticket         <chr> "24160", "113781", "112050", "11769", "PC 17609", "2704…
$ fare           <dbl> 211.3375, 151.5500, 0.0000, 51.4792, 49.5042, 30.0000, …
$ cabin          <chr> "B5", "C22 C26", "A36", "C101", NA, "A23", NA, "C97", "…
$ embarked       <chr> "s", "s", "s", "s", "c", "s", "c", "s", "s", "s", "s", …

Set appropriate variable classes (remember character classes should be factors).

Remember:

- Nominal and ordinal variables are set to factors
- Interval and ratio variables are set to numeric

Remember ordinal variables like pclass should have the factor levels explicitly stated to retain the order.

It also helps to class survived as a factor because otherwise it will be read as logical in the test set, where the labels are all NA (remember: it is correct that these are NA; the TAs keep the labeled test set to themselves for competition-scoring purposes).

We will handle this by writing a simple function. This is probably not necessary since classing is so quick, but it's good practice! We set the level order explicitly for the factors we care about, then convert the remaining character variables using across().

class_titanic <- function(d){
  d |> 
    mutate(survived = factor(survived, levels = c("no", "yes")),
           pclass = factor(pclass, levels = c("1", "2", "3")),
           across(where(is.character), factor))
}

Now, let’s apply it to all three sets.

data_trn <- data_trn |>
  class_titanic() |> 
  glimpse()
Rows: 736
Columns: 12
$ passenger_id   <dbl> 3, 4, 11, 16, 17, 26, 39, 40, 41, 46, 53, 54, 59, 61, 6…
$ survived       <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allison, Miss. Helen Loraine", "Allison, Mr. Hudson Jo…
$ sex            <fct> female, male, male, male, male, male, male, male, male,…
$ age            <dbl> 2, 30, 47, NA, 24, 25, 41, 48, NA, 45, 28, 17, 49, 36, …
$ n_sib_spouse   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
$ n_parent_child <dbl> 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ ticket         <fct> 113781, 113781, PC 17757, PC 17318, PC 17558, 13905, 11…
$ fare           <dbl> 151.5500, 151.5500, 227.5250, 25.9250, 247.5208, 26.000…
$ cabin          <fct> C22 C26, C22 C26, C62 C64, NA, B58 B60, NA, A21, B10, N…
$ embarked       <fct> s, s, c, s, c, c, s, c, c, s, s, s, s, s, s, c, s, c, s…
data_val <- data_val |>
  class_titanic() |> 
  glimpse()
Rows: 246
Columns: 12
$ passenger_id   <dbl> 2, 12, 18, 20, 21, 31, 42, 51, 52, 84, 85, 97, 101, 103…
$ survived       <fct> yes, yes, yes, no, yes, no, yes, yes, no, yes, no, no, …
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allison, Master. Hudson Trevor", "Astor, Mrs. John Jac…
$ sex            <fct> male, female, female, male, male, male, female, female,…
$ age            <dbl> 0.9167, 18.0000, 50.0000, 36.0000, 37.0000, 45.0000, 44…
$ n_sib_spouse   <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 3, 0…
$ n_parent_child <dbl> 2, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0…
$ ticket         <fct> 113781, PC 17757, PC 17558, 13050, 11751, 113784, PC 17…
$ fare           <dbl> 151.5500, 227.5250, 247.5208, 75.2417, 52.5542, 35.5000…
$ cabin          <fct> C22 C26, C62 C64, B58 B60, C6, D35, T, B4, B51 B53 B55,…
$ embarked       <fct> s, c, c, c, s, s, c, c, s, s, c, c, c, c, c, c, c, s, s…
data_test <- data_test |>
  class_titanic() |> 
  glimpse()
Rows: 327
Columns: 12
$ passenger_id   <dbl> 1, 5, 8, 9, 10, 15, 24, 25, 29, 33, 35, 45, 47, 49, 50,…
$ survived       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allen, Miss. Elisabeth Walton", "Allison, Mrs. Hudson …
$ sex            <fct> female, female, male, female, male, male, female, femal…
$ age            <dbl> 29, 25, 39, 53, 71, 80, 42, 29, 35, 30, 42, 41, NA, 53,…
$ n_sib_spouse   <dbl> 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0…
$ n_parent_child <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0…
$ ticket         <fct> 24160, 113781, 112050, 11769, PC 17609, 27042, PC 17757…
$ fare           <dbl> 211.3375, 151.5500, 0.0000, 51.4792, 49.5042, 30.0000, …
$ cabin          <fct> B5, C22 C26, A36, C101, NA, A23, NA, C97, C99, C7, D22,…
$ embarked       <fct> s, s, s, s, c, s, c, s, s, s, s, c, s, c, c, s, s, c, s…
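
As a quick supplementary check (not required), we can confirm the factor levels came out as intended and match across sets:

# confirm the explicit level order we set for pclass
levels(data_trn$pclass)

# confirm survived has the same levels in the validation and test sets
identical(levels(data_trn$survived), levels(data_val$survived))
identical(levels(data_trn$survived), levels(data_test$survived))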

Here, we get creative and make some additional predictors (title, sparse_title, and group_size) that we hypothesize might help in predicting survived.

We could also have done this in our recipe. Perhaps most importantly, the modeling EDA that informed these predictors was BASED ON TRAINING DATA ONLY.

add_predictors <- function(d){
  d |> 
    # total number of family members traveling with the passenger
    mutate(group_size = n_parent_child + n_sib_spouse) |> 
    # pull the title (e.g., "Mr", "Miss") out of each name
    mutate(title = str_to_lower(str_extract(name,
                     "(?<=[:punct:][:space:])[:alpha:]{2,}(?=[:punct:])"))) |> 
    # collapse the many unique titles into a sparse version
    mutate(sparse_title = case_when(
             title == "master" ~ "boy",
             title %in% c("miss", "ms") ~ "young_woman",
             title %in% c("dona", "lady", "mrs") ~ "adult_woman",
             title %in% c("mr", "sir", "don", "jonkheer") ~ "adult_man",
             title %in% c("col", "dr", "major", "rev") ~ "professional",
             TRUE ~ NA_character_)) |>  
    # fill in missing sparse_title values using age and sex
    mutate(sparse_title = case_when(
             is.na(sparse_title) & age < 12 & sex == "male" ~ "boy",
             is.na(sparse_title) & age < 18 & sex == "female" ~ "young_woman",
             is.na(sparse_title) & sex == "male" ~ "adult_man",
             is.na(sparse_title) & sex == "female" ~ "adult_woman",
             !is.na(sparse_title) ~ sparse_title,
             TRUE ~ NA_character_)) |>
    mutate(across(where(is.character), factor))
}
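
To see what that regular expression does, here's a quick check on a single name (supplementary). The lookbehind requires a punctuation character followed by a space (the ", " after the surname), [:alpha:]{2,} grabs the title itself, and the lookahead requires trailing punctuation (the "." after the title):

# extracts "Miss" (the title sits between ", " and "." in each name)
str_extract("Allison, Miss. Helen Loraine",
            "(?<=[:punct:][:space:])[:alpha:]{2,}(?=[:punct:])")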

Now, we’ll apply this function to all of our sets.

data_trn <- data_trn |> 
  add_predictors() |>  
  glimpse()
Rows: 736
Columns: 15
$ passenger_id   <dbl> 3, 4, 11, 16, 17, 26, 39, 40, 41, 46, 53, 54, 59, 61, 6…
$ survived       <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allison, Miss. Helen Loraine", "Allison, Mr. Hudson Jo…
$ sex            <fct> female, male, male, male, male, male, male, male, male,…
$ age            <dbl> 2, 30, 47, NA, 24, 25, 41, 48, NA, 45, 28, 17, 49, 36, …
$ n_sib_spouse   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
$ n_parent_child <dbl> 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ ticket         <fct> 113781, 113781, PC 17757, PC 17318, PC 17558, 13905, 11…
$ fare           <dbl> 151.5500, 151.5500, 227.5250, 25.9250, 247.5208, 26.000…
$ cabin          <fct> C22 C26, C22 C26, C62 C64, NA, B58 B60, NA, A21, B10, N…
$ embarked       <fct> s, s, c, s, c, c, s, c, c, s, s, s, s, s, s, c, s, c, s…
$ group_size     <dbl> 3, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 0…
$ title          <fct> miss, mr, col, mr, mr, mr, mr, mr, dr, major, mr, mr, m…
$ sparse_title   <fct> young_woman, adult_man, professional, adult_man, adult_…
data_val <- data_val |> 
  add_predictors() |>  
  glimpse()
Rows: 246
Columns: 15
$ passenger_id   <dbl> 2, 12, 18, 20, 21, 31, 42, 51, 52, 84, 85, 97, 101, 103…
$ survived       <fct> yes, yes, yes, no, yes, no, yes, yes, no, yes, no, no, …
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allison, Master. Hudson Trevor", "Astor, Mrs. John Jac…
$ sex            <fct> male, female, female, male, male, male, female, female,…
$ age            <dbl> 0.9167, 18.0000, 50.0000, 36.0000, 37.0000, 45.0000, 44…
$ n_sib_spouse   <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 3, 0…
$ n_parent_child <dbl> 2, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0…
$ ticket         <fct> 113781, PC 17757, PC 17558, 13050, 11751, 113784, PC 17…
$ fare           <dbl> 151.5500, 227.5250, 247.5208, 75.2417, 52.5542, 35.5000…
$ cabin          <fct> C22 C26, C62 C64, B58 B60, C6, D35, T, B4, B51 B53 B55,…
$ embarked       <fct> s, c, c, c, s, s, c, c, s, s, c, c, c, c, c, c, c, s, s…
$ group_size     <dbl> 3, 1, 1, 0, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 1, 0, 0, 5, 0…
$ title          <fct> master, mrs, mrs, mr, mr, mr, mrs, mrs, mr, mrs, mr, mr…
$ sparse_title   <fct> boy, adult_woman, adult_woman, adult_man, adult_man, ad…
data_test <- data_test |> 
  add_predictors() |>  
  glimpse()
Rows: 327
Columns: 15
$ passenger_id   <dbl> 1, 5, 8, 9, 10, 15, 24, 25, 29, 33, 35, 45, 47, 49, 50,…
$ survived       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pclass         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ name           <fct> "Allen, Miss. Elisabeth Walton", "Allison, Mrs. Hudson …
$ sex            <fct> female, female, male, female, male, male, female, femal…
$ age            <dbl> 29, 25, 39, 53, 71, 80, 42, 29, 35, 30, 42, 41, NA, 53,…
$ n_sib_spouse   <dbl> 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0…
$ n_parent_child <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0…
$ ticket         <fct> 24160, 113781, 112050, 11769, PC 17609, 27042, PC 17757…
$ fare           <dbl> 211.3375, 151.5500, 0.0000, 51.4792, 49.5042, 30.0000, …
$ cabin          <fct> B5, C22 C26, A36, C101, NA, A23, NA, C97, C99, C7, D22,…
$ embarked       <fct> s, s, s, s, c, s, c, s, s, s, s, c, s, c, c, s, s, c, s…
$ group_size     <dbl> 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 3, 0, 0…
$ title          <fct> miss, mrs, mr, mrs, mr, mr, miss, miss, miss, miss, mr,…
$ sparse_title   <fct> young_woman, adult_woman, adult_man, adult_woman, adult…

Create tracking tibble

Create an empty tracking tibble to track the validation error across the various model configurations you will fit.

error_val <- tibble(model = character(),
                    acc_val = numeric()) |> 
  glimpse()
Rows: 0
Columns: 2
$ model   <chr> 
$ acc_val <dbl> 

Understanding Accuracy (Example)

Before we build any models, let's talk a bit about accuracy. Accuracy is simply the proportion of correctly predicted observations, (TP + TN) / N, counting both true positives and true negatives. When outcomes are imbalanced, accuracy is not our best performance metric, because a model can score well just by always predicting the majority class. We'll learn about more advanced performance metrics in Unit 6, but for now let's demonstrate why accuracy can be problematic.

Look at the distribution of our outcome in our training data. It's not balanced!

data_trn |> 
  plot_bar("survived")

If we build a “model” that uses no predictors (an intercept-only recipe), it will predict the more probable class (“no”, did not survive) for everyone. This is the classification equivalent of an intercept-only regression predicting the mean when we don’t have any other data from which to make predictions.

rec_example <- recipe(survived ~ 1, data_trn)

rec_prep_example <- rec_example |> 
  prep(training = data_trn)

feat_trn_example <- rec_prep_example |> 
  bake(new_data = NULL)

feat_val_example <- rec_prep_example |> 
  bake(new_data = data_val)

fit_knn_example <- 
  nearest_neighbor(neighbors = 20) |> 
  set_engine("kknn")|> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn_example)


predict(fit_knn_example, feat_val_example)$.pred_class
  [1] no no no no no no no no no no no no no no no no no no no no no no no no no
 [26] no no no no no no no no no no no no no no no no no no no no no no no no no
 [51] no no no no no no no no no no no no no no no no no no no no no no no no no
 [76] no no no no no no no no no no no no no no no no no no no no no no no no no
[101] no no no no no no no no no no no no no no no no no no no no no no no no no
[126] no no no no no no no no no no no no no no no no no no no no no no no no no
[151] no no no no no no no no no no no no no no no no no no no no no no no no no
[176] no no no no no no no no no no no no no no no no no no no no no no no no no
[201] no no no no no no no no no no no no no no no no no no no no no no no no no
[226] no no no no no no no no no no no no no no no no no no no no no
Levels: no yes

When we get the accuracy for this model…

accuracy_vec(feat_val_example$survived, 
             predict(fit_knn_example, feat_val_example)$.pred_class)
[1] 0.6178862
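
A confusion matrix (conf_mat() from yardstick, which tidymodels loads) makes the same point concretely; every prediction lands in the “no” row (supplementary, and a small preview of Unit 6):

# cross-tabulate truth against the no-information model's predictions
tibble(truth = feat_val_example$survived,
       estimate = predict(fit_knn_example, feat_val_example)$.pred_class) |> 
  conf_mat(truth = truth, estimate = estimate)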

… we see that we get almost 62% accuracy with no other information! This example merely illustrates that a) accuracy is not the best performance metric, especially for unbalanced outcomes; and b) knowing your outcome distribution is important for assessing how well your model is actually doing. For example, if we got accuracies in the low- to mid-60s, we could safely assume that we are not in fact doing well, because our no-information model can do almost that well on its own.

KNN Model 1

We start by feature engineering a variable title (see above and the modeling EDA key for more information). We extract the passengers’ titles from their names (e.g., “Mr” or “Miss”). This variable captures a combination of age and sex because titles differ by both; that also means we can use the age and sex variables to fill in values where title is missing. Because there are many unique titles, we collapse them into a sparse version (sparse_title). This isn’t required for the homework, but we show it to you here for learning purposes.

This meets the homework requirements for:

- Categorical > 2 levels
- Modified categorical
- More creative feature engineering

Set up recipe

Informed by EDA from your modeling EDA script, create a recipe for your first model.

# only using sparse_title in the model
rec_1 <- 
  recipe(survived ~ sparse_title, data = data_trn) |> 
  step_dummy(sparse_title)

Training feature matrix

Use the prep() function to prep your recipe.

rec_prep <- rec_1 |> 
  prep(training = data_trn)

Use the bake() function to generate the feature matrix of the training data.

feat_trn <- rec_prep |> 
  bake(new_data = NULL)
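
If you want to confirm what step_dummy() produced, peek at the baked feature names (supplementary). step_dummy() drops the reference level, so we get one fewer dummy column than sparse_title has levels:

# dummy features are named sparse_title_<level>
feat_trn |> names()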

Fit your model

Fit a knn model predicting survived from your training feature matrix.

fit_knn_1_20nn <- 
  nearest_neighbor(neighbors = 20) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn)

Validation feature matrix

Use bake() to generate the feature matrix of the validation data that we will use to assess your model.

feat_val <- rec_prep |> 
  bake(new_data = data_val)

Assess your model

Use accuracy_vec() to calculate the validation error (accuracy) of your model. Add this value to your validation error tracking tibble.

# accuracy in validation set
accuracy_vec(feat_val$survived, predict(fit_knn_1_20nn, feat_val, type = "class")$.pred_class)
[1] 0.6178862
error_val <- add_row(error_val,
                     model = "knn_1_20nn",
                     acc_val = accuracy_vec(feat_val$survived, 
                                            predict(fit_knn_1_20nn, feat_val)$.pred_class))

error_val
# A tibble: 1 × 2
  model      acc_val
  <chr>        <dbl>
1 knn_1_20nn   0.618

KNN model configuration 2

We build off recipe 1 by adding two numeric features: fare, and a new variable called group_size that combines n_sib_spouse and n_parent_child. We Yeo-Johnson transform both (based on the distributions from modeling EDA) and then range-scale them to [0, 1] (needed because KNN is distance-based).

This model meets the following homework requirements:

- 2+ numeric predictors
- Categorical > 2 levels
- Modified categorical
- More creative feature engineering

Set up recipe

Informed by EDA from your modeling EDA, create a recipe for your second model.

rec_2 <- recipe(survived ~ sparse_title + fare + group_size, data = data_trn) |> 
  step_dummy(sparse_title) |> 
  step_impute_median(fare) |> # one missing value needs imputation
  step_YeoJohnson(c(group_size, fare)) |>   # deal with high skew
  step_range(c(group_size, fare))

Training feature matrix

Use the prep() function to prep your recipe.

rec_prep <- rec_2 |> 
  prep(training = data_trn)
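
As a supplementary aside, recipes lets us inspect what a prepped step learned via tidy(); here number = 3 refers to step_YeoJohnson()'s position in rec_2 (dummy, impute, Yeo-Johnson, range):

# view the Yeo-Johnson lambdas estimated from the training data
rec_prep |> 
  tidy(number = 3)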

Use the bake() function to generate the feature matrix of the training data.

feat_trn <- rec_prep |> 
  bake(new_data = NULL)

Fit your model

Fit a knn model predicting survived from your training feature matrix.

fit_knn_20nn_2 <- 
  nearest_neighbor(neighbors = 20) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn)

Validation feature matrix

Use bake() to generate the feature matrix of the validation data that we will use to assess your model.

feat_val <- rec_prep |> 
  bake(new_data = data_val)

Assess your model

Use accuracy_vec() to calculate the validation error (accuracy) of your model. Add this value to your validation error tracking tibble.

accuracy_vec(feat_val$survived, predict(fit_knn_20nn_2, feat_val, type = "class")$.pred_class)
[1] 0.7804878
error_val <- add_row(error_val,
                     model = "fit_knn_20nn_2",
                     acc_val = accuracy_vec(feat_val$survived, 
                                            predict(fit_knn_20nn_2, feat_val)$.pred_class))

error_val
# A tibble: 2 × 2
  model          acc_val
  <chr>            <dbl>
1 knn_1_20nn       0.618
2 fit_knn_20nn_2   0.780

KNN model configuration 3

For the final model configuration, we add in pclass using dummy features. Note that given the monotonic relationship we identified in modeling EDA, we could have kept pclass as a numeric variable to reduce the number of features. We tried that first; it didn’t predict as well, so we switched to using dummy codes instead. This is the great part about having a validation set: we didn’t know the right choice in advance, but we were able to test multiple ways of feature engineering pclass (i.e., multiple model configurations) and let out-of-sample performance tell us what to do!

We also iterate through a few values of k for this configuration and show you how we found a good value.

This model meets the following homework requirements:

- 2+ numeric predictors
- Categorical > 2 levels
- Modified categorical
- 3+ values of k
- More creative feature engineering

Set up recipe

Informed by EDA from your modeling EDA, create a recipe for your third model.

rec_3 <- recipe(survived ~ sparse_title + fare + group_size + pclass, 
                data = data_trn) |> 
  step_dummy(c(sparse_title, pclass)) |>
  step_impute_median(fare) |> # impute missing value in fare
  step_YeoJohnson(c(group_size, fare)) |> # deal with high skew
  step_range(c(group_size, fare)) 
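
For reference, the numeric-pclass alternative we tried first could look something like the sketch below (rec_3_numeric is a name we introduce here just for illustration; this is the configuration that predicted worse, not the one we kept):

# alternative (not kept): treat pclass as a single numeric feature
rec_3_numeric <- recipe(survived ~ sparse_title + fare + group_size + pclass, 
                        data = data_trn) |> 
  step_mutate(pclass = as.numeric(as.character(pclass))) |> 
  step_dummy(sparse_title) |> 
  step_impute_median(fare) |> 
  step_YeoJohnson(c(group_size, fare)) |> 
  step_range(c(group_size, fare, pclass)) # pclass needs range scaling for KNN too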

Training feature matrix

Use the prep() function to prep your recipe.

rec_prep_3 <- rec_3 |> 
  prep(training = data_trn)

Use the bake() function to generate the feature matrix of the training data.

feat_trn_3 <- rec_prep_3 |> 
  bake(new_data = NULL)

Fit your model

Fit a knn model predicting survived from your training feature matrix.

fit_knn_10nn_3 <- nearest_neighbor(neighbors = 10) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn_3)

fit_knn_15nn_3 <- nearest_neighbor(neighbors = 15) |> 
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(survived ~ ., data = feat_trn_3)

fit_knn_20nn_3 <- nearest_neighbor(neighbors = 20) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn_3)

fit_knn_25nn_3 <- nearest_neighbor(neighbors = 25) |> 
  set_engine("kknn") |> 
  set_mode("classification") |>
  fit(survived ~ ., data = feat_trn_3)

fit_knn_30nn_3 <- nearest_neighbor(neighbors = 30) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn_3)

fit_knn_35nn_3 <- nearest_neighbor(neighbors = 35) |> 
  set_engine("kknn") |> 
  set_mode("classification") |> 
  fit(survived ~ ., data = feat_trn_3)
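
Since these six fits differ only in neighbors, the copy-paste could be avoided with a small helper (a sketch; fit_knn_k and fits_3 are names we introduce here):

# hypothetical helper: fit a KNN with a given k on the model-3 features
fit_knn_k <- function(k, feat) {
  nearest_neighbor(neighbors = k) |> 
    set_engine("kknn") |> 
    set_mode("classification") |> 
    fit(survived ~ ., data = feat)
}

# fit all six configurations into a named list
fits_3 <- c(10, 15, 20, 25, 30, 35) |> 
  set_names(\(k) str_c("fit_knn_", k, "nn_3")) |> 
  map(fit_knn_k, feat = feat_trn_3)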

Validation feature matrix

Use bake() to generate the feature matrix of the validation data that we will use to assess your model.

feat_val_3 <- rec_prep_3 |> 
  bake(new_data = data_val)

Assess your model

Use accuracy_vec() to calculate the validation error (accuracy) of your model. Add this value to your validation error tracking tibble.

accuracy_vec(feat_val_3$survived, predict(fit_knn_10nn_3, feat_val_3)$.pred_class)
[1] 0.7804878
accuracy_vec(feat_val_3$survived, predict(fit_knn_15nn_3, feat_val_3)$.pred_class)
[1] 0.7804878
accuracy_vec(feat_val_3$survived, predict(fit_knn_20nn_3, feat_val_3)$.pred_class)
[1] 0.796748
accuracy_vec(feat_val_3$survived, predict(fit_knn_25nn_3, feat_val_3)$.pred_class)
[1] 0.796748
accuracy_vec(feat_val_3$survived, predict(fit_knn_30nn_3, feat_val_3)$.pred_class)
[1] 0.800813
accuracy_vec(feat_val_3$survived, predict(fit_knn_35nn_3, feat_val_3)$.pred_class)
[1] 0.7926829
error_val <- add_row(error_val,
                     model = "fit_knn_10nn_3",
                     acc_val = accuracy_vec(feat_val_3$survived, 
                                            predict(fit_knn_10nn_3, 
                                                    feat_val_3)$.pred_class)) 
  
error_val <- add_row(error_val,
          model = "fit_knn_15nn_3",
          acc_val = accuracy_vec(feat_val_3$survived, 
                                 predict(fit_knn_15nn_3, 
                                         feat_val_3)$.pred_class))  
error_val <- add_row(error_val,
          model = "fit_knn_20nn_3",
          acc_val = accuracy_vec(feat_val_3$survived, 
                                 predict(fit_knn_20nn_3, 
                                         feat_val_3)$.pred_class))  
error_val <- add_row(error_val,
          model = "fit_knn_25nn_3",
          acc_val = accuracy_vec(feat_val_3$survived, 
                                 predict(fit_knn_25nn_3, 
                                         feat_val_3)$.pred_class))  
error_val <- add_row(error_val,
          model = "fit_knn_30nn_3",
          acc_val = accuracy_vec(feat_val_3$survived, 
                                 predict(fit_knn_30nn_3, 
                                         feat_val_3)$.pred_class))
error_val <- add_row(error_val,
          model = "fit_knn_35nn_3",
          acc_val = accuracy_vec(feat_val_3$survived, 
                                 predict(fit_knn_35nn_3, 
                                         feat_val_3)$.pred_class))


error_val
# A tibble: 8 × 2
  model          acc_val
  <chr>            <dbl>
1 knn_1_20nn       0.618
2 fit_knn_20nn_2   0.780
3 fit_knn_10nn_3   0.780
4 fit_knn_15nn_3   0.780
5 fit_knn_20nn_3   0.797
6 fit_knn_25nn_3   0.797
7 fit_knn_30nn_3   0.801
8 fit_knn_35nn_3   0.793
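
As a related aside, the block of add_row() calls above could be collapsed with purrr, reusing the hypothetical fits_3 list from the earlier sketch (error_val_3 is a new name, and list_rbind() requires purrr >= 1.0):

# score every fit in the named list and bind the rows into one tibble
error_val_3 <- fits_3 |> 
  imap(\(fit, name) tibble(
         model = name,
         acc_val = accuracy_vec(feat_val_3$survived,
                                predict(fit, feat_val_3)$.pred_class))) |> 
  list_rbind()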

Example: Visualizing performance by hyperparameter value

Once we are tuning hyperparameters (see Units 5 & 6!), we’ll be able to visualize performance across hyperparameter values automatically. For now, we create the plot manually.

tibble(neighbors = c(10, 15, 20, 25, 30, 35),
       accuracy = error_val$acc_val[3:8]) |> 
  ggplot(aes(x = neighbors, y = accuracy)) +
  geom_line()

Accuracy peaks at k = 30.

Generate test predictions

Our best model is the 30NN with recipe 3, so we’ll generate predictions here!

best_model <- fit_knn_30nn_3
best_rec <- rec_3 # best recipe
last_name <- "TA"
if (!is.null(best_model)){
  
  # Make test set features with the best recipe
  rec_prep <- best_rec |> 
    prep(training = data_trn)
  
  feat_test <- rec_prep |> 
    bake(new_data = data_test)
  
  # Generate predictions made by the best model
  test_preds <- data_test |> 
    mutate(survived = predict(best_model, feat_test)$.pred_class) |> 
    select(passenger_id, survived) # passenger_id is the id variable used to match predictions
  
  # Save predictions as a csv file with your last name in the file name
  # (write_csv() from readr, unlike write.csv(), does not add a row-names column)
  test_preds |> 
    write_csv(here::here(path_data, str_c("test_preds_", 
                                          last_name,
                                          ".csv")))
}
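
Finally, a quick way to confirm the submission file wrote correctly is to read it straight back (supplementary):

# sanity check: re-read the saved predictions and confirm the two columns
read_csv(here::here(path_data, str_c("test_preds_", last_name, ".csv")),
         col_types = cols()) |> 
  glimpse()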