2 Introduction to Regression Models

  • Introduction to key instructional dataset: Ames Housing Prices

  • Use of root mean square error (RMSE) in training and test sets for model performance evaluation

  • The General Linear Model as a machine learning model

    • Extensions to categorical variables
      • Dummy coding
      • One-hot coding
    • Extensions to interactive and non-linear effects of predictors
  • K Nearest Neighbour (KNN)

    • Hyperparameter k
    • Normalization of predictors
    • Extensions to categorical variables

2.1 The Ames Housing Prices Dataset

We will use the Ames Housing Prices dataset as a running example in this unit (and in some future units and homeworks as well)

  • You can read more about the original dataset created by Dean DeCock

  • This dataset is now hosted by Kaggle

    • Kaggle is an online community of data scientists and machine learners.
    • It allows users to find and publish data sets, explore and build models, compete for best models, and share documented solutions
    • We will use Kaggle a lot. It's a great resource.
  • The data set contains data from sales of individual residential properties in Ames, Iowa from 2006 to 2010

  • The original data set includes 2930 observations of sales price and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous)

  • Here is the original codebook from Kaggle


For the competition, Kaggle provides us with a dataset that includes 1460 observations. Kaggle retained the remaining observations to evaluate model performance for models submitted to its competition.

Why?

  • Models may become very overfit to the dataset they give us when developers compare many, many different model configurations by brute force. The true metric for how well a model performs is its performance in new data that it hasn't seen. By not releasing the held-out observations, Kaggle ensures that developers have never seen the data used to evaluate their models.

Our goal is to build a machine learning model that predicts the sale price for houses from future sales

To begin this project we need to:

  • Load the packages we will need
  • Open the data
  • Tidy the variable names

NOTE: redefine \(select()\) because of \(MASS\) package conflict

library(caret)
library(tidyverse)
library(janitor)

select <- dplyr::select

data_path <- "data"
ames <- read_csv(file.path(data_path, "ames.csv")) %>% 
  clean_names(case = "snake") %>% 
  select(-id)

We will also make a tibble to track training and test error across the models we fit

rmse <- tibble(model = character(), rmse_train = numeric(), rmse_test = numeric()) %>% 
  glimpse()
## Observations: 0
## Variables: 3
## $ model      <chr> 
## $ rmse_train <dbl> 
## $ rmse_test  <dbl>

We will fit various regression models to the data. These will include:

  • Simple linear regression
  • Multiple regression
  • Multiple regressions with interactions, transformations, polynomial predictors
  • K Nearest Neighbour (KNN), a non-parametric model

To fit and then evaluate/compare these models, we need training and test data

Why?

  • Once again, these models will differ in how much they overfit the training set. As p increases, as we include polynomials and interactions, and as we move to non-parametric models, these models will fit the training data well, but perhaps only by fitting the noise. As such, they may not work well with new data. We need new data to evaluate that.

To build models that will work well in new data (e.g., the data that Kaggle has retained from us):

  • We can split the dataset Kaggle gave to us (N=1460) into a private training and test set for our own use

  • We will fit models in train

  • We will evaluate them in test

  • Can use an 80/20 split using the rsample package

  • Can stratify on the outcome

  • Here is the code

    • Note the use of \(set.seed()\) to get reproducible splits
library(rsample)

set.seed(19690127)
ames_splits <- initial_split(ames, prop = .8, strata = "sale_price")

train_all <- training(ames_splits)
test <- testing(ames_splits)

This results in N = 1170 in train and N = 290 in test

dim(train_all)
## [1] 1170   80
dim(test)
## [1] 290  80

Let's take a quick look at the available variables in the training data

train_all %>% 
  glimpse()
## Observations: 1,170
## Variables: 80
## $ ms_sub_class    <dbl> 60, 60, 50, 20, 50, 190, 20, 20, 20, 45, 90, 20, 20, 60, 45, 20,…
## $ ms_zoning       <chr> "RL", "RL", "RL", "RL", "RM", "RL", "RL", "RL", "RL", "RM", "RL"…
## $ lot_frontage    <dbl> 65, 68, 85, 75, 51, 50, 70, NA, NA, 51, 72, 66, 70, 101, 57, 75,…
## $ lot_area        <dbl> 8450, 11250, 14115, 10084, 6120, 7420, 11200, 12968, 10920, 6120…
## $ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ alley           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Grvl", …
## $ lot_shape       <chr> "Reg", "IR1", "IR1", "Reg", "Reg", "Reg", "Reg", "IR2", "IR1", "…
## $ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "…
## $ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "All…
## $ lot_config      <chr> "Inside", "Inside", "Inside", "Inside", "Inside", "Corner", "Ins…
## $ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "…
## $ neighborhood    <chr> "CollgCr", "CollgCr", "Mitchel", "Somerst", "OldTown", "BrkSide"…
## $ condition1      <chr> "Norm", "Norm", "Norm", "Norm", "Artery", "Artery", "Norm", "Nor…
## $ condition2      <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Artery", "Norm", "Norm"…
## $ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "2fmCon", "1Fam", "1Fam"…
## $ house_style     <chr> "2Story", "2Story", "1.5Fin", "1Story", "1.5Fin", "1.5Unf", "1St…
## $ overall_qual    <dbl> 7, 7, 5, 8, 7, 5, 5, 5, 6, 7, 4, 5, 5, 8, 7, 8, 5, 5, 8, 5, 8, 5…
## $ overall_cond    <dbl> 5, 5, 5, 5, 5, 6, 5, 6, 5, 8, 5, 5, 6, 5, 7, 5, 7, 8, 5, 7, 5, 6…
## $ year_built      <dbl> 2003, 2001, 1993, 2004, 1931, 1939, 1965, 1962, 1960, 1929, 1967…
## $ year_remod_add  <dbl> 2003, 2002, 1995, 2005, 1950, 1950, 1965, 1962, 1960, 2001, 1967…
## $ roof_style      <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "Hip", "Hi…
## $ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "CompShg"…
## $ exterior1st     <chr> "VinylSd", "VinylSd", "VinylSd", "VinylSd", "BrkFace", "MetalSd"…
## $ exterior2nd     <chr> "VinylSd", "VinylSd", "VinylSd", "VinylSd", "Wd Shng", "MetalSd"…
## $ mas_vnr_type    <chr> "BrkFace", "BrkFace", "None", "Stone", "None", "None", "None", "…
## $ mas_vnr_area    <dbl> 196, 162, 0, 186, 0, 0, 0, 0, 212, 0, 0, 0, 0, 380, 0, 281, 0, 0…
## $ exter_qual      <chr> "Gd", "Gd", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"…
## $ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"…
## $ foundation      <chr> "PConc", "PConc", "Wood", "PConc", "BrkTil", "BrkTil", "CBlock",…
## $ bsmt_qual       <chr> "Gd", "Gd", "Gd", "Ex", "TA", "TA", "TA", "TA", "TA", "TA", NA, …
## $ bsmt_cond       <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", NA, …
## $ bsmt_exposure   <chr> "No", "Mn", "No", "Av", "No", "No", "No", "No", "No", "No", NA, …
## $ bsmt_fin_type1  <chr> "GLQ", "GLQ", "GLQ", "GLQ", "Unf", "GLQ", "Rec", "ALQ", "BLQ", "…
## $ bsmt_fin_sf1    <dbl> 706, 486, 732, 1369, 0, 851, 906, 737, 733, 0, 0, 646, 504, 0, 0…
## $ bsmt_fin_type2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "…
## $ bsmt_fin_sf2    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 668, 0, 486, …
## $ bsmt_unf_sf     <dbl> 150, 434, 64, 317, 952, 140, 134, 175, 520, 832, 0, 468, 525, 11…
## $ total_bsmt_sf   <dbl> 856, 920, 796, 1686, 952, 991, 1040, 912, 1253, 832, 0, 1114, 10…
## $ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ heating_qc      <chr> "Ex", "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "TA", "TA", "Ex", "TA"…
## $ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "FuseF", "SBrkr", "SBrkr", "…
## $ x1st_flr_sf     <dbl> 856, 920, 796, 1694, 1022, 1077, 1040, 912, 1253, 854, 1296, 111…
## $ x2nd_flr_sf     <dbl> 854, 866, 566, 0, 752, 0, 0, 0, 0, 0, 0, 0, 0, 1218, 0, 0, 0, 0,…
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gr_liv_area     <dbl> 1710, 1786, 1362, 1694, 1774, 1077, 1040, 912, 1253, 854, 1296, …
## $ bsmt_full_bath  <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ full_bath       <dbl> 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 2, 1, 2, 1…
## $ half_bath       <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bedroom_abv_gr  <dbl> 3, 3, 1, 3, 2, 2, 3, 2, 2, 2, 2, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 2…
## $ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ kitchen_qual    <chr> "Gd", "Gd", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"…
## $ tot_rms_abv_grd <dbl> 8, 6, 5, 7, 8, 5, 5, 4, 5, 5, 6, 6, 6, 9, 6, 7, 6, 6, 7, 5, 7, 6…
## $ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Min1", "Typ", "Typ", "Typ", "Typ", …
## $ fireplaces      <dbl> 0, 1, 0, 1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 2…
## $ fireplace_qu    <chr> NA, "TA", NA, "Gd", "TA", "TA", NA, NA, "Fa", NA, NA, NA, NA, "G…
## $ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Det…
## $ garage_yr_blt   <dbl> 2003, 2001, 1993, 2004, 1931, 1939, 1965, 1962, 1960, 1991, 1967…
## $ garage_finish   <chr> "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "Unf", "Unf", "RFn", "…
## $ garage_cars     <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 3, 1, 2, 2, 1, 3, 2, 3, 1…
## $ garage_area     <dbl> 548, 608, 480, 636, 468, 205, 384, 352, 352, 576, 516, 576, 294,…
## $ garage_qual     <chr> "TA", "TA", "TA", "TA", "Fa", "Gd", "TA", "TA", "TA", "TA", "TA"…
## $ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"…
## $ paved_drive     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ wood_deck_sf    <dbl> 0, 0, 40, 255, 90, 0, 0, 140, 0, 48, 0, 0, 0, 240, 0, 171, 100, …
## $ open_porch_sf   <dbl> 61, 42, 30, 57, 0, 4, 0, 0, 213, 112, 0, 102, 0, 154, 0, 159, 11…
## $ enclosed_porch  <dbl> 0, 0, 0, 0, 205, 0, 0, 0, 176, 0, 0, 0, 0, 0, 205, 0, 0, 0, 0, 0…
## $ x3ssn_porch     <dbl> 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch    <dbl> 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_qc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fence           <chr> NA, NA, "MnPrv", NA, NA, NA, NA, NA, "GdWo", "GdPrv", NA, NA, "M…
## $ misc_feature    <chr> NA, NA, "Shed", NA, NA, NA, NA, NA, NA, NA, "Shed", NA, NA, NA, …
## $ misc_val        <dbl> 0, 0, 700, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mo_sold         <dbl> 2, 9, 10, 8, 4, 1, 2, 9, 5, 7, 10, 6, 5, 11, 6, 9, 6, 5, 7, 5, 5…
## $ yr_sold         <dbl> 2008, 2008, 2009, 2007, 2008, 2008, 2008, 2008, 2008, 2007, 2006…
## $ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD"…
## $ sale_condition  <chr> "Normal", "Normal", "Normal", "Normal", "Abnorml", "Normal", "No…
## $ sale_price      <dbl> 208500, 223500, 143000, 307000, 129900, 118000, 129500, 144000, …

Data exploration is VERY important when given a new dataset

  • We will return in later lectures to common methods for exploratory data analysis

  • One critical step is to consider missing data

    • We don't have much missing data, and for now we will only use variables with no missing data
    • Some of this missing data is not really missing but instead represents another level of the variable (e.g., \(pool\_qc\) is missing when the home has no pool)
    • In other instances, we need to impute or drop those variables (more in later units)
train_all %>% 
  summarize_all(~ sum(is.na(.))) %>% 
  glimpse()
## Observations: 1
## Variables: 80
## $ ms_sub_class    <int> 0
## $ ms_zoning       <int> 0
## $ lot_frontage    <int> 205
## $ lot_area        <int> 0
## $ street          <int> 0
## $ alley           <int> 1096
## $ lot_shape       <int> 0
## $ land_contour    <int> 0
## $ utilities       <int> 0
## $ lot_config      <int> 0
## $ land_slope      <int> 0
## $ neighborhood    <int> 0
## $ condition1      <int> 0
## $ condition2      <int> 0
## $ bldg_type       <int> 0
## $ house_style     <int> 0
## $ overall_qual    <int> 0
## $ overall_cond    <int> 0
## $ year_built      <int> 0
## $ year_remod_add  <int> 0
## $ roof_style      <int> 0
## $ roof_matl       <int> 0
## $ exterior1st     <int> 0
## $ exterior2nd     <int> 0
## $ mas_vnr_type    <int> 7
## $ mas_vnr_area    <int> 7
## $ exter_qual      <int> 0
## $ exter_cond      <int> 0
## $ foundation      <int> 0
## $ bsmt_qual       <int> 27
## $ bsmt_cond       <int> 27
## $ bsmt_exposure   <int> 28
## $ bsmt_fin_type1  <int> 27
## $ bsmt_fin_sf1    <int> 0
## $ bsmt_fin_type2  <int> 28
## $ bsmt_fin_sf2    <int> 0
## $ bsmt_unf_sf     <int> 0
## $ total_bsmt_sf   <int> 0
## $ heating         <int> 0
## $ heating_qc      <int> 0
## $ central_air     <int> 0
## $ electrical      <int> 1
## $ x1st_flr_sf     <int> 0
## $ x2nd_flr_sf     <int> 0
## $ low_qual_fin_sf <int> 0
## $ gr_liv_area     <int> 0
## $ bsmt_full_bath  <int> 0
## $ bsmt_half_bath  <int> 0
## $ full_bath       <int> 0
## $ half_bath       <int> 0
## $ bedroom_abv_gr  <int> 0
## $ kitchen_abv_gr  <int> 0
## $ kitchen_qual    <int> 0
## $ tot_rms_abv_grd <int> 0
## $ functional      <int> 0
## $ fireplaces      <int> 0
## $ fireplace_qu    <int> 545
## $ garage_type     <int> 60
## $ garage_yr_blt   <int> 60
## $ garage_finish   <int> 60
## $ garage_cars     <int> 0
## $ garage_area     <int> 0
## $ garage_qual     <int> 60
## $ garage_cond     <int> 60
## $ paved_drive     <int> 0
## $ wood_deck_sf    <int> 0
## $ open_porch_sf   <int> 0
## $ enclosed_porch  <int> 0
## $ x3ssn_porch     <int> 0
## $ screen_porch    <int> 0
## $ pool_area       <int> 0
## $ pool_qc         <int> 1165
## $ fence           <int> 941
## $ misc_feature    <int> 1127
## $ misc_val        <int> 0
## $ mo_sold         <int> 0
## $ yr_sold         <int> 0
## $ sale_type       <int> 0
## $ sale_condition  <int> 0
## $ sale_price      <int> 0

It is also critical to look closely at your outcome variable: \(sale\_price\)

Here is standard ggplot code for a histogram using minimal ink

library(ggthemes)
theme_set(theme_classic())

ggplot(data=train_all, aes(sale_price)) + 
  geom_histogram()


But you can also do quick plots in ggplot

qplot(train_all$sale_price)


\(summary()\) is useful too

summary(train_all$sale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129925  163000  182404  214000  755000

And now let's start considering machine learning models…


2.2 The General Linear Model

We will start with a review of the use of the general linear model (GLM) as a machine learning model because you should be very familiar with this statistical model at this point


2.2.1 Simple Regression

We will start with simple linear regression with one predictor

\[Y = \beta_0 + \beta_1*X_1 + \epsilon\]

Applied to our regression problem, we might fit a model such as:

\[sale\_price = \beta_0 + \beta_1*gr\_liv\_area + \epsilon\]

We need to estimate \(\beta_0\) and \(\beta_1\) using our training dataset

You already know how to do this using \(lm()\) in base R. However, we will use \(train()\) from the caret package.
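
For comparison, here is a minimal sketch of the equivalent base R call (sr1_lm is just an illustrative name; the coefficient estimates should match the caret fit below):

sr1_lm <- lm(sale_price ~ gr_liv_area, data = train_all)  # ordinary least squares in base R
coef(sr1_lm)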


We use the caret package because:

  • It provides a consistent interface to hundreds of types of machine learning models
  • It provides very strong and easy pre-processing routines (e.g., missing data, scaling, transformations, near-zero variance, collinearity)
  • It simplifies cross validation
  • It supports numerous performance metrics
  • It makes life easy
  • …but we need to fully understand what it is doing because sometimes we need to do things outside of caret

We need to:

  • Build a tibble with Y and all Xs (e.g., train_sr1)
  • Set up a train control object that will be shared by many models
  • Fit the model in the training dataset
train_sr1 <- select(train_all, sale_price, gr_liv_area)

trn_ctrl <- trainControl(method = "none")

sr1 <- train(sale_price ~ gr_liv_area, data = train_sr1,
            method = "glm", trControl = trn_ctrl)

\(train()\) returns a model object that contains a lot of information

You can begin to explore these on your own

names(sr1)
##  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
##  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
## [11] "finalModel"   "preProcess"   "trainingData" "resample"     "resampledCM" 
## [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "xlevels"
names(sr1$finalModel)
##  [1] "coefficients"      "residuals"         "fitted.values"     "effects"          
##  [5] "R"                 "rank"              "qr"                "family"           
##  [9] "linear.predictors" "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"     "df.residual"      
## [17] "df.null"           "y"                 "converged"         "boundary"         
## [21] "model"             "formula"           "terms"             "data"             
## [25] "offset"            "control"           "method"            "contrasts"        
## [29] "xlevels"           "xNames"            "problemType"       "tuneValue"        
## [33] "obsLevels"         "param"

We can get the coefficients for this model using an extractor function

This is generally good practice in R

coef(sr1$finalModel)
## (Intercept) gr_liv_area 
##   9461.4933    113.8802

This indicates that our prediction model is: \[\hat{sale\_price} = 9461.5 + 113.9 * gr\_liv\_area\]


We can also get parameter estimates, standard errors, and statistical tests of whether each \(\beta\) = 0 using \(summary()\)

summary(sr1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -357215   -29534     -561    22424   332983  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9461.49    5044.03   1.876   0.0609 .  
## gr_liv_area   113.88       3.14  36.270   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3167018547)
## 
##     Null deviance: 7.8654e+12  on 1169  degrees of freedom
## Residual deviance: 3.6991e+12  on 1168  degrees of freedom
## AIC: 28919
## 
## Number of Fisher Scoring iterations: 2

NOTE: We used the generalized linear model algorithm with the default (Gaussian) error distribution and default (identity) link function. It is solved by maximum likelihood but is statistically equivalent to the least squares solution.
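
That is, with a Gaussian error distribution and an identity link, maximizing the likelihood yields the same parameter estimates as minimizing the sum of squared errors:

\[\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]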


We can get the \(\hat{sale\_price}\) in our training dataset using \(predict()\)

Here are the first 10 values for \(\hat{sale\_price}\):

y_hat <- predict(sr1)

y_hat[1:10]
##        1        2        3        4        5        6        7        8        9       10 
## 204196.6 212851.5 164566.3 202374.5 211484.9 132110.4 127896.9 113320.2 152153.3 106715.2

How can we tell how well this model fits our training data?

  • We could plot the prediction line over the raw data to visualize the fit (see below). To quantify the fit, we need a performance metric. Root mean square error would tell us the average size of the errors in the original units, dollars
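
As a reminder, for n observations, RMSE is defined as:

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}\]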

What else do we really want to know?

  • Training error is not that informative. We want to know about performance/errors for this model in new data such as the test dataset

Here is a plot of \(\hat{sale\_price}\) by \(gr\_liv\_area\) superimposed over a scatterplot of the raw data from the training dataset

ggplot(data = train_sr1, aes(x = gr_liv_area)) +
  geom_point(aes(y = sale_price), color = "gray") +
  geom_line(aes(y = predict(sr1)), size = 1.25, color = "blue") + 
  ggtitle("Training")

What do you notice in this plot, why might this be, and what might you consider doing?

  • There is heteroscedasticity in the errors. The size of the errors increases with increasing predicted sale_price. We could transform (e.g., log) sale_price to correct this (see the quick sketch below). In fact, Kaggle does this in their competition for this dataset and you will do this on the homework. For class, we will continue to use raw sale_price. Pros/Cons?
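
Here is a minimal sketch of what that transformation would look like (for illustration only; sr1_log is an illustrative name and we do not carry this model forward). The spread of the errors around the prediction line should become more uniform on the log scale:

sr1_log <- train(log(sale_price) ~ gr_liv_area, data = train_sr1,
                 method = "glm", trControl = trn_ctrl)

ggplot(data = train_sr1, aes(x = gr_liv_area)) +
  geom_point(aes(y = log(sale_price)), color = "gray") +
  geom_line(aes(y = predict(sr1_log, train_sr1)), size = 1.25, color = "blue") + 
  ggtitle("Training (log sale_price)")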

Here is code to calculate RMSE explicitly (for understanding) in our training dataset

rmse_demo <- tibble(sale_price = train_sr1$sale_price, 
                     predicted = y_hat, 
                     diff = sale_price - predicted, 
                     diff_sqr = diff^2)
rmse_demo
## # A tibble: 1,170 x 4
##    sale_price predicted    diff     diff_sqr
##         <dbl>     <dbl>   <dbl>        <dbl>
##  1     208500   204197.   4303.    18519487.
##  2     223500   212851.  10649.   113391297.
##  3     143000   164566. -21566.   465104250.
##  4     307000   202374. 104626. 10946497347.
##  5     129900   211485. -81585.  6656096405.
##  6     118000   132110. -14110.   199104212.
##  7     129500   127897.   1603.     2570048.
##  8     144000   113320.  30680.   941249991.
##  9     157000   152153.   4847.    23490133.
## 10     132000   106715.  25285.   639323500.
## # … with 1,160 more rows
sqrt(sum(rmse_demo$diff_sqr)/nrow(rmse_demo))
## [1] 56228.15

However, you can use the \(RMSE()\) function in the caret package for this more easily

rmse_train_sr1 <- RMSE(predict(sr1), train_sr1$sale_price)
rmse_train_sr1 
## [1] 56228.15

We recognize that this model may be somewhat overfit to the training data set

Do you expect overfitting to be a big issue here? Why or why not?

  • In this instance, there shouldn't be much (if any) overfitting. The model (simple linear regression) is not very flexible, p is low (1 predictor), and the training sample size is fairly large (N = 1170). This will very rarely be true in real modeling situations

Regardless, we care most about how this model will perform in new data

  • Note that for new data we need to provide the new data to \(predict()\)
  • Note use of \(kable()\) for simple tables in R Markdown
rmse_test_sr1 <- RMSE(predict(sr1, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "simple regression", rmse_train = rmse_train_sr1, rmse_test = rmse_test_sr1))
library(knitr)
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92

The average error for our model’s predictions is about $56K. Not very good.


2.2.2 Extension to Multiple Linear Regression

We can improve model performance by moving from simple linear regression to multiple regression

names(train_all)
##  [1] "ms_sub_class"    "ms_zoning"       "lot_frontage"    "lot_area"       
##  [5] "street"          "alley"           "lot_shape"       "land_contour"   
##  [9] "utilities"       "lot_config"      "land_slope"      "neighborhood"   
## [13] "condition1"      "condition2"      "bldg_type"       "house_style"    
## [17] "overall_qual"    "overall_cond"    "year_built"      "year_remod_add" 
## [21] "roof_style"      "roof_matl"       "exterior1st"     "exterior2nd"    
## [25] "mas_vnr_type"    "mas_vnr_area"    "exter_qual"      "exter_cond"     
## [29] "foundation"      "bsmt_qual"       "bsmt_cond"       "bsmt_exposure"  
## [33] "bsmt_fin_type1"  "bsmt_fin_sf1"    "bsmt_fin_type2"  "bsmt_fin_sf2"   
## [37] "bsmt_unf_sf"     "total_bsmt_sf"   "heating"         "heating_qc"     
## [41] "central_air"     "electrical"      "x1st_flr_sf"     "x2nd_flr_sf"    
## [45] "low_qual_fin_sf" "gr_liv_area"     "bsmt_full_bath"  "bsmt_half_bath" 
## [49] "full_bath"       "half_bath"       "bedroom_abv_gr"  "kitchen_abv_gr" 
## [53] "kitchen_qual"    "tot_rms_abv_grd" "functional"      "fireplaces"     
## [57] "fireplace_qu"    "garage_type"     "garage_yr_blt"   "garage_finish"  
## [61] "garage_cars"     "garage_area"     "garage_qual"     "garage_cond"    
## [65] "paved_drive"     "wood_deck_sf"    "open_porch_sf"   "enclosed_porch" 
## [69] "x3ssn_porch"     "screen_porch"    "pool_area"       "pool_qc"        
## [73] "fence"           "misc_feature"    "misc_val"        "mo_sold"        
## [77] "yr_sold"         "sale_type"       "sale_condition"  "sale_price"

Let's expand our model to also include \(lot\_area\), \(year\_built\), \(overall\_qual\), and \(garage\_cars\)

Notes:

  • Use of ‘.’ in the model formula in \(train()\)
  • Use of earlier established trn_ctrl object
train_mr1 <- train_all %>%   
  select(sale_price, 
         gr_liv_area, 
         lot_area, 
         year_built, 
         overall_qual, 
         garage_cars)

mr1 <- train(sale_price ~ ., data = train_mr1,
            method = "glm", trControl = trn_ctrl)

This yields:

coef(mr1$finalModel)
##   (Intercept)   gr_liv_area      lot_area    year_built  overall_qual   garage_cars 
## -8.521196e+05  5.831055e+01  9.986847e-01  3.888901e+02  2.348331e+04  1.449096e+04

\[\hat{sale\_price} = -8.521196\times 10^{5} + 58.3 * gr\_liv\_area + 1 * lot\_area + 388.9 * year\_built + 2.34833\times 10^{4} * overall\_qual + 1.4491\times 10^{4} * garage\_cars\]

Compared with our previous simple regression model: \[\hat{sale\_price} = 9461.5 + 113.9 * gr\_liv\_area\]

What are the implications of the likely correlations among many of these predictors?

  • The multiple regression model coefficients represent unique effects, controlling for all other variables in the model. You can see how the unique effect of gr_liv_area is smaller than its overall effect from the simple regression. This also means that the overall predictive strength of the model won't be the sum of the effects of each predictor considered in isolation - it will likely be less. Also, if the correlations are high, problems with multicollinearity will emerge. This will yield large standard errors, which means the models will have more variance when fit in different training datasets! We will soon learn about regularized versions of the GLM that don't have these issues with correlated predictors.

A useful visualization of the correlation matrix is provided by \(corrplot.mixed()\) from the corrplot package

library(corrplot)
  
corrplot.mixed(cor(train_mr1))

Of course, multicollinearity and SE inflation for any variable are based on its \(R^2\) when predicted from all of the other predictors
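
Specifically, the standard error inflation for predictor \(j\) is indexed by its variance inflation factor, where \(R_j^2\) comes from regressing predictor \(j\) on all of the other predictors:

\[VIF_j = \frac{1}{1 - R_j^2}\]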


How well does this more complex model do in:

  • training?
  • and test?
rmse_train_mr1 <- RMSE(predict(mr1), train_mr1$sale_price)
rmse_test_mr1 <- RMSE(predict(mr1, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "multiple regression", rmse_train = rmse_train_mr1, rmse_test = rmse_test_mr1))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77

What are your observations about these performance results?

  • First, you can see that this still fairly simple multiple regression model is beginning to overfit the training data. You can tell because its performance in test (average error of ~$41K) is worse than in training (~$39K). However, the more complex model is still clearly better at making predictions in new data (average prediction error of ~$41K vs. ~$56K). It seems to be a better model

2.2.3 Extension to Categorical Predictors

Many important predictors in our models may be categorical (i.e., nominal) rather than quantitative (e.g., ordinal, interval, ratio)

How do you handle categorical predictors in linear regression?

  • We use a categorical coding scheme such as dummy codes, contrast codes, etc. This yields multiple ‘regressors’ (or features) for each categorical predictor. Specifically, an m-level categorical predictor results in m-1 regressors

For example, with dummy coding:

Race Race1 Race2
Black 1 0
White 0 1
Other 0 0

Notes:

  • Any full rank set of m-1 regressors, regardless of coding system, yields exactly the same predictions
  • Preference among coding systems is simply about obtaining single-df contrasts of theoretical importance
  • The final (mth) regressor is not included because it is completely redundant (perfectly multicollinear) with the other regressors. Including it would also prevent the model from fitting (the ‘dummy variable trap’).
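
A quick way to see the regressors R generates is \(model.matrix()\). Here is a minimal sketch using a hypothetical race variable (not part of the Ames data) with Other as the reference level:

race <- factor(c("Black", "White", "Other"), levels = c("Other", "Black", "White"))
model.matrix(~ race)  # intercept plus m - 1 = 2 dummy regressors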

Let's consider some potentially important categorical predictors for our model

table(train_all$ms_zoning)
## 
## C (all)      FV      RH      RL      RM 
##       8      57      12     913     180
table(train_all$lot_config)
## 
##  Corner CulDSac     FR2     FR3  Inside 
##     212      78      34       4     842
table(train_all$bldg_type)
## 
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##    981     24     43     34     88

There are many ways to code dummy variables in R. We focus on two:

  • Using \(factor()\) and \(contrasts()\) from base R
  • Using \(dummyVars()\) from caret package

For both methods, we will calculate the dummy codes in the full data and then re-split into train/test (Why?)


You can use \(factor()\) and \(contrasts()\) from base R to set contrasts for specific variables

This converts character variables to factors and sets contrasts

It does NOT add the dummy predictors directly to the tibble. R handles that internally.

  • Good for including interactions with these factors
  • Need to consider how missing data are handled
ames <- ames %>% 
  mutate(ms_zoning = factor(ms_zoning),
         lot_config = factor(lot_config),
         bldg_type = factor(bldg_type))
contrasts(ames$ms_zoning) <- contr.treatment(length(levels(ames$ms_zoning)))
contrasts(ames$lot_config) <- contr.treatment(length(levels(ames$lot_config)))
contrasts(ames$bldg_type) <- contr.treatment(length(levels(ames$bldg_type)))

names(ames)
##  [1] "ms_sub_class"    "ms_zoning"       "lot_frontage"    "lot_area"       
##  [5] "street"          "alley"           "lot_shape"       "land_contour"   
##  [9] "utilities"       "lot_config"      "land_slope"      "neighborhood"   
## [13] "condition1"      "condition2"      "bldg_type"       "house_style"    
## [17] "overall_qual"    "overall_cond"    "year_built"      "year_remod_add" 
## [21] "roof_style"      "roof_matl"       "exterior1st"     "exterior2nd"    
## [25] "mas_vnr_type"    "mas_vnr_area"    "exter_qual"      "exter_cond"     
## [29] "foundation"      "bsmt_qual"       "bsmt_cond"       "bsmt_exposure"  
## [33] "bsmt_fin_type1"  "bsmt_fin_sf1"    "bsmt_fin_type2"  "bsmt_fin_sf2"   
## [37] "bsmt_unf_sf"     "total_bsmt_sf"   "heating"         "heating_qc"     
## [41] "central_air"     "electrical"      "x1st_flr_sf"     "x2nd_flr_sf"    
## [45] "low_qual_fin_sf" "gr_liv_area"     "bsmt_full_bath"  "bsmt_half_bath" 
## [49] "full_bath"       "half_bath"       "bedroom_abv_gr"  "kitchen_abv_gr" 
## [53] "kitchen_qual"    "tot_rms_abv_grd" "functional"      "fireplaces"     
## [57] "fireplace_qu"    "garage_type"     "garage_yr_blt"   "garage_finish"  
## [61] "garage_cars"     "garage_area"     "garage_qual"     "garage_cond"    
## [65] "paved_drive"     "wood_deck_sf"    "open_porch_sf"   "enclosed_porch" 
## [69] "x3ssn_porch"     "screen_porch"    "pool_area"       "pool_qc"        
## [73] "fence"           "misc_feature"    "misc_val"        "mo_sold"        
## [77] "yr_sold"         "sale_type"       "sale_condition"  "sale_price"
contrasts(ames$ms_zoning)
##         2 3 4 5
## C (all) 0 0 0 0
## FV      1 0 0 0
## RH      0 1 0 0
## RL      0 0 1 0
## RM      0 0 0 1

You can also do all character and factor variables at once by using \(mutate\_if()\) and setting appropriate contrasts in \(options()\)

Note that this will override any factor contrasts you have set and change your default contrast options. Adjust accordingly.

options(contrasts=c('contr.treatment','contr.poly'))
ames <- ames %>% 
  mutate_if(is.factor, as.character) %>% 
  mutate_if(is.character, as.factor)

You can use \(dummyVars()\) from caret to create explicit dummy code variables:

  • Note use of \(fullRank = TRUE\) (not default) to get true dummy codes (vs. one-hot codes)
  • Can use with character or factor variables
  • This will create new columns in the tibble for the dummy variables
  • This also retains original character variable in the tibble (modify as needed)
  • BUG REPORT: \(sep\) does not currently work with character variables. May want to change to factor first to retain snake case variable names.
ames <- read_csv(file.path(data_path, "ames.csv")) %>% 
  clean_names(case = "snake") %>% 
  select(-id) %>% 
  mutate_if(is.character, as.factor) #to address variable name bug in dummyVars()

dmy_codes <- dummyVars(~ ms_zoning + lot_config + bldg_type, fullRank = TRUE, sep = "_", data = ames)

ames <- ames %>% 
  bind_cols(as_tibble(predict(dmy_codes, ames))) %>% 
  glimpse()
## Observations: 1,460
## Variables: 92
## $ ms_sub_class       <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20, 20, …
## $ ms_zoning          <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, RL, R…
## $ lot_frontage       <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, NA, 5…
## $ lot_area           <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 74…
## $ street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, P…
## $ alley              <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lot_shape          <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg, IR1, I…
## $ land_contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, L…
## $ utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPu…
## $ lot_config         <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner, Ins…
## $ land_slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ neighborhood       <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, Somerst…
## $ condition1         <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery, Arte…
## $ condition2         <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Artery,…
## $ bldg_type          <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2fmCon,…
## $ house_style        <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Story, 2Stor…
## $ overall_qual       <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5, 5, 8…
## $ overall_cond       <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5, 6, 5…
## $ year_built         <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 1939, 1…
## $ year_remod_add     <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 1950, 1…
## $ roof_style         <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gable…
## $ roof_matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, CompShg…
## $ exterior1st        <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd, VinylSd…
## $ exterior2nd        <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd, VinylSd…
## $ mas_vnr_type       <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, Stone, No…
## $ mas_vnr_area       <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, 212, …
## $ exter_qual         <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, TA, T…
## $ exter_cond         <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T…
## $ foundation         <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBlock, Brk…
## $ bsmt_qual          <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, Gd, TA, T…
## $ bsmt_cond          <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T…
## $ bsmt_exposure      <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, Av, No, N…
## $ bsmt_fin_type1     <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec, GLQ, A…
## $ bsmt_fin_sf1       <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 998, 73…
## $ bsmt_fin_type2     <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf, Unf, U…
## $ bsmt_fin_sf2       <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ bsmt_unf_sf        <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 177, 17…
## $ total_bsmt_sf      <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 1040, 1…
## $ heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, G…
## $ heating_qc         <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, Ex, TA, E…
## $ central_air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, FuseF…
## $ x1st_flr_sf        <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, 1040,…
## $ x2nd_flr_sf        <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0, 0, 0…
## $ low_qual_fin_sf    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gr_liv_area        <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077, 1…
## $ bsmt_full_bath     <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0…
## $ bsmt_half_bath     <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ full_bath          <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 3…
## $ half_bath          <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1…
## $ bedroom_abv_gr     <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3, 3, 4…
## $ kitchen_abv_gr     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1…
## $ kitchen_qual       <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, TA, T…
## $ tot_rms_abv_grd    <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6, 6, …
## $ functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Typ, Typ, …
## $ fireplaces         <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0, 0, 1…
## $ fireplace_qu       <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, Gd, Fa, N…
## $ garage_type        <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attchd, Attch…
## $ garage_yr_blt      <dbl> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 1939, 1…
## $ garage_finish      <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf, Fin, U…
## $ garage_cars        <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, 1, 3…
## $ garage_area        <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 736, 3…
## $ garage_qual        <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, TA, TA, T…
## $ garage_cond        <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T…
## $ paved_drive        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ wood_deck_sf       <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160, 0, …
## $ open_porch_sf      <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213, 112,…
## $ enclosed_porch     <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, 0, 0,…
## $ x3ssn_porch        <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_area          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_qc            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ fence              <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA, NA, Gd…
## $ misc_feature       <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, NA, NA, N…
## $ misc_val           <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700, 500,…
## $ mo_sold            <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10, 6, …
## $ yr_sold            <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 2008, 2…
## $ sale_type          <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, New, WD,…
## $ sale_condition     <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Normal, Norm…
## $ sale_price         <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, 20000…
## $ ms_zoning_FV       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RH       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RL       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
## $ ms_zoning_RM       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ lot_config_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ lot_config_FR2     <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_FR3     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_Inside  <dbl> 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0…
## $ bldg_type_2fmCon   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_Duplex   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ bldg_type_Twnhs    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_TwnhsE   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

You can use \(dummyVars()\) for one-hot codes as well:

  • One-hot codes are less than full rank. They include one new dummy variable for EVERY level of the categorical variable.

  • Each dummy variable is coded 1 for its target level and zero for all other levels

For example, with one-hot coding:

Race Race1 Race2 Race3
Black 1 0 0
White 0 1 0
Other 0 0 1
  • This set will not work with types of models that don't regularize in some manner (see the sketch after this list)
    • Don't use with lm or glm models
    • Can (and should?) use with lasso, ridge, glmnet, svm models (more on these later)
  • Coefficients are not interpretable as the contrast of target vs. reference. (Instead, it's simply target vs. all others, controlling for all other variables?)
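
Here is a minimal sketch of that redundancy using hypothetical data (toy and toy_oh are illustrative names). Because the full one-hot set sums to the intercept column, \(lm()\) cannot estimate all of the coefficients and returns NA for one of them:

set.seed(123)
toy <- tibble(y = rnorm(6), race = rep(c("Black", "White", "Other"), 2))
toy_oh <- bind_cols(select(toy, y), 
                    as_tibble(model.matrix(~ race - 1, data = toy)))  # one indicator per level
coef(lm(y ~ ., data = toy_oh))  # one coefficient comes back NA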

Notes:

  • Note use of \(drop2nd = TRUE\)
  • Note use of ‘~ .’ to code all factor and character variables at the same time. Ignores quantitative variables. Be careful.
dmy_codes <- dummyVars(~ ., fullRank = FALSE, drop2nd = TRUE, data = ames)

ames <- ames %>% 
  bind_cols(as_tibble(predict(dmy_codes, ames))) %>% 
  glimpse()

Regardless of the method (we use the \(dummyVars()\) approach here), we follow with a re-split into train/test and make a subset of train for model fitting

Note use of same seed as before

set.seed(19690127)
ames_splits <- initial_split(ames, prop = .8, strata = "sale_price")
train_all <- training(ames_splits)
test <- testing(ames_splits)

train_mr2 <- train_all %>%   
  select(sale_price, 
         gr_liv_area, 
         lot_area, 
         year_built, 
         overall_qual, 
         garage_cars, 
         starts_with("ms_zoning"),         
         starts_with("lot_config"),
         starts_with("bldg_type")) %>% 
  select(-ms_zoning, -lot_config, -bldg_type) %>% 
  glimpse()
## Observations: 1,170
## Variables: 18
## $ sale_price         <dbl> 208500, 223500, 143000, 307000, 129900, 118000, 129500, 14400…
## $ gr_liv_area        <dbl> 1710, 1786, 1362, 1694, 1774, 1077, 1040, 912, 1253, 854, 129…
## $ lot_area           <dbl> 8450, 11250, 14115, 10084, 6120, 7420, 11200, 12968, 10920, 6…
## $ year_built         <dbl> 2003, 2001, 1993, 2004, 1931, 1939, 1965, 1962, 1960, 1929, 1…
## $ overall_qual       <dbl> 7, 7, 5, 8, 7, 5, 5, 5, 6, 7, 4, 5, 5, 8, 7, 8, 5, 5, 8, 5, 8…
## $ garage_cars        <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 3, 1, 2, 2, 1, 3, 2, 3…
## $ ms_zoning_FV       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RH       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RL       <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1…
## $ ms_zoning_RM       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…
## $ lot_config_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_FR2     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_FR3     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_Inside  <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1…
## $ bldg_type_2fmCon   <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_Duplex   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_Twnhs    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_TwnhsE   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…

Fit the model in training set

Evaluate the model in:

  • training
  • test
mr2 <- train(sale_price ~ ., data = train_mr2,
            method = "glm", trControl = trn_ctrl)


rmse_train_mr2 <- RMSE(predict(mr2), train_mr2$sale_price)
rmse_test_mr2 <- RMSE(predict(mr2, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "multiple regression w/categorical", rmse_train = rmse_train_mr2, rmse_test = rmse_test_mr2))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26

2.2.4 Extension to Interactive Models

There may be interactive effects among our predictors. These are easy to model as well.

For example, it may be that the relationship between \(year\_built\) and \(sale\_price\) depends on \(overall\_qual\). Let's include that in our model.

You can manually code interactions (as product terms) into the training and test sets

  • Review the training and test error for our various incremental models below
  • This interaction doesn't help
train_mr3 <- train_mr2 %>% 
  mutate(year_built_X_overall_qual = year_built * overall_qual)
test <- test %>% 
  mutate(year_built_X_overall_qual = year_built * overall_qual)

mr3_int1 <- train(sale_price ~ ., data = train_mr3,
            method = "glm", trControl = trn_ctrl)

rmse_train_mr3_int1 <- RMSE(predict(mr3_int1), train_mr3$sale_price)
rmse_test_mr3_int1 <- RMSE(predict(mr3_int1, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "multiple regression w/focal interaction", rmse_train = rmse_train_mr3_int1, rmse_test = rmse_test_mr3_int1))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83

Alternatively, you can include interactions as part of the formula in \(train()\)

  • Note that \(predict()\) needs to have training data explicitly provided. It has problems with \(year\_built:overall\_qual\) otherwise.

  • The comparison below shows that the two methods for coding interactions yield the same training error

train_mr3 <- train_mr3 %>% 
  select(-year_built_X_overall_qual)

mr3_int2 <- train(sale_price ~ . + year_built:overall_qual, data = train_mr3,
            method = "glm", trControl = trn_ctrl)

RMSE(predict(mr3_int1), train_mr3$sale_price)
## [1] 36060.21
RMSE(predict(mr3_int2, train_mr3), train_mr3$sale_price)
## [1] 36060.21

You can also code all interactions by formula.

  • Easy if you have only quantitative variables
  • Let's use train_mr1 (no categorical variables)
  • This produces a sizable reduction in our estimate of test error
mr3_int3 <- train(sale_price ~ .^2, data = train_mr1,
            method = "glm", trControl = trn_ctrl)

coef(mr3_int3$finalModel)
##                (Intercept)                gr_liv_area                   lot_area 
##               1.718428e+05              -1.212108e+02               2.000489e+01 
##                 year_built               overall_qual                garage_cars 
##              -2.235404e+01              -1.852575e+05               8.716401e+04 
##     `gr_liv_area:lot_area`   `gr_liv_area:year_built` `gr_liv_area:overall_qual` 
##              -2.643298e-03               4.959769e-02               1.595986e+01 
##  `gr_liv_area:garage_cars`      `lot_area:year_built`    `lot_area:overall_qual` 
##               1.668575e+00              -8.768245e-03               2.640634e-01 
##     `lot_area:garage_cars`  `year_built:overall_qual`   `year_built:garage_cars` 
##               8.062370e-01               8.575310e+01              -6.184345e+01 
## `overall_qual:garage_cars` 
##               6.425294e+03
rmse_train_mr3_int3 <- RMSE(predict(mr3_int3, train_mr1), train_mr1$sale_price)
rmse_test_mr3_int3 <- RMSE(predict(mr3_int3, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "multiple regression w/all quant interactions", rmse_train = rmse_train_mr3_int3, rmse_test = rmse_test_mr3_int3))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06

You can also do this with categorical variables in the model

  • Below, we demonstrate this using train_mr2

  • This method requires that you have contrasts coded for the categorical interactions rather than explicit dummy variables (Why?)

  • This will often not work well if there are many empty cells among crossings of categorical variables

    • Note the NAs for many coefficients of categorical interaction terms
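
NOTE: \(train\_mr3\_contrasts\) is assumed to be a version of the training data that retains ms_zoning, lot_config, and bldg_type as factors with treatment contrasts (rather than as explicit dummy columns). Here is a minimal sketch of one way it could be constructed:

train_mr3_contrasts <- train_all %>%   
  select(sale_price, 
         gr_liv_area, 
         lot_area, 
         year_built, 
         overall_qual, 
         garage_cars, 
         ms_zoning, 
         lot_config, 
         bldg_type)

contrasts(train_mr3_contrasts$ms_zoning) <- contr.treatment(length(levels(train_mr3_contrasts$ms_zoning)))
contrasts(train_mr3_contrasts$lot_config) <- contr.treatment(length(levels(train_mr3_contrasts$lot_config)))
contrasts(train_mr3_contrasts$bldg_type) <- contr.treatment(length(levels(train_mr3_contrasts$bldg_type)))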
mr3_int4 <- train(sale_price ~ .^2, data = train_mr3_contrasts,
            method = "glm", trControl = trn_ctrl)

coef(mr3_int4$finalModel)
##                (Intercept)                gr_liv_area                   lot_area 
##               3.931392e+06               1.493084e+02               3.485288e+01 
##                 year_built               overall_qual                garage_cars 
##              -2.033793e+03              -6.889771e+04              -1.539193e+04 
##                 ms_zoning2                 ms_zoning3                 ms_zoning4 
##              -1.009947e+07              -5.139194e+06              -4.460264e+06 
##                 ms_zoning5                lot_config2                lot_config3 
##              -4.961807e+06              -6.335447e+05               3.367930e+05 
##                lot_config4                lot_config5                 bldg_type2 
##               5.324519e+05               1.166233e+05              -1.957636e+05 
##                 bldg_type3                 bldg_type4                 bldg_type5 
##              -9.585297e+05               3.541480e+06              -1.439545e+05 
##     `gr_liv_area:lot_area`   `gr_liv_area:year_built` `gr_liv_area:overall_qual` 
##              -3.046236e-03              -1.396808e-01               1.780928e+01 
##  `gr_liv_area:garage_cars`   `gr_liv_area:ms_zoning2`   `gr_liv_area:ms_zoning3` 
##               8.740158e-01               1.348914e+02               1.623326e+02 
##   `gr_liv_area:ms_zoning4`   `gr_liv_area:ms_zoning5`  `gr_liv_area:lot_config2` 
##               1.084375e+02               8.779207e+01               2.036762e+01 
##  `gr_liv_area:lot_config3`  `gr_liv_area:lot_config4`  `gr_liv_area:lot_config5` 
##               2.816908e+00               7.813031e+01              -1.627033e+01 
##   `gr_liv_area:bldg_type2`   `gr_liv_area:bldg_type3`   `gr_liv_area:bldg_type4` 
##               1.662099e+01               3.844298e+00               4.711557e+00 
##   `gr_liv_area:bldg_type5`      `lot_area:year_built`    `lot_area:overall_qual` 
##               2.398514e+01              -1.792369e-03              -2.064597e-01 
##     `lot_area:garage_cars`      `lot_area:ms_zoning2`      `lot_area:ms_zoning3` 
##               1.799965e+00              -2.973522e+01              -2.851961e+01 
##      `lot_area:ms_zoning4`      `lot_area:ms_zoning5`     `lot_area:lot_config2` 
##              -2.684801e+01              -2.571080e+01               2.004276e-01 
##     `lot_area:lot_config3`     `lot_area:lot_config4`     `lot_area:lot_config5` 
##              -4.631749e-01              -5.649315e+00              -8.790018e-01 
##      `lot_area:bldg_type2`      `lot_area:bldg_type3`      `lot_area:bldg_type4` 
##              -1.230888e+00               1.565751e-01               2.872090e+01 
##      `lot_area:bldg_type5`  `year_built:overall_qual`   `year_built:garage_cars` 
##               6.598574e+00               2.303124e+01               3.259434e+00 
##    `year_built:ms_zoning2`    `year_built:ms_zoning3`    `year_built:ms_zoning4` 
##               5.258187e+03               2.721924e+03               2.375213e+03 
##    `year_built:ms_zoning5`   `year_built:lot_config2`   `year_built:lot_config3` 
##               2.653688e+03               3.131233e+02              -1.823959e+02 
##   `year_built:lot_config4`   `year_built:lot_config5`    `year_built:bldg_type2` 
##              -3.126781e+02              -1.062299e+02               8.845717e+01 
##    `year_built:bldg_type3`    `year_built:bldg_type4`    `year_built:bldg_type5` 
##               5.223817e+02              -1.829233e+03               3.769830e+01 
## `overall_qual:garage_cars`  `overall_qual:ms_zoning2`  `overall_qual:ms_zoning3` 
##               7.828153e+03              -1.259093e+04              -7.779410e+03 
##  `overall_qual:ms_zoning4`  `overall_qual:ms_zoning5` `overall_qual:lot_config2` 
##               3.327673e+03              -4.206880e+03               1.403645e+03 
## `overall_qual:lot_config3` `overall_qual:lot_config4` `overall_qual:lot_config5` 
##              -1.180189e+03                         NA               5.919450e+03 
##  `overall_qual:bldg_type2`  `overall_qual:bldg_type3`  `overall_qual:bldg_type4` 
##              -2.886776e+03              -1.423911e+04              -1.682317e+03 
##  `overall_qual:bldg_type5`   `garage_cars:ms_zoning2`   `garage_cars:ms_zoning3` 
##               5.408636e+03              -5.820519e+04              -2.515368e+04 
##   `garage_cars:ms_zoning4`   `garage_cars:ms_zoning5`  `garage_cars:lot_config2` 
##              -4.240034e+04              -3.378025e+04              -1.206981e+04 
##  `garage_cars:lot_config3`  `garage_cars:lot_config4`  `garage_cars:lot_config5` 
##               1.060910e+04                         NA               2.998137e+03 
##   `garage_cars:bldg_type2`   `garage_cars:bldg_type3`   `garage_cars:bldg_type4` 
##               1.553025e+03              -1.040857e+04               2.959597e+03 
##   `garage_cars:bldg_type5`   `ms_zoning2:lot_config2`   `ms_zoning3:lot_config2` 
##              -2.521286e+03               8.471734e+04                         NA 
##   `ms_zoning4:lot_config2`   `ms_zoning5:lot_config2`   `ms_zoning2:lot_config3` 
##                         NA                         NA               1.006448e+04 
##   `ms_zoning3:lot_config3`   `ms_zoning4:lot_config3`   `ms_zoning5:lot_config3` 
##                         NA               7.301991e+03                         NA 
##   `ms_zoning2:lot_config4`   `ms_zoning3:lot_config4`   `ms_zoning4:lot_config4` 
##                         NA                         NA                         NA 
##   `ms_zoning5:lot_config4`   `ms_zoning2:lot_config5`   `ms_zoning3:lot_config5` 
##                         NA               7.684834e+04               6.400559e+04 
##   `ms_zoning4:lot_config5`   `ms_zoning5:lot_config5`    `ms_zoning2:bldg_type2` 
##               8.548702e+04               8.125317e+04                         NA 
##    `ms_zoning3:bldg_type2`    `ms_zoning4:bldg_type2`    `ms_zoning5:bldg_type2` 
##              -7.913372e+04               1.979847e+04               1.885870e+04 
##    `ms_zoning2:bldg_type3`    `ms_zoning3:bldg_type3`    `ms_zoning4:bldg_type3` 
##                         NA              -5.327203e+04              -8.558187e+03 
##    `ms_zoning5:bldg_type3`    `ms_zoning2:bldg_type4`    `ms_zoning3:bldg_type4` 
##                         NA               2.187712e+04                         NA 
##    `ms_zoning4:bldg_type4`    `ms_zoning5:bldg_type4`    `ms_zoning2:bldg_type5` 
##               7.009452e+03                         NA              -3.236275e+04 
##    `ms_zoning3:bldg_type5`    `ms_zoning4:bldg_type5`    `ms_zoning5:bldg_type5` 
##               3.611213e+04              -2.387957e+04                         NA 
##   `lot_config2:bldg_type2`   `lot_config3:bldg_type2`   `lot_config4:bldg_type2` 
##                         NA                         NA                         NA 
##   `lot_config5:bldg_type2`   `lot_config2:bldg_type3`   `lot_config3:bldg_type3` 
##              -4.116650e+02               1.378558e+04               2.096798e+03 
##   `lot_config4:bldg_type3`   `lot_config5:bldg_type3`   `lot_config2:bldg_type4` 
##                         NA               4.689846e+03                         NA 
##   `lot_config3:bldg_type4`   `lot_config4:bldg_type4`   `lot_config5:bldg_type4` 
##              -8.086970e+03                         NA                         NA 
##   `lot_config2:bldg_type5`   `lot_config3:bldg_type5`   `lot_config4:bldg_type5` 
##              -3.071098e+04              -5.809059e+03                         NA 
##   `lot_config5:bldg_type5` 
##              -9.372096e+03

  • Not surprisingly, this is an example where the model doesn't work well in test because it is overfit to the training set
rmse_train_mr3_int4 <- RMSE(predict(mr3_int4, train_mr3_contrasts), train_mr3_contrasts$sale_price)
rmse_test_mr3_int4 <- RMSE(predict(mr3_int4, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "multiple regression w/all interactions", rmse_train = rmse_train_mr3_int4, rmse_test = rmse_test_mr3_int4))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06
multiple regression w/all interactions 29716.56 39941.31

2.2.5 Extension to Non-linear Models

We may also want to model non-linear effects of our predictors

  • There are many models that can be used for non-linear effects (e.g., support vector machines with radial basis function kernel)

  • Non-parametric models are good for this (e.g., KNN, Random Forest).

  • Non-linear effects can be accommodated in the GLM by:

    • Transformations of Y or X (more on this in later units)
    • Ordinal predictors can be coded with dummy variables
    • Quantitative variables can be split at a threshold
    • Polynomial contrasts for quantitative or categorical predictors (see \(poly()\); a brief sketch follows this list)
  • We will continue to explore these options throughout the course
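A minimal sketch (not from the original notes) of the \(poly()\) option, using the train_mr3_contrasts/test split and the trn_ctrl object from the models above; mr_poly is just an illustrative name

# Quadratic (orthogonal polynomial) effect of year_built plus a linear effect of overall_qual
mr_poly <- train(sale_price ~ poly(year_built, 2) + overall_qual, data = train_mr3_contrasts,
            method = "glm", trControl = trn_ctrl)

RMSE(predict(mr_poly, test), test$sale_price)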


2.3 Combining Sets of Predictors Using \(model.matrix()\)

Given our results to date, we might want to make a feature set that includes all 2-way interactions among our quantitative predictors plus additive effects of the categorical predictors

We can use model.matrix() to generate regressors for X for different formulas

Let's start from the beginning here, given what we have learned:

  • First, open the full data set and select down to the predictors we care about
ames <- read_csv(file.path(data_path, "ames.csv")) %>% 
  clean_names(case = "snake") %>% 
  select(-id) %>% 
  select(sale_price, 
         gr_liv_area, 
         lot_area, 
         year_built, 
         overall_qual, 
         garage_cars, 
         ms_zoning,         
         lot_config,
         bldg_type) 

  • Next, extract the quantitative predictors
  • Remove the outcome variable
  • Construct a tibble with all 2-way interactions among them
  • Remove intercept
quan <- ames %>% 
  select(-sale_price) %>% 
  select_if(is.numeric)

quan <- as_tibble(model.matrix(~ .^2, data = quan), .name_repair = "universal") %>% 
  select(-1) %>% 
  glimpse()
## Observations: 1,460
## Variables: 15
## $ gr_liv_area              <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1…
## $ lot_area                 <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 61…
## $ year_built               <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 1…
## $ overall_qual             <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5…
## $ garage_cars              <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2…
## $ gr_liv_area.lot_area     <dbl> 14449500, 12115200, 20092500, 16397350, 31343480, 19224…
## $ gr_liv_area.year_built   <dbl> 3425130, 2493712, 3573786, 3288055, 4396000, 2714466, 3…
## $ gr_liv_area.overall_qual <dbl> 11970, 7572, 12502, 12019, 17584, 6810, 13552, 14630, 1…
## $ gr_liv_area.garage_cars  <dbl> 3420, 2524, 3572, 5151, 6594, 2724, 3388, 4180, 3548, 1…
## $ lot_area.year_built      <dbl> 16925350, 18969600, 22511250, 18288250, 28520000, 28131…
## $ lot_area.overall_qual    <dbl> 59150, 57600, 78750, 66850, 114080, 70575, 80672, 72674…
## $ lot_area.garage_cars     <dbl> 16900, 19200, 22500, 28650, 42780, 28230, 20168, 20764,…
## $ year_built.overall_qual  <dbl> 14021, 11856, 14007, 13405, 16000, 9965, 16032, 13811, …
## $ year_built.garage_cars   <dbl> 4006, 3952, 4002, 5745, 6000, 3986, 4008, 3946, 3862, 1…
## $ overall_qual.garage_cars <dbl> 14, 12, 14, 21, 24, 10, 16, 14, 14, 5, 5, 27, 5, 21, 6,…

  • Extract the categorical predictors and code dummy variables
    • Convert them to factors first so that the \(sep\) argument works
dmy <- ames %>%
  select_if(is.character) %>% 
  mutate_all(as.factor)

dmy_codes <- dummyVars(~ ., fullRank = TRUE, sep = "_", data = dmy)

dmy <- as_tibble(predict(dmy_codes, dmy)) %>% 
  glimpse()
## Observations: 1,460
## Variables: 12
## $ ms_zoning_FV       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RH       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ms_zoning_RL       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
## $ ms_zoning_RM       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ lot_config_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ lot_config_FR2     <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_FR3     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lot_config_Inside  <dbl> 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0…
## $ bldg_type_2fmCon   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_Duplex   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ bldg_type_Twnhs    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ bldg_type_TwnhsE   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

  • Combine outcome, quan, and dmy
    • Note the use of ames["sale_price"] vs. ames$sale_price: single brackets return a one-column tibble (so \(bind_cols()\) keeps the column name), while $ returns a bare vector (see the quick illustration after this list)

    • Split into train and test
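A quick illustration of that difference, using the ames tibble loaded above

class(ames["sale_price"])  # a one-column tibble: "tbl_df" "tbl" "data.frame"
class(ames$sale_price)     # a bare numeric vector: "numeric"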

all <- bind_cols(ames["sale_price"], quan, dmy)

set.seed(19690127)
splits <- initial_split(all, prop = .8, strata = "sale_price")

train <- training(splits)
test <- testing(splits)

  • Fit the model
    • We can use '.' in the formula because every remaining column in train is a predictor
  • Evaluate the model
mr4 <- train(sale_price ~ ., data = train,
            method = "glm", trControl = trn_ctrl)

rmse_train_mr4 <- RMSE(predict(mr4), train$sale_price)
rmse_test_mr4 <- RMSE(predict(mr4, test), test$sale_price)

rmse <- bind_rows(rmse, tibble(model = "best? multiple regression", rmse_train = rmse_train_mr4, rmse_test = rmse_test_mr4))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06
multiple regression w/all interactions 29716.56 39941.31
best? multiple regression 31417.38 31045.39

It's a big improvement from where we started, with very little effort/thought/exploration

I expect you can do better (and you will get the chance!)


2.3.1 Other Regression Performance Metrics

An average RMSE of about 31K dollars is OK. At the least, you now have a pretty good sense of how well this model will work for prediction

What is this model’s \(R^2\)? What about its mean absolute error (MAE)? Which metric is most descriptive to you?

  • MAE is likely the most descriptive. RMSE has a conceptual link to the parameter estimation procedure for the GLM (but not for other approaches like KNN). \(R^2\) doesn't seem that descriptively useful to me because I have no sense of the prediction error without doing intermediate calculations from the MSE and the variance of Y. See below for the values

metrics <- postResample(predict(mr4, test), test$sale_price)

round(metrics["RMSE"], 0)
##  RMSE 
## 31045
round(metrics["Rsquared"], 3)
## Rsquared 
##    0.806
round(metrics["MAE"], 0)
##   MAE 
## 22868

\(RMSE = \sqrt{\frac{\sum_{i=1}^N (\hat{y}_i - y_i)^2}{N}}\)

\(MAE = \frac{\sum_{i=1}^N |\hat{y}_i - y_i|}{N}\)

\(R^2 = \mathrm{cor}(\hat{y}, y)^2\)

NOTE: \(R^2\) is calculated by caret by computing the correlation between the observed and predicted values (i.e., R) and squaring it. When the model is poor, this can lead to differences between this estimator and the more widely known estimate derived from linear regression models.

\(R^2 = 1-\frac{\sum_{i=1}^N(y_i - \hat{y}_i)^2}{\sum_{i=1}^N(y_i - \bar{y})^2}\)

Most notably, the correlation approach will not generate negative values of \(R^2\) (which are theoretically invalid). A comparison of these and other estimators can be found in Kvålseth (1985).
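For concreteness, here is a brief sketch (not in the original notes) that computes these metrics directly from the same test-set predictions used by \(postResample()\) above; preds and obs are just temporary names

preds <- predict(mr4, test)
obs <- test$sale_price

sqrt(mean((preds - obs)^2))                          # RMSE
mean(abs(preds - obs))                               # MAE
cor(preds, obs)^2                                    # R^2 as caret computes it (squared correlation)
1 - sum((obs - preds)^2)/sum((obs - mean(obs))^2)    # the more traditional R^2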


2.4 KNN Regression

K Nearest Neighbour

  • Is a non-parametric regression and classification model

  • Very simple but also powerful (commonly listed among the top 10 algorithms)

  • Algorithm “memorizes” the training set (lazy learning)

    • Lazy learning is most useful for large, continuously changing datasets that are queried often and have relatively few attributes (predictors) (e.g., online recommendation systems)
  • Prediction for any new observation is based on K most similar observations from the dataset


A simple example with simulated data from a linear DGP with one predictor

Includes:

  • DGP
  • Simple linear model
  • Red lines to represent three new observations (X = 10, 50, 90) to generate predictions via KNN
  • What would 5-NN predictions look like for each of these three new values of X?


KNN can easily accommodate non-linear relationships between predictors and outcomes

In fact, it can flexibly handle ANY shape of relationship

ggplot(data = demo, aes(x = x, y = y3)) +
  geom_line(aes(x = x, y = y3_dgp, color = "blue"), size = 1.5) +
  geom_smooth(aes(color = "green"), method = "lm", formula = y ~ x, se = FALSE) +
  geom_vline(xintercept = 10, color = "red", size = 1.5) +
  geom_vline(xintercept = 50, color = "red", size = 1.5) +
  geom_vline(xintercept = 90, color = "red", size = 1.5) +
  geom_point(aes(y = y3), color = "black") +
  scale_color_identity(name = "Model",
                     breaks = c("blue", "green"),
                     labels = c("DGP", "linear GLM"),
                     guide = "legend")


2.4.1 The hyperparameter k

KNN is our first example of a statistical model that includes a model hyperparameter, in this case ‘k’

  • Model hyperparameters differ from parameters in that they cannot be estimated while fitting the model to the training dataset

  • They must be set in advance

k = 5 is the smallest value in the default tuning grid for knn in caret (and is the value selected here)


Using the quadratic demo, 5-NN yields these predictions

knn_y2 <- train(y2 ~ x, data = demo,
            method = "knn", trControl = trn_ctrl)

knn_y2$bestTune
##   k
## 1 5
demo$y2_hat_k5 <- predict(knn_y2, demo)

ggplot(data = demo, aes(x = x, y = y2)) +
  geom_line(aes(x = x, y = y2_dgp, color = "blue"), size = 1.5) +
  geom_line(aes(x = x, y = y2_hat_k5, color = "red"), size = 1.5) +
  geom_point(aes(y = y2), color = "black") +
  scale_color_identity(name = "Model",
                     breaks = c("blue", "red"),
                     labels = c("DGP", "k = 5"),
                     guide = "legend")


How will the size of k influence model performance (e.g., bias, variance)?

  • Smaller values of k will tend to increase variance but decrease bias. Larger values of k will tend to decrease variance but increase bias. Need to find the sweet spot

How will k = 1 perform in training and test sets?

  • k = 1 will perfectly fit the training set. Therefore it is very dependent on the training set (high variance). Clearly, it will likely not do as well in test (it will be overfit to training). k needs to be larger if there is more noise (to average over more cases). k needs to be smaller if the relationships are complex. (More on choosing k by cross-validation in a later unit.)

k = 1

tune_grid <- crossing(k = 1)
tune_grid
## # A tibble: 1 x 1
##       k
##   <dbl>
## 1     1
knn_y2_k1 <- train(y2 ~ x, data = demo,
            method = "knn", trControl = trn_ctrl, tuneGrid = tune_grid)

demo$y2_hat_k1 <- predict(knn_y2_k1, demo)

k = 75

tune_grid <- crossing(k = 75)

knn_y2_k75 <- train(y2 ~ x, data = demo,
            method = "knn", trControl = trn_ctrl, tuneGrid = tune_grid)

demo$y2_hat_k75 <- predict(knn_y2_k75, demo)
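A sketch (following the plotting pattern used above for k = 5, not part of the original code) to compare the k = 1 and k = 75 fits to the DGP

ggplot(data = demo, aes(x = x, y = y2)) +
  geom_point(color = "black") +
  geom_line(aes(y = y2_dgp, color = "blue"), size = 1.5) +
  geom_line(aes(y = y2_hat_k1, color = "red"), size = 1) +
  geom_line(aes(y = y2_hat_k75, color = "green"), size = 1) +
  scale_color_identity(name = "Model",
                     breaks = c("blue", "red", "green"),
                     labels = c("DGP", "k = 1", "k = 75"),
                     guide = "legend")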


2.4.2 Defining “Nearest”

To make a prediction for some new observation, we need to identify the observations from the training set that are nearest to it

  • Need a distance measure to define “nearest”

  • There are a number of different distance measures available (e.g., Euclidean, Cosine, Chi-square, Minkowski)

  • Euclidean is most commonly used in KNN (and what is used by knn in caret)

  • IMPORTANT: We care only about:

    • Distance between a test observation and all the training observations
    • We need to find the k observations in training that are nearest to the test observation
    • Distance is defined based on these observations' features (predictors), not their outcomes

Euclidean distance between any two points is an n-dimensional extension of the Pythagorean formula (which applies directly with 2 predictors/2-dimensional space).

\(C^2 = A^2 + B^2\)

\(C = \sqrt{A^2 + B^2}\)

…where C is the distance between two points


The distance between 2 points (p and q) in two dimensions (2 predictors, \(x_1 = A\), \(x_2 = B\))

\(C = \sqrt{A^2 + B^2}\)

\(Distance = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}\)

\(Distance = \sqrt{(2 - 1)^2 + (5 - 2)^2}\)

\(Distance = 3.2\)


With one dimension (one predictor), the distance is simply the absolute difference in scores on that predictor (\(x_1\)) between p and q

\(Distance = \sqrt{(q_1 - p_1)^2}\)

\(Distance = \sqrt{(2 - 1)^2}\)

\(Distance = 1\)

N-dimensional generalization for n predictors:

\(Distance = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2}\)
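A minimal sketch of this calculation in R (the euclidean and knn_predict helpers are hypothetical illustrations, not functions from caret)

euclidean <- function(p, q) sqrt(sum((q - p)^2))

euclidean(c(1, 2), c(2, 5))   # 3.2, the two-dimensional example above

# For one new observation x_new, find the k training rows with the smallest
# distance (computed on the predictors only) and average their outcomes
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  distances <- apply(x_train, 1, euclidean, q = x_new)
  mean(y_train[order(distances)[1:k]])
}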


2.4.3 Normalization of X

Distance depends on the scales of all the predictors

  • We need to put them all on the same scale

  • For quantitative variables: typically range-correct them to put them on a [0, 1] scale. For any \(x_i\):

\[ x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}\]

  • KNN accommodates categorical variables as dummy variables
    • Dummy variables are already on this scale, so there is no need to normalize or otherwise rescale them (see the sketch below)
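A minimal sketch of this range correction (range_scale is a hypothetical helper; \(preProcess()\) below is what we actually use). The key point, revisited below, is that the min and max must come from the training data and then be re-used on the test data

range_scale <- function(x, x_min = min(x), x_max = max(x)) {
  (x - x_min)/(x_max - x_min)
}

# e.g., rescale a test-set column with training-set statistics (column name assumed):
# test$gr_liv_area <- range_scale(test$gr_liv_area,
#                                 x_min = min(train$gr_liv_area),
#                                 x_max = max(train$gr_liv_area))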

2.4.4 KNN with Ames Housing Prices

Let's first start with a fresh load of ames and new splits

ames <- read_csv(file.path(data_path, "ames.csv")) %>% 
  clean_names(case = "snake") %>% 
  select(-id) %>% 
  select(sale_price, 
         gr_liv_area, 
         lot_area, 
         year_built, 
         overall_qual, 
         garage_cars,
         ms_zoning,
         lot_config,
         bldg_type) %>% 
  mutate(ms_zoning = as.factor(ms_zoning),
         lot_config = as.factor(lot_config),
         bldg_type = as.factor(bldg_type))

contrasts(ames$ms_zoning) <- contr.treatment(length(levels(ames$ms_zoning)))
contrasts(ames$lot_config) <- contr.treatment(length(levels(ames$lot_config)))
contrasts(ames$bldg_type) <- contr.treatment(length(levels(ames$bldg_type)))

#Option to add explicit dummy codes instead of setting contrasts
# dmy_codes <- dummyVars(~ ms_zoning + lot_config + bldg_type, fullRank = TRUE, sep = "_", data = ames)
# ames <- ames %>% 
#   bind_cols(as_tibble(predict(dmy_codes, ames))) %>% 
#   select_if(is.numeric)

set.seed(19690127)
ames_splits <- initial_split(ames, prop = .8, strata = "sale_price")
train_knn <- training(ames_splits)
test_knn <- testing(ames_splits)

Normalize all quantitative variables to [0,1]

  • Dummy variables don't need to be normalized because they are already in the range [0, 1]

  • All statistics for transformations need to be based on training dataset

  • In this case, that is the min and max of each variable that will be normalized

  • This same normalization formula will then be applied to test using min and max based on train

normalize <- preProcess(train_knn, method = "range")

normalize
## Created from 1170 samples and 9 variables
## 
## Pre-processing:
##   - ignored (3)
##   - re-scaling to [0, 1] (6)
train_knn_norm <- predict(normalize, train_knn) %>% 
  mutate(sale_price = train_knn$sale_price) %>% 
  glimpse()
## Observations: 1,170
## Variables: 9
## $ sale_price   <dbl> 208500, 223500, 143000, 307000, 129900, 118000, 129500, 144000, 157…
## $ gr_liv_area  <dbl> 0.3169047, 0.3344081, 0.2367573, 0.3132197, 0.3316444, 0.1711193, 0…
## $ lot_area     <dbl> 0.04376836, 0.06090842, 0.07844638, 0.05377081, 0.02950539, 0.03746…
## $ year_built   <dbl> 0.9492754, 0.9347826, 0.8768116, 0.9565217, 0.4275362, 0.4855072, 0…
## $ overall_qual <dbl> 0.6666667, 0.6666667, 0.4444444, 0.7777778, 0.6666667, 0.4444444, 0…
## $ garage_cars  <dbl> 0.50, 0.50, 0.50, 0.50, 0.50, 0.25, 0.25, 0.25, 0.25, 0.50, 0.50, 0…
## $ ms_zoning    <fct> RL, RL, RL, RL, RM, RL, RL, RL, RL, RM, RL, RL, RL, RL, RM, RL, RM,…
## $ lot_config   <fct> Inside, Inside, Inside, Inside, Inside, Corner, Inside, Inside, Cor…
## $ bldg_type    <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2fmCon, 1Fam, 1Fam, 1Fam, 1Fam, Duple…
test_knn_norm <- predict(normalize, test_knn) %>% 
  mutate(sale_price = test_knn$sale_price) %>% 
  glimpse()
## Observations: 290
## Variables: 9
## $ sale_price   <dbl> 181500, 140000, 250000, 200000, 345000, 279500, 149000, 277500, 239…
## $ gr_liv_area  <dbl> 0.21372639, 0.31851681, 0.42929526, 0.40442193, 0.45831414, 0.26715…
## $ lot_area     <dbl> 0.05080803, 0.05050196, 0.07933399, 0.05559500, 0.06503428, 0.05724…
## $ year_built   <dbl> 0.7536232, 0.3115942, 0.9275362, 0.7318841, 0.9637681, 0.9710145, 0…
## $ overall_qual <dbl> 0.5555556, 0.6666667, 0.7777778, 0.6666667, 0.8888889, 0.6666667, 0…
## $ garage_cars  <dbl> 0.50, 0.75, 0.75, 0.50, 0.75, 0.75, 0.50, 0.50, 0.50, 0.50, 0.50, 0…
## $ ms_zoning    <fct> RL, RL, RL, RL, RL, RL, RL, RL, RL, FV, RL, RL, RL, RL, RL, RL, RM,…
## $ lot_config   <fct> FR2, Corner, FR2, Corner, Inside, Inside, CulDSac, Inside, CulDSac,…
## $ bldg_type    <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, Twnhs, 1Fam…

Why do we need to use only the training set to calculate the min and max for normalization?

  • The model should never 'see' the test data because those stand in for novel data, so nothing about the model (not even the data pre-processing) can depend on the test data. It is critical to enforce this, or information will bleed through from the test set and your estimate of model performance in new data may be overly optimistic!

But earlier, we calculated dummy codes using the full data before splitting into train/test. Why is that not a problem?

  • The dummy codes in train don't change based on the data in test, so the model fit in train isn't biased to better fit the test set because of this. We code them on the full data so that train and test both have the same dummy variables, based on all possible levels of each categorical variable. However, if a level doesn't occur in one of the sets, its associated dummy variable will be constant and not useful in that dataset (see the small illustration below).
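A tiny illustration (with a hypothetical factor, not a variable from ames)

f_train <- factor(c("A", "B", "A"), levels = c("A", "B", "C"))  # level "C" never occurs in this split

model.matrix(~ f_train)  # columns f_trainB and f_trainC are both created; f_trainC is all zeros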

Train a model using only quantitative variables

  • k = 5
  • NOTE: Still using same trn_ctrl as earlier with glm
knn5_quant <- train(sale_price ~ gr_liv_area + lot_area + year_built + overall_qual + garage_cars, data = train_knn_norm,
            method = "knn", trControl = trn_ctrl)

knn5_quant$bestTune
##   k
## 1 5
rmse_train_knn5_quant <- RMSE(predict(knn5_quant, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn5_quant <- RMSE(predict(knn5_quant, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/quant , k = 5", rmse_train = rmse_train_knn5_quant, rmse_test = rmse_test_knn5_quant))
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06
multiple regression w/all interactions 29716.56 39941.31
best? multiple regression 31417.38 31045.39
knn w/quant , k = 5 28783.70 38973.74

It is often useful (for any model) to compare the predicted to the observed outcomes

Do this in the training set, or you run the risk of changing your model to fit the test set (otherwise, use yet another held-out set; more on this later)

sp <- tibble(predicted = predict(knn5_quant, train_knn_norm),
             observed = train_knn_norm$sale_price)
ggplot(data = sp, aes(x = predicted, y = observed)) +
  geom_point()
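One useful addition (not part of the original plot) is a 45-degree reference line, so over- and under-prediction are easy to see

ggplot(data = sp, aes(x = predicted, y = observed)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red")  # perfect-prediction line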


Try k = 20 & 50

  • NOTE use of tune_grid in \(train()\)
#k = 20
tune_grid <- expand.grid(k = 20)
knn20_quant <- train(sale_price ~ gr_liv_area + lot_area + year_built + overall_qual + garage_cars, data = train_knn_norm,
            method = "knn", trControl = trn_ctrl , tuneGrid = tune_grid)

rmse_train_knn20_quant <- RMSE(predict(knn20_quant, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn20_quant <- RMSE(predict(knn20_quant, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/quant , k = 20", rmse_train = rmse_train_knn20_quant, rmse_test = rmse_test_knn20_quant))

#k = 50
tune_grid <- expand.grid(k = 50)
knn50_quant <- train(sale_price ~ gr_liv_area + lot_area + year_built + overall_qual + garage_cars, data = train_knn_norm,
            method = "knn", trControl = trn_ctrl , tuneGrid = tune_grid)

rmse_train_knn50_quant <- RMSE(predict(knn50_quant, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn50_quant <- RMSE(predict(knn50_quant, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/quant , k = 50", rmse_train = rmse_train_knn50_quant, rmse_test = rmse_test_knn50_quant))

What is going on with train and test error across the three levels of k we considered?

  • As k increases, the models fit worse in training but better in test. This is because the lower k values are overfit to training. Some bias is introduced as k increases, but it is outweighed by the reduction in variance.
kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06
multiple regression w/all interactions 29716.56 39941.31
best? multiple regression 31417.38 31045.39
knn w/quant , k = 5 28783.70 38973.74
knn w/quant , k = 20 34056.31 35501.02
knn w/quant , k = 50 38649.48 34954.15
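Rather than fitting each k separately, we could also let caret compare several values in one call; a sketch (the particular k values here are arbitrary)

tune_grid <- crossing(k = c(5, 10, 20, 50, 100))

knn_tuned <- train(sale_price ~ gr_liv_area + lot_area + year_built + overall_qual + garage_cars, data = train_knn_norm,
            method = "knn", trControl = trn_ctrl, tuneGrid = tune_grid)

knn_tuned$bestTune  # k selected by resampling within training, not by peeking at test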

We can add the categorical variables to the model

We can use '.' because we are now using all of the predictors in train_knn_norm

#k = 5
knn5_all <- train(sale_price ~ ., data = train_knn_norm,
            method = "knn", trControl = trn_ctrl)

rmse_train_knn5_all <- RMSE(predict(knn5_all, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn5_all <- RMSE(predict(knn5_all, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/all , k = 5", rmse_train = rmse_train_knn5_all, rmse_test = rmse_test_knn5_all))

#k = 20
tune_grid <- expand.grid(k = 20)
knn20_all <- train(sale_price ~ ., data = train_knn_norm,
            method = "knn", trControl = trn_ctrl , tuneGrid = tune_grid)

rmse_train_knn20_all <- RMSE(predict(knn20_all, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn20_all <- RMSE(predict(knn20_all, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/all , k = 20", rmse_train = rmse_train_knn20_all, rmse_test = rmse_test_knn20_all))

#k = 50
tune_grid <- expand.grid(k = 50)
knn50_all <- train(sale_price ~ ., data = train_knn_norm,
            method = "knn", trControl = trn_ctrl , tuneGrid = tune_grid)

rmse_train_knn50_all <- RMSE(predict(knn50_all, train_knn_norm), train_knn_norm$sale_price)
rmse_test_knn50_all <- RMSE(predict(knn50_all, test_knn_norm), test_knn_norm$sale_price)
rmse <- bind_rows(rmse, tibble(model = "knn w/all , k = 50", rmse_train = rmse_train_knn50_all, rmse_test = rmse_test_knn50_all))

kable(rmse)
model rmse_train rmse_test
simple regression 56228.15 55871.92
multiple regression 38663.97 40789.77
multiple regression w/categorical 37700.56 39958.26
multiple regression w/focal interaction 36060.21 40673.83
multiple regression w/all quant interactions 32212.91 31747.06
multiple regression w/all interactions 29716.56 39941.31
best? multiple regression 31417.38 31045.39
knn w/quant , k = 5 28783.70 38973.74
knn w/quant , k = 20 34056.31 35501.02
knn w/quant , k = 50 38649.48 34954.15
knn w/all , k = 5 30567.56 38193.47
knn w/all , k = 20 41078.22 34963.19
knn w/all , k = 50 47942.01 37887.09

2.5 Programming Tips

caret package:

  • \(train()\)
  • \(RMSE()\)
  • \(predict()\)
  • \(dummyVars()\)
  • \(preProcess(., method = "range")\)

rsample package:

  • \(initial_split()\)
  • \(training()\)
  • \(testing()\)

Tidyverse:

  • \(select()\)
  • \(rename()\)
  • \(filter()\)
  • \(arrange()\)
  • \(mutate()\)
  • \(group_by()\)
  • \(summarise()\)