Unit 2: EDA Cleaning




January 31, 2024

Packages, source functions, conflicts

This code chunk sets up conflict policies to reduce errors associated with function conflicts.

options(conflicts.policy = "depends.ok") # deals with package conflicts
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true") # source functions
tidymodels_conflictRules() # deals with package conflicts

This code chunk loads all packages needed for this assignment.
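
The chunk itself did not survive in this copy. A minimal sketch of what it might contain, assuming the assignment relies on tidyverse (for read_csv(), select(), mutate(), and the forcats helpers) and tidymodels (for initial_split(), analysis(), and assessment()). The package list is inferred from the functions called later in this script and may differ from the one actually used:

```r
# Assumed package list, inferred from the functions used below
library(tidyverse)   # readr, dplyr, tidyr, forcats, purrr
library(tidymodels)  # includes rsample: initial_split(), analysis(), assessment()
```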


You may also use the EDA and plotting functions that John shares in his lab_support repo. You can source the scripts that contain those functions directly from GitHub with the code below (note that you may need to install the devtools package if you haven’t done this previously).


Set up some other environment settings

options(tibble.width = Inf, tibble.print_max = Inf)

Read and Setup Dataframe

Read Data

In the chunk below, set the variable path_data to the location of your data files. Make sure you have your iaml project open in RStudio. When you call here::here() it will set your root path to be inside of the iaml folder. Assuming you have a subfolder called homework and a folder within that folder called unit_02, path_data will work as set. If you have some other organization, you will need to modify path_data to reflect that folder structure.

path_data <- "homework/unit_02"
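
As a quick sanity check on the folder layout described above, here is a small sketch that builds the full path and reports whether the file is where path_data says it is. The name path_file is introduced only for this illustration:

```r
path_data <- "homework/unit_02"

# here::here() joins its arguments onto the project root
path_file <- here::here(path_data, "ames_raw_class.csv")

# TRUE only if the data file sits in the expected folder
file.exists(path_file)
```

If this returns FALSE, path_data most likely needs to be adjusted to match your own folder structure.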

This assignment will use the Ames Housing Prices Dataset (also seen in Unit 2 of the course text).

Read in the ames_raw_class.csv data below.

ames_all <- read_csv(here::here(path_data, "ames_raw_class.csv"), 
                     col_types = cols())

Select variables

We explore a different set of variables than those demoed in the course text. Use select() to keep the variables listed below and convert all variable names to snake_case:

SalePrice, Garage Area, Neighborhood, MS SubClass, Total Bsmt SF, Bsmt Qual, Central Air, TotRms AbvGrd, Fireplaces, and Fireplace Qu.

Notice the use of back-ticks in the following code chunk for non-standard variable names (i.e., names that are not syntactically valid in R because they contain a space).
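
Before the assignment's own chunk, a toy illustration of the same idea may help. The tibble toy and its values are made up for this sketch; only the column names mirror the raw Ames file:

```r
library(dplyr)

# Toy tibble with non-syntactic names like those in the raw Ames file
toy <- tibble(
  SalePrice = c(200000, 150000),
  `Garage Area` = c(480, 0),
  `MS SubClass` = c("020", "190"),
  `Lot Area` = c(9000, 7500)
)

toy_clean <- toy |>
  select(SalePrice, `Garage Area`, `MS SubClass`) |>  # back-ticks for names with spaces
  janitor::clean_names("snake")                      # convert names to snake_case

names(toy_clean)
```

Once the names are in snake_case, the back-ticks are no longer needed in later code.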

ames_all <- ames_all |> 
  select(SalePrice,
         `Garage Area`, 
         Neighborhood,
         `MS SubClass`,
         `Total Bsmt SF`, 
         `Bsmt Qual`,
         `Central Air`,
         `TotRms AbvGrd`,
         Fireplaces,
         `Fireplace Qu`) |> 
  janitor::clean_names("snake") |> 
  mutate(across(where(is.character), factor)) # class character variables as factors

Review the data dictionary

Familiarize yourself with the variables we use above by looking each one up in the data dictionary downloaded with your homework files. Reference the codebook frequently as you perform cleaning checks below.

Exploring the data for cleaning

This script should only contain EDA steps necessary for cleaning the full dataset (i.e., not subsets of it that will be allocated for train, validation, or test). Be mindful of which aspects of the data set you explore at this stage to prevent information leakage between later training and validation sets (to be saved out as the final step of cleaning in this document). Provide observations at each stage of the process, even if you do not make any changes.

Remember that you can use glimpse(), skim_some(), print(), head(), and/or kable tables to explore and display your data. Everything that you need to do has been demonstrated in the web book.

Variable classes

Confirm that all variables are read in as the expected class. Remember that we class nominal and ordinal variables as factors and interval and ratio variables as numeric. Below the code chunk, type some observations you have about the observed variable classes compared to descriptions in the data dictionary. Make any appropriate adjustments to variable class below using mutate(), factor(), or as.numeric(). We already classed character variables as factors above when selecting variables.

ames_all |> 
  glimpse()

Variable class notes: All variables are either factor or numeric. The classes as they were read in match the codebook. Although ms_sub_class contains numeric codes, it was read in as character, and the codebook tells us that these codes are nominal values representing different classes of dwellings, so this variable is best kept as a factor. We will think more carefully about the factors in the eda_modeling script.
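
No class changes were needed here, but the kind of adjustment described above can be sketched on a toy tibble. The case of fireplaces arriving as text is hypothetical and does not occur in the real data:

```r
library(dplyr)

toy <- tibble(
  ms_sub_class = c("020", "190", "020"),  # numeric-looking codes with nominal meaning
  fireplaces = c("0", "1", "2")           # hypothetical: a count read in as text
)

toy <- toy |>
  mutate(ms_sub_class = factor(ms_sub_class),  # nominal -> factor
         fireplaces = as.numeric(fireplaces))  # count -> numeric

toy |> glimpse()
```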

Missing data

Clearly document missing data (“missingness”) across each variable in the dataset. For variables with high missingness, write code that allows you to visually inspect all observations of missing data. Clean variables with high missingness using mutate(), replace_na(), and fct_relevel() if you believe any of the NAs are not really missing but instead problems with how the data were coded.

ames_all |> 
  skim_some() |> 
  select(skim_variable, n_missing, complete_rate) # view missing variables
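ames_all |> 
  filter(is.na(bsmt_qual)) # inspect all observations with missing basement quality

# Recode NAs to explicit "no_bsmt" / "no_fireplace" levels, as the codebook defines NA for these variables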
ames_all <- ames_all |> 
  mutate(bsmt_qual = fct_expand(bsmt_qual, "no_bsmt"), # add a new level to the factor
         bsmt_qual = replace_na(bsmt_qual, "no_bsmt"), # recode NA to that new level
         fireplace_qu = fct_expand(fireplace_qu, "no_fireplace"), 
         fireplace_qu = replace_na(fireplace_qu, "no_fireplace"))

# review order of levels b/c they are ordinal, no fireplace/basement is worst
ames_all$bsmt_qual |> levels()
[1] "Ex"      "Fa"      "Gd"      "TA"      "no_bsmt"
ames_all$fireplace_qu |> levels()
[1] "Ex"           "Fa"           "Gd"           "Po"           "TA"          
[6] "no_fireplace"
bq_levels <- c("no_bsmt", "Po", "Fa", "TA", "Gd", "Ex") 
ames_all <- ames_all |> 
  mutate(bsmt_qual = forcats::fct_relevel(bsmt_qual, bq_levels)) 
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `bsmt_qual = forcats::fct_relevel(bsmt_qual, bq_levels)`.
Caused by warning:
! 1 unknown level in `f`: Po
# the warning about the missing level "Po" is OK.  No basements were rated poor
ames_all$bsmt_qual |> levels()
[1] "no_bsmt" "Fa"      "TA"      "Gd"      "Ex"     
fq_levels <- c("no_fireplace", "Po", "Fa", "TA", "Gd", "Ex")
ames_all <- ames_all |> 
  dplyr::mutate(fireplace_qu = forcats::fct_relevel(fireplace_qu, fq_levels)) 

ames_all$fireplace_qu |> levels()
[1] "no_fireplace" "Po"           "Fa"           "TA"           "Gd"          
[6] "Ex"          

Missing data notes: Basement and fireplace quality both have high numbers of missing values. The codebook explicitly states that ‘NA’ for these variables represents observations that do not have basements or fireplaces (we can also check that these observations have a total basement area of 0 or 0 fireplaces, respectively). These observations may be better represented as “no basement” and “no fireplace” than as missing (though this may not be best for the ordinal nature of these variables). We can explore that further during modeling EDA.
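
The cross-check mentioned in the notes can be sketched as follows, using a made-up tibble rather than the real data. If NA truly means "no basement", every row with a missing bsmt_qual should have a basement area of 0:

```r
library(dplyr)

toy <- tibble(
  bsmt_qual = c("TA", NA, "Gd", NA),
  total_bsmt_sf = c(1440, 0, 900, 0)
)

# Tabulate basement area among rows where basement quality is missing
na_check <- toy |>
  filter(is.na(bsmt_qual)) |>
  count(total_bsmt_sf)

na_check
```

Any non-zero area in this table would suggest a genuinely missing quality rating rather than an absent basement.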

Numeric variables

Explore min and max values for numeric variables, recording notes on any observations that look suspicious or potentially invalid. Use the data dictionary and associated variables to help you decide whether suspicious observations may represent (in)valid responses.

# skim data, looking at numeric min and max values
ames_all |>
  skim_some() |> 
  filter(skim_type == "numeric") |>  # Select only numeric variables since min/max only apply to them
  select(skim_variable, numeric.p0, numeric.p100)
# 14 rooms above ground isn't impossible, but seems high. Let's take a look
ames_all |> 
  filter(tot_rms_abv_grd == 14)
sale_price  garage_area  neighborhood  ms_sub_class  total_bsmt_sf  bsmt_qual  central_air  tot_rms_abv_grd  fireplaces  fireplace_qu
2e+05       0            SWISU         190           1440           TA         Y            14               0           no_fireplace

Numeric variable notes: Numeric values appear to be in the expected range. No numeric values coded as factor made sense to convert in order to examine min/max values (e.g., ms_sub_class is nominal, so its codes are not meaningful as numbers). I thought 14 rooms above ground seemed like a lot and looked at that observation further: it has an ms_sub_class of “190”, indicating a 2-family conversion home, which makes this number more believable. No changes need to be made at this step because it’s not an obvious error. We can explore it as a possible outlier further in modeling EDA.
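
For readers without access to skim_some(), a comparable check can be sketched with dplyr alone. The tibble and its values are invented for illustration:

```r
library(dplyr)

toy <- tibble(
  tot_rms_abv_grd = c(5, 6, 14, 7),
  fireplaces = c(0, 1, 0, 2),
  neighborhood = c("swisu", "names", "swisu", "timber")
)

# Min and max for every numeric column
ranges <- toy |>
  summarise(across(where(is.numeric), list(min = min, max = max)))

# Pull the rows that sit at the maximum of a column of interest
extreme <- toy |>
  filter(tot_rms_abv_grd == max(tot_rms_abv_grd))

ranges
extreme
```

With real data that contains missing values, min() and max() would need na.rm = TRUE to return something other than NA.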

Categorical variables

Print the levels of each categorical variable. Use walk() to do all the categorical variables at once. You can use tidy_responses() (a function from John) to convert all responses to snake_case. Check to make sure all levels converted properly. If needed, correct response levels with conversion errors using mutate() and fct_recode(). Document observations for categorical levels.

# View all categorical response labels
ames_all |> 
  select(where(is.factor)) |>
  walk(\(column) print(levels(column)))
 [1] "Blmngtn" "Blueste" "BrDale"  "BrkSide" "ClearCr" "CollgCr" "Crawfor"
 [8] "Edwards" "Gilbert" "Greens"  "GrnHill" "IDOTRR"  "Landmrk" "MeadowV"
[15] "Mitchel" "NAmes"   "NoRidge" "NPkVill" "NridgHt" "NWAmes"  "OldTown"
[22] "Sawyer"  "SawyerW" "Somerst" "StoneBr" "SWISU"   "Timber"  "Veenker"
 [1] "020" "030" "040" "045" "050" "060" "070" "075" "080" "085" "090" "120"
[13] "150" "160" "180" "190"
[1] "no_bsmt" "Fa"      "TA"      "Gd"      "Ex"     
[1] "N" "Y"
[1] "no_fireplace" "Po"           "Fa"           "TA"           "Gd"          
[6] "Ex"          
# Tidy all character responses except ms_sub_class (since these are numbers and do not need to be tidied)
ames_all <- ames_all |>
  mutate(across(where(is.factor) & !all_of("ms_sub_class"), tidy_responses))

# Check response labels
ames_all |> 
  select(where(is.factor)) |>
  walk(\(column) print(levels(column)))
 [1] "blmngtn" "blueste" "brdale"  "brkside" "clearcr" "collgcr" "crawfor"
 [8] "edwards" "gilbert" "greens"  "grnhill" "idotrr"  "landmrk" "meadowv"
[15] "mitchel" "names"   "noridge" "npkvill" "nridght" "nwames"  "oldtown"
[22] "sawyer"  "sawyerw" "somerst" "stonebr" "swisu"   "timber"  "veenker"
 [1] "020" "030" "040" "045" "050" "060" "070" "075" "080" "085" "090" "120"
[13] "150" "160" "180" "190"
[1] "no_bsmt" "fa"      "ta"      "gd"      "ex"     
[1] "n" "y"
[1] "no_fireplace" "po"           "fa"           "ta"           "gd"          
[6] "ex"          

Categorical variable notes: Since tidy_responses() adds an x before numeric responses, we opted to tidy all factor responses except ms_sub_class so we would not need to correct these labels later (there is nothing to tidy when responses are completely numeric). We notice that ms_sub_class is best considered nominal, the quality variables are best considered ordinal, and neighborhood has many levels that we might consider collapsing, but these types of conversions will occur during eda_modeling.
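
No levels were recoded in this script, but the fct_recode() fix mentioned in the instructions can be sketched as below. The example assumes one wanted a more readable label for the tidied level "names" (NAmes in the raw data). The new label north_ames is hypothetical and is not used elsewhere in this assignment:

```r
library(forcats)

neighborhood <- factor(c("names", "swisu", "names", "timber"))

# fct_recode() takes new_level = "old_level" pairs
neighborhood <- fct_recode(neighborhood, north_ames = "names")

levels(neighborhood)
```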

Train test split

Now that we have completed our data cleaning, we will split our data into train and test sets and save out the cleaned files. Since John held out a separate test set from the data we were given, your split will actually create our training and validation sets. We will use his holdout set as the test data.

Generate a train/test split

Assign 25% of the data to be our validation set. Stratify this split on the sale_price variable.

splits <- ames_all |> 
  initial_split(prop = 3/4, strata = "sale_price", breaks = 4)
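
The split above is random, so it will differ from run to run unless a seed is set first. A small sketch on made-up data, assuming an arbitrary seed value chosen only for this illustration, shows the idea and how to confirm the resulting set sizes:

```r
library(rsample)

set.seed(20240131)  # arbitrary seed, an assumption for this sketch only

# Toy data: 200 rows with a numeric outcome to stratify on
toy <- data.frame(sale_price = seq(50000, 249000, by = 1000),
                  garage_area = rep(c(0, 240, 480, 576), times = 50))

toy_splits <- initial_split(toy, prop = 3/4, strata = "sale_price", breaks = 4)

toy_trn <- analysis(toy_splits)    # roughly 75% of rows
toy_val <- assessment(toy_splits)  # the remaining rows

c(train = nrow(toy_trn), validation = nrow(toy_val))
```

Stratifying on a numeric variable bins it (here into 4 groups) and samples within each bin, so the two sets end up with similar sale_price distributions.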

Save cleaned files

Save out the cleaned train and validation sets as csv files named ames_clean_class_trn.csv and ames_clean_class_val.csv.

splits |> 
  analysis() |> 
  glimpse() |> 
  write_csv(here::here(path_data, "ames_clean_class_trn.csv"))
splits |> 
  assessment() |> 
  glimpse() |> 
  write_csv(here::here(path_data, "ames_clean_class_val.csv"))
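
One thing to keep in mind when these files are read back in: a csv stores only text, so factor classes and the level order set during cleaning are not preserved. A minimal sketch with a temporary file and a made-up data frame:

```r
library(readr)

toy <- data.frame(
  bsmt_qual = factor(c("no_bsmt", "ta", "ex"),
                     levels = c("no_bsmt", "fa", "ta", "gd", "ex"))
)

tmp <- tempfile(fileext = ".csv")  # temporary file so nothing in the project is touched
write_csv(toy, tmp)

toy_back <- read_csv(tmp, col_types = cols())

class(toy$bsmt_qual)       # factor before saving
class(toy_back$bsmt_qual)  # character after reading back
```

This is why variables are re-classed and levels re-ordered after reading the cleaned files in the modeling EDA script.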
