To begin, download the following from the course web book (Unit 10):
hw_unit_10_neural_nets.qmd (notebook for this assignment)
wine_quality_trn.csv and wine_quality_test.csv (data for this assignment)
The data for this week’s assignment include wine quality ratings for various white wines. The outcome is quality, a categorical outcome with two levels (“high_quality” or “low_quality”). There are 11 numeric predictors that describe attributes of the wine (e.g., acidity) that may relate to the overall quality rating.
We will be doing another competition this week. You will fit a series of neural network model configurations and select the best configuration among them. You will use the wine_quality_test.csv file only once at the end of the assignment to generate your best model’s predictions in the held-out test set (the test set will not include outcome labels). We will assess everyone’s predictions on the held-out set to determine the best fitting model in the class. The winner again gets a free lunch from John!
Note that this week’s assignment is less structured than previous assignments, allowing you to make more independent decisions about how to approach all aspects of the modeling process. Try to use the knowledge that you have developed from the work in previous assignments to explore your data, generate features, and examine different model configurations.
For this assignment, you will fit neural networks using set_engine("brulee"). The brulee package provides a tidymodels interface to models fit with the torch package. If you haven’t already installed it, run install.packages("brulee") once before starting.
Let’s get started!
Setup
Set up your notebook in this section. You will want to be sure to set your path_data and initiate parallel processing here! The brulee package should be installed (run install.packages("brulee") if needed) but does not need to be explicitly loaded — it is accessed through tidymodels/parsnip.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
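One way the setup described above might look, as a sketch (the path value and the doParallel backend are assumptions; adjust both to your machine):

```r
# Setup sketch: set path_data and register a parallel backend for tuning
library(parallel)  # base R; used here only to count cores

path_data <- "data"  # assumption: the csv files live in ./data

# Use physical cores; fall back to 1 if the count is unavailable
n_cores <- max(1, detectCores(logical = FALSE), na.rm = TRUE)

# tune_grid() will use a registered foreach backend if one exists
if (requireNamespace("doParallel", quietly = TRUE)) {
  doParallel::registerDoParallel(cores = n_cores)
}
```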
Across the model configurations you compare, you must fit models that vary with respect to:
Hidden layer architecture: Consider at least two different numbers of units within the hidden layer.
Controlling overfitting: Consider L2 regularization (using the penalty argument in mlp()) with at least two associated hyperparameter values.
Hidden layer activation function: Consider at least two activation functions for the hidden layer.
Some things to consider
About configurations
Of course you can’t choose between model configurations based on training set performance, so you’ll need to use some resampling technique. There are different costs and benefits associated with different resampling techniques (e.g., bias, variance, computing time), so you’ll need to decide which technique best fits your needs for this assignment. Specifically, you should choose among a validation split, k-fold cross-validation, or bootstrap resampling.
Resampling and tuning
Given what you’ve learned about resampling/tuning, think about how you might move systematically through model configurations rather than haphazardly changing model configuration characteristics.
Consider compute time
As you weigh computing costs, think about what that might look like given your current context. For example, imagine that each model configuration takes 2 minutes to fit. If you want to use 100 bootstraps for CV, that means ~200 minutes (just over 3 hours) per model configuration. Now imagine you’re comparing 8 different model configurations. That multiplies your 200 minutes by 8, which starts to get pretty long. If you’re using 10-fold CV, that means it only takes 20 minutes per model configuration, so you might be able to compare more configurations. A validation split would be even simpler. Think about the costs and benefits of each approach, pick one and motivate it with commentary in your submission.
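The back-of-envelope arithmetic above is worth writing out explicitly (numbers taken directly from the example in the paragraph):

```r
# Compute budget under the assumptions above: 2 min/fit, 8 configurations
minutes_per_fit <- 2
n_configs <- 8

boot_total  <- 100 * minutes_per_fit   # 100 bootstraps: 200 min per configuration
boot_all    <- boot_total * n_configs  # 1600 min (~26.7 hours) across 8 configurations
kfold_total <- 10 * minutes_per_fit    # 10-fold CV: 20 min per configuration
kfold_all   <- kfold_total * n_configs # 160 min (~2.7 hours) across 8 configurations
```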
Performance metric: accuracy
Regardless of the resampling technique you choose, please compare models using accuracy as your performance metric.
Perform some modeling EDA. How will you scale your features? Should your features be normalized, scaled, or transformed in some other way? Decorrelated? Provide an explanation for your decisions here along with as much EDA as you find necessary.
# No zero-variance predictors with this dataset, so step_zv is not critical,
# but we include it as good practice
# Check predictor distributions to decide on scaling approach
data_trn |>
  select(-quality) |>
  pivot_longer(everything(), names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~predictor, scales = "free") +
  labs(title = "Distribution of predictors")
Gradient descent works better with inputs on the same scale, and L2 regularization requires comparable variance across predictors. We will compare a few scaling strategies: range scaling (0-1), standardization (step_normalize), and standardization with Yeo-Johnson transformation to handle skewness. We will use a validation split to select the best feature engineering approach before full hyperparameter tuning.
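As a quick base-R illustration of what the first two strategies do numerically (toy vector chosen to be skewed; not part of the assignment data):

```r
x <- c(2, 4, 6, 8, 100)  # toy skewed predictor

# step_range(): min-max scaling to [0, 1]
x_range <- (x - min(x)) / (max(x) - min(x))

# step_normalize(): standardization to mean 0, sd 1
x_norm <- (x - mean(x)) / sd(x)

range(x_range)           # 0 to 1
round(mean(x_norm), 10)  # 0
sd(x_norm)               # 1
```

Note that neither transformation changes the shape of the distribution; that is what the Yeo-Johnson step in the third recipe is for.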
# Recipe 1: Range scaling (0-1)
rec_range <- recipe(quality ~ ., data = data_trn) |>
  step_range(all_predictors())

# Recipe 2: Standardization (zero mean, unit variance) -- our primary candidate
rec_norm <- recipe(quality ~ ., data = data_trn) |>
  step_normalize(all_predictors())

# Recipe 3: Standardization + Yeo-Johnson to handle skewed predictors
rec_norm_yj <- recipe(quality ~ ., data = data_trn) |>
  step_YeoJohnson(all_predictors()) |>
  step_normalize(all_predictors())
collect_metrics(fit_nnet_range)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.751 1 NA pre0_mod0_post0
collect_metrics(fit_nnet_norm)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.779 1 NA pre0_mod0_post0
collect_metrics(fit_nnet_norm_yj)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.779 1 NA pre0_mod0_post0
Standardization (rec_norm) matches the Yeo-Johnson variant (accuracy of 0.779 for both) and beats range scaling (0.751), and it is the simplest of the three. We will proceed with rec_norm for all subsequent hyperparameter tuning.
Fit models
Depending on the resampling technique you choose and how you plan to examine various model configurations, your code set-up is going to look different. Create as many code chunks as needed for whatever your approach.
Fit all models using set_engine("brulee"). The key tunable hyperparameters available through brulee are hidden_units, activation (options: "relu", "tanh", "sigmoid", "elu", "linear"), penalty (L2 regularization), and epochs. Refer to the Unit 10 lecture for example model fits.
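A minimal model specification along these lines might look as follows (the name nnet_spec and the fixed epochs value are illustrative choices, not requirements):

```r
library(tidymodels)

# Sketch of a spec that tunes the three required aspects via brulee
nnet_spec <- mlp(
  hidden_units = tune(),
  penalty = tune(),   # L2 regularization
  activation = tune(),
  epochs = 100        # held fixed here; could also be tuned
) |>
  set_engine("brulee") |>
  set_mode("classification")
```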
Where needed, please include some annotation (in text outside of code chunks and/or in commented lines within code chunks) to help us review your assignment. You don’t need to tell us everything you’re doing because that should be relatively clear from the code! A good rule of thumb is to use annotation to tell us why you’re doing something (e.g., if you had several choices for how to proceed, why did you choose the one you did?) but you don’t have to describe what you’re doing (e.g., you don’t need to tell us you’re building a recipe - we will see that in your code).
We will use 10-fold cross-validation with 3 repeats (stratified on quality) for hyperparameter tuning. This provides a lower-variance performance estimate than a single validation split at the cost of compute time. Given that the dataset is small enough to make this tractable, this is preferred over a single validation split or bootstrapping (which would be even more compute-intensive for minimal gain in bias reduction).
# 10-fold CV with 3 repeats for full hyperparameter tuning
set.seed(102030)
splits_kfold <- data_trn |>
  vfold_cv(v = 10, repeats = 3, strata = "quality")
We build a grid that covers:

- Hidden units: 5, 10, 20, 30, 40, 50, 100 (a wide range from simple to complex)
- Activation: relu and tanh (two distinct function families to meet the assignment requirement)
- Penalty: 0.0001, 0.001 (larger values can cause numerical overflow with brulee)
config_grid <- expand_grid(
  hidden_units = c(5, seq(10, 50, by = 10), 100),
  activation = c("relu", "tanh"),
  penalty = c(0.0001, 0.001)
)

nrow(config_grid)  # number of configurations
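The tuning call itself can then be a single sketch along these lines (the object names nnet_spec, rec_norm, and splits_kfold are assumed to come from earlier chunks; adjust them to your own naming):

```r
# Sketch: tune all configurations over the repeated k-fold splits,
# comparing on accuracy as required by the assignment
fits_nnet <- nnet_spec |>
  tune_grid(
    preprocessor = rec_norm,
    resamples = splits_kfold,
    grid = config_grid,
    metrics = metric_set(accuracy)
  )

show_best(fits_nnet, metric = "accuracy", n = 5)
```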
This section is for generating predictions for your best model in the held-out test set. You should only generate predictions for one model out of all configurations you examined. We will use these predictions to generate your one estimate of model performance in the held-out data.
Add your last name between the quotation marks to be used in naming your predictions file.
Run this code chunk to save out your best model’s predictions in the held-out test set. Look over your predictions to confirm your model generated valid predictions for each test observation. Make sure the file containing your predictions has the form that you think it should. This requires visually inspecting the output csv file after you write it. The glimpse() call helps too.
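A sketch of what that chunk might look like (the names final_fit, data_test, and my_last_name are assumptions standing in for objects you created earlier):

```r
# Sketch: generate class predictions from the selected model and save them
test_preds <- predict(final_fit, data_test) |>
  rename(quality = .pred_class)

test_preds |>
  write_csv(file.path(path_data, str_c("test_preds_", my_last_name, ".csv")))

glimpse(test_preds)
```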
Rows: 1,224
Columns: 1
$ quality <fct> high quality, low quality, high quality, low quality, high qua…
Save & render
Render this .qmd file with your last name at the end (e.g., hw_unit_10_neural_nets_name.qmd). Make sure you changed “Your name here” at the top of the file to be your own name. Render the file to html. Upload the rendered file and your saved test_preds_lastname.csv to Canvas.
We will assess everyone’s predictions in the held-out test set, and the winner gets a free lunch from John!