Midterm
Three approaches
Best subset
Forward stepwise
Backward stepwise
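A minimal sketch of forward stepwise selection using plain numpy (the function names and toy data are my own illustration, not from the course): at each step, greedily add the feature that most reduces training RSS.

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares from an OLS fit of y on X (with intercept).
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Greedily add the feature that most reduces training RSS.

    Within a fixed model size p, comparing training error is valid
    because all candidate additions yield models of equal complexity.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining:
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: y depends strongly on column 0, weakly on column 1, not on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(forward_stepwise(X, y))  # column 0 should enter first
```

Note that choosing how many of the selected features to keep (i.e., choosing p itself) still requires test or cross-validated error.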
Why can you select among models with a fixed p using training error? (All candidate models have the same complexity, so comparing training error is fair.)
Why do you need test (or cross-validated) error to select among models with different p? (Training error always decreases as p grows, so it would always favor the largest model.)
Advantages and disadvantages
NEED TO KNOW THESE FULL FORMULAS (AT LEAST FOR REGRESSION) and PENALTY FOR CLASSIFICATION
OLS
Ridge (l2)
LASSO (l1)
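The three cost functions written out in their standard regression forms (λ ≥ 0 is the tuning parameter; by convention the intercept β₀ is not penalized):

```latex
\text{OLS:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2

\text{Ridge:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2

\text{LASSO:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
```

For classification, the same penalty terms (λΣβ² or λΣ|β|) are added to the model's loss (e.g., the negative log-likelihood of logistic regression) in place of the sum of squared errors.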
ADVANCED: Where else have you seen L1 and L2 norms before? (Hint: you have seen them twice before.)
Euclidean distance is the L2 norm and Manhattan distance is the L1 norm.
OLS regression vs. robust regression (fit with sum of absolute errors) is another example of L2 vs. L1 norms.
L1 and L2 norms calculate the magnitude of a vector. We used magnitude to calculate the distance between two points in n-dimensional space (the norm of their difference vector). We minimized the magnitude of a vector composed of the errors in regression. Here we are calculating the magnitude of a vector composed of all the parameter estimates (to apply a penalty based on that magnitude).
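The two norms and their distance interpretations, in a quick numpy check (toy vectors; purely illustrative):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.sum(np.abs(v))        # L1 (Manhattan) norm: |3| + |-4| = 7
l2 = np.sqrt(np.sum(v ** 2))  # L2 (Euclidean) norm: sqrt(9 + 16) = 5

# Same quantities via numpy's built-in norm:
assert l1 == np.linalg.norm(v, ord=1)
assert l2 == np.linalg.norm(v, ord=2)

# Distance between two points = norm of their difference vector.
a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.linalg.norm(b - a, ord=2))  # Euclidean distance: 5.0
print(np.linalg.norm(b - a, ord=1))  # Manhattan distance: 7.0
```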
With respect to the parameter estimates:
LASSO yields sparse solution (some parameter estimates set to exactly zero)
Ridge tends to retain all features (parameter estimates don’t get set to exactly zero)
LASSO selects one feature among correlated group and sets others to zero
Ridge shrinks all parameter estimates for correlated features
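The "exactly zero" behavior comes from the L1 penalty's soft-thresholding effect. For a single standardized predictor, the LASSO estimate is a soft-thresholded version of the OLS estimate, while ridge only rescales it and never reaches exactly zero (a standard result, up to the parameterization of λ; this sketch is illustrative):

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # LASSO solution for one standardized predictor:
    # shrink toward 0 by lam, and set to exactly 0 if |b_ols| <= lam.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

def ridge_shrink(b_ols, lam):
    # Ridge solution for the same setup: proportional shrinkage, never exactly 0.
    return b_ols / (1.0 + lam)

for b in [2.0, 0.3, -0.1]:
    print(b, soft_threshold(b, lam=0.5), ridge_shrink(b, lam=0.5))
```

Small OLS estimates (|b| ≤ λ) are zeroed out by LASSO but only shrunk by ridge, which is why LASSO yields sparse solutions.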
Ridge tends to outperform LASSO with respect to prediction in new data. There are cases where LASSO can predict better (when most features have zero effect and only a few are non-zero), but even in those cases Ridge is competitive.
LASSO advantages:
  Does feature selection (sets parameter estimates to exactly 0)
  More robust to outliers (similar to LAD vs. OLS)
  Tends to do better when there are a small number of strong features and the others are close to zero or zero
Ridge advantages:
  Computationally superior (closed-form solution vs. iterative; only one solution minimizes the cost function)
  More robust to measurement error in features (remember: no measurement error is an assumption for unbiased estimates in OLS regression)
  Tends to do better when there are many features with large (and comparable) effects (i.e., most features are related to the outcome)
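The closed-form point can be shown directly: ridge has the analytic solution (XᵀX + λI)⁻¹Xᵀy, while LASSO has no closed form and must be solved iteratively (e.g., by coordinate descent). A minimal numpy sketch on centered toy data so the intercept can be ignored (data and function name are my own illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y on centered data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data: y depends on column 0 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
y -= y.mean()

print(ridge_closed_form(X, y, lam=0.0))    # lam = 0 recovers OLS (near [2, 0])
print(ridge_closed_form(X, y, lam=100.0))  # larger lam shrinks estimates toward zero
```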
Cost function
PCA
PCR
PLS
All do dimensionality reduction but NOT feature selection. Still need to measure all predictors.
