Midterm
Three approaches
Best subset
Forward stepwise
Backward stepwise
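A minimal sketch of forward stepwise selection using plain numpy (the function names and toy data are my own illustration, not from the course): at each step, greedily add the feature that most reduces training RSS.

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares from an OLS fit of y on X (with intercept).
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Greedily add the feature that most reduces training RSS.

    Within a fixed model size p, comparing training error is valid
    because all candidate additions yield models of equal complexity.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining:
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: y depends strongly on column 0, weakly on column 1, not on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(forward_stepwise(X, y))  # column 0 should enter first
```

Note that choosing how many of the selected features to keep (i.e., choosing p itself) still requires test or cross-validated error.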
Why can you select among models with a fixed p using training error? (All candidate models have the same complexity, so comparing training error is fair.)
Why do you need test (or cross-validated) error to select among models with different p? (Training error always decreases as p grows, so it would always favor the largest model.)
Advantages and disadvantages
NEED TO KNOW THESE FULL FORMULAS (AT LEAST FOR REGRESSION) and PENALTY FOR CLASSIFICATION
OLS
Ridge (l2)
LASSO (l1)
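The three cost functions written out in their standard regression forms (λ ≥ 0 is the tuning parameter; by convention the intercept β₀ is not penalized):

```latex
\text{OLS:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2

\text{Ridge:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2

\text{LASSO:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
```

For classification, the same penalty terms (λΣβ² or λΣ|β|) are added to the model's loss (e.g., the negative log-likelihood of logistic regression) in place of the sum of squared errors.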
ADVANCED: Where else have you seen L1 and L2 norms before? (Hint: you have seen them twice before.)
Euclidean distance is the L2 norm and Manhattan distance is the L1 norm.
OLS regression vs. robust regression (fit with sum of absolute errors) is another example of L2 vs. L1 norms.
L1 and L2 norms calculate the magnitude of a vector. We used magnitude to calculate the distance between two points in n-dimensional space (the norm of their difference vector). We minimized the magnitude of a vector composed of the errors in regression. Here we are calculating the magnitude of a vector composed of all the parameter estimates (to apply a penalty based on that magnitude).
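The two norms and their distance interpretations, in a quick numpy check (toy vectors; purely illustrative):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.sum(np.abs(v))        # L1 (Manhattan) norm: |3| + |-4| = 7
l2 = np.sqrt(np.sum(v ** 2))  # L2 (Euclidean) norm: sqrt(9 + 16) = 5

# Same quantities via numpy's built-in norm:
assert l1 == np.linalg.norm(v, ord=1)
assert l2 == np.linalg.norm(v, ord=2)

# Distance between two points = norm of their difference vector.
a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.linalg.norm(b - a, ord=2))  # Euclidean distance: 5.0
print(np.linalg.norm(b - a, ord=1))  # Manhattan distance: 7.0
```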
With respect to the parameter estimates:
LASSO yields sparse solution (some parameter estimates set to exactly zero)
Ridge tends to retain all features (parameter estimates don’t get set to exactly zero)
LASSO selects one feature among correlated group and sets others to zero
Ridge shrinks all parameter estimates for correlated features
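The "exactly zero" behavior comes from the L1 penalty's soft-thresholding effect. For a single standardized predictor, the LASSO estimate is a soft-thresholded version of the OLS estimate, while ridge only rescales it and never reaches exactly zero (a standard result, up to the parameterization of λ; this sketch is illustrative):

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # LASSO solution for one standardized predictor:
    # shrink toward 0 by lam, and set to exactly 0 if |b_ols| <= lam.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

def ridge_shrink(b_ols, lam):
    # Ridge solution for the same setup: proportional shrinkage, never exactly 0.
    return b_ols / (1.0 + lam)

for b in [2.0, 0.3, -0.1]:
    print(b, soft_threshold(b, lam=0.5), ridge_shrink(b, lam=0.5))
```

Small OLS estimates (|b| ≤ λ) are zeroed out by LASSO but only shrunk by ridge, which is why LASSO yields sparse solutions.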
Ridge tends to outperform LASSO with respect to prediction in new data. There are cases where LASSO can predict better (when most features have zero effect and only a few are non-zero), but even in those cases Ridge is competitive.
LASSO advantages:
  Does feature selection (sets parameter estimates to exactly 0)
  More robust to outliers (similar to LAD vs. OLS)
  Tends to do better when there are a small number of strong features and the others are close to zero or zero
Ridge advantages:
  Computationally superior (closed-form solution vs. iterative; only one solution minimizes the cost function)
  More robust to measurement error in features (remember: no measurement error is an assumption for unbiased estimates in OLS regression)
  Tends to do better when there are many features with large (and comparable) effects (i.e., most features are related to the outcome)
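The closed-form point can be shown directly: ridge has the analytic solution (XᵀX + λI)⁻¹Xᵀy, while LASSO has no closed form and must be solved iteratively (e.g., by coordinate descent). A minimal numpy sketch on centered toy data so the intercept can be ignored (data and function name are my own illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y on centered data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data: y depends on column 0 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
y -= y.mean()

print(ridge_closed_form(X, y, lam=0.0))    # lam = 0 recovers OLS (near [2, 0])
print(ridge_closed_form(X, y, lam=100.0))  # larger lam shrinks estimates toward zero
```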
Cost function
PCA
PCR
PLS
All do dimensionality reduction but NOT feature selection. Still need to measure all predictors.
