8 Dimensionality Reduction
Complex models increase the chance of overfitting to the training sample. This leads to:
- Poor prediction
- Burdensome prediction models for implementation (need to measure lots of predictors)
- Low power to test hypotheses about predictor effects
Complex models are also difficult to interpret
Complexity generally increases with:
- Non-parametric models
- Unconstrained effects of predictors in parametric models (e.g., allowing a large coefficient for X5)
- Inclusion of predictors with small or no (noise) effects
- Increased number of predictors
- Increased ratio of p to n
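The last two points can be made concrete with a small simulation (a sketch, not from the source): fit ordinary least squares on one true signal plus an increasing number of pure-noise predictors, and watch test error grow even while all models use the same training data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40

# One real predictor; y depends only on it.
x_signal = rng.standard_normal(n)
y = x_signal + 0.5 * rng.standard_normal(n)

# Independent test set from the same process.
x_test_signal = rng.standard_normal(n)
y_test = x_test_signal + 0.5 * rng.standard_normal(n)

mses = []
for p_noise in (0, 10, 30):
    # Append p_noise predictors with no true effect.
    X = np.column_stack([x_signal, rng.standard_normal((n, p_noise))])
    X_test = np.column_stack([x_test_signal, rng.standard_normal((n, p_noise))])

    # OLS happily assigns coefficients to the noise columns.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((y_test - X_test @ beta) ** 2))
    mses.append(mse)
    print(f"noise predictors: {p_noise:2d}  test MSE: {mse:.2f}")
```

As the ratio of p to n rises, the fitted coefficients chase noise in the training sample and out-of-sample error deteriorates.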
Many parametric models (linear model, generalized linear model, linear discriminant analysis):
- Become very overfit as p approaches n
- Cannot be used when p >= n
- Yet today we often have p >> n (NLP studies, genetics, precision medicine)
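A quick illustration of why p >= n breaks the linear model (my own sketch, not from the notes): with more predictors than observations, least squares can interpolate the training data exactly even when y is pure noise, so the "fit" carries no information about new data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 40  # more predictors than observations

# Pure noise: y has no real relationship to X.
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# With p >= n the system X @ beta = y is (generically) solvable
# exactly, so training error is essentially zero.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_rmse = float(np.sqrt(np.mean((y - X @ beta) ** 2)))
print(f"training RMSE: {train_rmse:.2e}")

# Fresh data from the same noise process: the fit does not generalize.
X_new = rng.standard_normal((n, p))
y_new = rng.standard_normal(n)
test_rmse = float(np.sqrt(np.mean((y_new - X_new @ beta) ** 2)))
print(f"test RMSE: {test_rmse:.2f}")
```

Perfect training fit on noise is the extreme form of overfitting: the in-sample error is uninformative, and without selection, shrinkage, or dimension reduction the model is unusable.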
To reduce overfitting and/or allow p >> n, we need methods that can:
- Reduce effective p (select or combine)
- Constrain coefficients
We will consider three methods to accomplish this:
- Subset selection (and filtering more generally)
- Regularization (Shrinkage)
- Dimensionality Reduction