8 Dimensionality Reduction

Complex models increase the chance of overfitting to the training sample. This leads to:

  • Poor prediction
  • Prediction models that are burdensome to implement (many predictors must be measured)
  • Low power to test hypotheses about predictor effects

Complex models are also difficult to interpret.

Complexity generally increases with:

  • Non-parametric models
  • Unconstrained effects of predictors in parametric models (e.g., allowing a very large coefficient for X5)
  • Inclusion of predictors with small or no (noise) effects
  • Increased number of predictors
  • Increased ratio of p to n

Many parametric models (linear model, generalized linear model, linear discriminant analysis):

  • Become very overfit as p approaches n
  • Cannot be fit when p >= n (X'X becomes singular, so the coefficients are not uniquely determined)
  • Yet today we often have p >> n (NLP studies, genetics, precision medicine)
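The rank-deficiency problem is easy to see numerically. A minimal sketch (with assumed synthetic data; the dimensions are illustrative) shows that when p > n, the Gram matrix X'X can have rank at most n, so the least-squares normal equations have no unique solution:

```python
import numpy as np

# Illustrative synthetic design matrix with more predictors than observations
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))

gram = X.T @ X                       # p x p matrix X'X
rank = np.linalg.matrix_rank(gram)   # rank is at most n, far below p
print(rank, "<", p)
```

Because rank(X'X) <= n < p, the p x p matrix X'X is singular and ordinary least squares cannot invert it.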

To reduce overfitting and/or allow p >> n, we need methods that can:

  • Reduce effective p (select or combine)
  • Constrain coefficients

We will consider three methods to accomplish this:

  • Subset selection (and filtering more generally)
  • Regularization (Shrinkage)
  • Dimensionality Reduction

8.1 Principal Components Regression
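The core idea can be sketched with numpy alone: replace the p predictors with the first k principal component scores of X, then regress y on those scores. This is a minimal sketch, assuming synthetic data and an illustrative choice of k; it is not a full implementation (no cross-validation to choose k, no scaling of predictors):

```python
import numpy as np

# Assumed synthetic data: only the first few predictors carry signal
rng = np.random.default_rng(1)
n, p, k = 100, 30, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.standard_normal(n)

# 1. Center X and y (principal components are defined on centered data)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# 2. Principal component loadings from the SVD of the centered X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                    # n x k matrix of component scores

# 3. Ordinary least squares on the k scores instead of all p predictors
gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)

# 4. Map the k score coefficients back to the original p predictors
beta_pcr = Vt[:k].T @ gamma
print(beta_pcr.shape)
```

Note that the components are chosen to explain variance in X only; y plays no role in constructing them. That is the key contrast with partial least squares below.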

8.2 Partial Least Squares
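Unlike PCR, partial least squares uses y when building the components: each direction is chosen to maximize covariance between the scores and the response. A minimal sketch of the standard NIPALS-style PLS1 iteration, with assumed synthetic data and an illustrative number of components k:

```python
import numpy as np

# Assumed synthetic data with two informative predictors
rng = np.random.default_rng(2)
n, p, k = 100, 30, 3
X = rng.standard_normal((n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

scores = []
Xr, yr = Xc.copy(), yc.copy()
for _ in range(k):
    w = Xr.T @ yr                 # direction maximizing covariance with y
    w /= np.linalg.norm(w)
    t = Xr @ w                    # component scores for this direction
    # Deflate: remove this component from the X and y residuals
    pvec = Xr.T @ t / (t @ t)
    Xr = Xr - np.outer(t, pvec)
    yr = yr - t * (t @ yr) / (t @ t)
    scores.append(t)

T = np.column_stack(scores)       # n x k matrix of PLS scores
coef, *_ = np.linalg.lstsq(T, yc, rcond=None)
print(T.shape)
```

Because each weight vector w is proportional to the covariance of the current X residuals with the y residuals, the resulting components are tilted toward directions that predict y, not merely directions of high variance in X.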