Adam has four hyperparameters:
- Learning rate (\(\alpha\)) — The step size applied to each parameter update. Controls how far the optimizer moves in the direction of the gradient at each step. Typically set around 0.001. Too high and training is unstable; too low and it’s slow.
- \(\beta_1\) — The exponential decay rate for the first moment estimate (the running average of the gradients themselves). Controls how much momentum carries forward from previous gradients. Default is 0.9, meaning each update keeps 90% of the running average and mixes in 10% of the new gradient.
- \(\beta_2\) — The exponential decay rate for the second moment estimate (the running average of the squared gradients). Controls how quickly the optimizer adapts its step size per parameter based on recent gradient magnitudes. Default is 0.999; such a high value means the per-parameter scaling changes slowly.
- \(\epsilon\) — A small constant added to the denominator to prevent division by zero when the second moment estimate is near zero. Default is around \(10^{-8}\). Rarely tuned in practice.
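To see how the four hyperparameters interact, here is a minimal sketch of a single Adam update step in R. This is illustrative code, not the brulee or torch implementation; the function and argument names are invented for the example.

```r
# One Adam update step (illustrative, not the brulee/torch implementation).
# theta: current parameter values; grad: gradient of the loss at theta;
# m, v: running first and second moment estimates; t: step counter (1, 2, ...).
adam_step <- function(theta, grad, m, v, t,
                      alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
  m <- beta1 * m + (1 - beta1) * grad      # first moment: running average of gradients
  v <- beta2 * v + (1 - beta2) * grad^2    # second moment: running average of squared gradients
  m_hat <- m / (1 - beta1^t)               # bias corrections for the early steps,
  v_hat <- v / (1 - beta2^t)               # when m and v are still near their zero start
  theta <- theta - alpha * m_hat / (sqrt(v_hat) + eps)   # per-parameter scaled step
  list(theta = theta, m = m, v = v)
}
```

Note that \(\epsilon\) only matters when the second moment estimate is close to zero, which is why it is rarely worth tuning.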
https://github.com/tidymodels/brulee/blob/main/R/mlp-fit.R
| Name | Test accuracy |
|---|---|
| Wang | 0.78 |
| Tao | 0.77 |
| Gu | 0.77 |
| Bublitz | 0.77 |
| Huffaker | 0.77 |
| Lilin | 0.77 |
| Dang | 0.76 |
| Pratt | 0.76 |
| Sanford | 0.76 |
| Zhou | 0.76 |
| Gurney | 0.76 |
| Hosangadi | 0.76 |
| Jeong | 0.76 |
| Oh | 0.76 |
| Lipeiqi | 0.76 |
| Lee | 0.76 |
| Vanvleet | 0.76 |
| Mergendahl | 0.76 |
| Burleson | 0.75 |
| Duffy | 0.75 |
| Johnston | 0.75 |
| Odhiambo | 0.75 |
| Kountakis | 0.74 |
| Pan | 0.74 |
| Sweat | 0.66 |
The Algorithm
For a feature of interest \(X_s\) and all other features \(X_c\) (“complement”):
1. Choose a grid of values \(v\) for \(X_s\).
2. For each grid value \(v\), set \(X_s = v\) for every observation, leaving \(X_c\) at its observed values.
3. Predict with the model on this modified data and average the predictions.
4. Plot the average prediction against \(v\); this curve is the partial dependence plot (PDP).
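As a rough illustration, the recipe above can be written in a few lines of R. This is a sketch, not the pdp or DALEX package API; `fit` is assumed to be any model with a numeric `predict()` method, `df` is the training data, and all names are invented for the example.

```r
# Partial dependence of the model's prediction on one feature (illustrative).
partial_dependence <- function(fit, df, feature, grid = NULL) {
  # default grid: evenly spaced quantiles of the observed feature values
  if (is.null(grid)) {
    grid <- unique(quantile(df[[feature]], probs = seq(0.05, 0.95, by = 0.05)))
  }
  pd <- sapply(grid, function(v) {
    df_forced <- df
    df_forced[[feature]] <- v          # force X_s = v for *every* observation
    mean(predict(fit, df_forced))      # average prediction at this grid value
  })
  data.frame(value = grid, pd = pd)
}
```

Plotting `pd` against `value` gives the partial dependence curve.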
What problems would this approach have?
The Intuition

ALE plots solve the correlated-feature problem in PDPs by changing what question they ask. Instead of “what happens when I force \(X_s\) to equal \(v\) across all observations?”, ALE asks: “among observations where \(X_s\) is already near \(v\), how does a small change in \(X_s\) affect predictions?”
The Algorithm

For the feature of interest \(X_s\):
1. Split the range of \(X_s\) into bins with edges \(z_0 < z_1 < \dots < z_K\) (typically quantiles of the observed values).
2. For each bin \(k\), take only the observations whose actual value of \(X_s\) falls between \(z_{k-1}\) and \(z_k\).
3. For those observations, predict once with \(X_s\) set to \(z_k\) and once with \(X_s\) set to \(z_{k-1}\), keeping \(X_c\) at its observed values, then average the differences. This is the local effect in bin \(k\).
4. Accumulate (sum) the local effects across bins, then center the curve so it averages to zero.
Why This Fixes the Correlation Problem

The key move is only using observations that actually live near \(v\) when estimating the effect at \(v\). You’re never forcing a feature into a region it doesn’t naturally occupy in combination with the other features. The small nudge from \(z_{k-1}\) to \(z_k\) stays within the realistic joint distribution of the data.
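A rough R sketch of this bin-and-nudge idea follows. It is a simplified illustration, not the ALEPlot or iml implementation (those handle categorical features and use a count-weighted centering); `fit`, `df`, `feature`, and `n_bins` are invented names, and `fit` is assumed to have a numeric `predict()` method.

```r
# First-order ALE for one numeric feature (simplified illustration).
ale_sketch <- function(fit, df, feature, n_bins = 10) {
  x <- df[[feature]]
  # bin edges z_0 < z_1 < ... < z_K taken from quantiles of the observed values
  z <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  K <- length(z) - 1
  bin <- cut(x, breaks = z, include.lowest = TRUE, labels = FALSE)

  local_effect <- numeric(K)
  for (k in seq_len(K)) {
    rows <- which(bin == k)                 # only observations already in this bin
    if (length(rows) == 0) next
    lo <- hi <- df[rows, , drop = FALSE]
    lo[[feature]] <- z[k]                   # nudge down to the lower edge (z_{k-1} in the text)
    hi[[feature]] <- z[k + 1]               # nudge up to the upper edge (z_k in the text)
    local_effect[k] <- mean(predict(fit, hi) - predict(fit, lo))
  }

  ale <- cumsum(local_effect)               # accumulate local effects across bins
  data.frame(z = z[-1], ale = ale - mean(ale))   # simple (unweighted) centering
}
```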

Model-specific vs. model-agnostic
The Algorithm

For a single observation and a single feature \(j\):
1. Consider every subset \(S\) of the other features (all features except \(j\)).
2. For each subset, get the model’s prediction using the features in \(S\) alone, and again using \(S\) plus feature \(j\). The difference is the marginal contribution of \(j\) given \(S\).
3. Average these marginal contributions over all subsets, with weights that account for how many orderings of the features produce each subset.

The weighted average is the Shapley value of feature \(j\) for that observation.
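In symbols, the standard definition is the weighted average below, where \(p\) is the number of features and \(\hat{f}(S)\) denotes the prediction made using only the features in \(S\) (the same notation as the “missing features” discussion further down):

\[
\phi_j \;=\; \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!}\left[\hat{f}\big(S \cup \{j\}\big) - \hat{f}(S)\right]
\]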
A Concrete Example

Say you have 3 features: age, income, education. To get the Shapley value for income for one specific person:
1. List the subsets of the other features: \(\{\}\), \(\{\text{age}\}\), \(\{\text{education}\}\), and \(\{\text{age, education}\}\).
2. For each subset, compare the prediction with income added to the prediction without it.
3. Take the weighted average of those four marginal contributions. That weighted average is income’s Shapley value for this person.
The “Missing Features” Problem

When evaluating \(\hat{f}(S)\) — the prediction when some features are left out — the model still needs values for those missing features. The standard approach is to marginalize over the training data: fill in the missing features with values drawn from the training set and average over many such draws. This is what makes Shapley values expensive.
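Because enumerating every subset quickly becomes intractable, the Shapley value is usually estimated by sampling rather than computed exactly. Below is a rough R sketch of the common permutation-sampling estimator; it is an illustration, not the fastshap or iml API. `fit` is assumed to have a numeric `predict()` method, `train` holds only the predictor columns, and `x_obs` is the single row being explained (all names are invented).

```r
# Monte Carlo Shapley estimate for one observation and one feature (illustrative).
shapley_sketch <- function(fit, train, x_obs, feature, n_samples = 200) {
  contribs <- numeric(n_samples)
  for (i in seq_len(n_samples)) {
    perm <- sample(names(train))                           # random ordering of all features
    z    <- train[sample(nrow(train), 1), , drop = FALSE]  # random background row
    before <- perm[seq_len(which(perm == feature) - 1)]    # features preceding j in the ordering

    # "with j": features before j, and j itself, come from x_obs; the rest from z
    x_with <- z
    x_with[[feature]] <- x_obs[[feature]]
    x_without <- z                                         # "without j": j keeps the background value
    if (length(before) > 0) {
      x_with[before]    <- x_obs[before]
      x_without[before] <- x_obs[before]
    }
    contribs[i] <- predict(fit, x_with) - predict(fit, x_without)
  }
  mean(contribs)   # average marginal contribution, an estimate of the Shapley value
}
```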
Get into small groups of 2-3. Think about one of the data sets we have worked with in class, or one of your own! Create a research question you might ask using each of the following explanatory methods:
- Partial dependence plots (PDP)
- Accumulated local effects (ALE) plots
- Shapley values
