Kline Tower, 13th Floor, Rm. 1327
Webcast Option: https://yale.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=dca8dff4-f…
Abstract: A key puzzle in deep learning is how simple gradient methods find generalizable solutions without explicit regularization. This talk discusses the implicit regularization of gradient descent (GD) through the lens of statistical dominance. Using least squares as a clean proxy, we present two surprising findings.
First, GD dominates ridge regression. For any well-specified Gaussian least squares problem, the finite-sample excess risk of optimally stopped GD is no more than a constant times that of optimally tuned ridge regression. However, there is a natural subset of these problems where GD achieves a polynomially smaller excess risk. Thus, implicit regularization is statistically superior to explicit regularization, in addition to its computational advantages.
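To make the first comparison concrete, here is a minimal, purely illustrative sketch (not the speakers' analysis or construction): it simulates a well-specified Gaussian least squares problem and compares the excess risk of gradient descent, with the stopping time chosen in hindsight, against ridge regression tuned over a grid of regularization strengths. The problem sizes, noise level, step size, and grid below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 50, 0.5           # samples, dimension, noise level (assumed)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))          # isotropic Gaussian design
y = X @ w_star + sigma * rng.normal(size=n)

def excess_risk(w):
    # For isotropic Gaussian covariates, excess risk equals ||w - w_star||^2.
    return float(np.sum((w - w_star) ** 2))

# Gradient descent on the empirical squared loss; "optimal stopping" is
# approximated by the best iterate along the path in hindsight.
eta = n / np.linalg.norm(X, ord=2) ** 2   # step size 1/L for the empirical loss
w = np.zeros(d)
best_gd = excess_risk(w)
for _ in range(2000):
    w -= eta * X.T @ (X @ w - y) / n
    best_gd = min(best_gd, excess_risk(w))

# Ridge regression; "optimal tuning" is approximated by the best lambda on a grid.
best_ridge = min(
    excess_risk(np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n))
    for lam in np.logspace(-4, 2, 50)
)

print(f"best early-stopped GD excess risk: {best_gd:.4f}")
print(f"best ridge excess risk:            {best_ridge:.4f}")
```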
Second, GD and online stochastic gradient descent (SGD) are incomparable. We construct a sequence of well-specified Gaussian least squares problems where optimally stopped GD is polynomially worse than online SGD, and, conversely, a sequence where online SGD is polynomially worse than optimally stopped GD. Our construction leverages a key insight from benign overfitting, revealing a fundamental separation between batch and online learning.
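In the same illustrative spirit (again under assumed problem sizes and step-size schedules, not the separating construction from the talk), the sketch below contrasts one-pass online SGD with batch GD using best-iterate stopping on the same sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 20, 0.5           # samples, dimension, noise level (assumed)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)

def excess_risk(w):
    return float(np.sum((w - w_star) ** 2))

# Online SGD: each example is processed exactly once, in order.
w_sgd = np.zeros(d)
for t in range(n):
    step = 1.0 / (d + t + 1)                      # decaying step size (assumed schedule)
    w_sgd -= step * (X[t] @ w_sgd - y[t]) * X[t]  # single-sample gradient step

# Batch GD on the full sample with best-iterate-in-hindsight stopping.
eta = n / np.linalg.norm(X, ord=2) ** 2
w_gd, best_gd = np.zeros(d), np.inf
for _ in range(2000):
    w_gd -= eta * X.T @ (X @ w_gd - y) / n
    best_gd = min(best_gd, excess_risk(w_gd))

print(f"one-pass online SGD excess risk:   {excess_risk(w_sgd):.4f}")
print(f"best early-stopped GD excess risk: {best_gd:.4f}")
```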
This is joint work with Peter Bartlett, Sham Kakade, Jason Lee, and Bin Yu.
3:30pm - Pre-talk meet-and-greet teatime - 219 Prospect Street, 13th floor; light snacks and beverages will be available in the kitchen area. For more details and upcoming events, visit our website at https://statistics.yale.edu/calendar.