
SGD optimizers with online gradient regression Hessian estimation


POSTED BY: Jarek Duda
6 Replies

Nice visualizations! A short self-contained example of how you did the 3D optimizer plot would be useful to me.

On a personal note, I've spent quite a while trying to make second-order optimizers work (my latest attempt is here), and I've lost faith in black-box approaches.

The black-box approach (i.e., treating the neural network as a generic function) makes it easy to improve over SGD in low dimensions, but not in high dimensions. The issue is that high dimensionality plus mini-batching means the minibatch objective is well conditioned, so the gradient update is already of high quality.

However, I think white-box approaches are promising. In fact, one could hypothesise that much of the progress of the last 5 years came from coming up with better update rules, except these tweaks are hiding in the architecture. That is, batch-norm layers and skip connections could be seen as tweaks that make the update more efficient, and could have been implemented as optimizer tweaks rather than architecture tweaks.

POSTED BY: Yaroslav Bulatov

Thanks Yaroslav, the visualizations are great in Mathematica ... the performance here is also very nice; it should be ported to Python, but that is above my patience - students were supposed to do it, but it's hard to count on them. In any case, I would gladly collaborate ... and indeed, the low-dimensional performance might no longer hold for high dimensions and nasty landscapes (?)

Regarding "3D optimizer", the demonstration contains only 2D, for 3D I only use the same type of evaluation: start with a lattice of initial positions, perform some number of steps for each, sort final values (for 0 as global minimum) - getting ~empirical distributions for various methods.

I see you worked on K-FAC like the Google team - I think they recently switched to Shampoo; are there second-order methods used outside research? Here are my slides gathering various second-order approaches: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf . Gauss-Newton, K-FAC and L-BFGS use a positive definite H > 0 approximation - pretending that a very nonconvex function is convex, which I am not able to trust. Also, K-FAC still uses huge Hessian models.

The two Hessian approximations I use seem very practical in high dimension: "d" - the diagonal handled as a vector (a separate parabola model in each canonical direction), and "s" - the full Hessian in a locally dominating, e.g. d = 10 dimensional, subspace (e.g. from online PCA of gradients). My OGR approach is closest to quasi-Newton methods like L-BFGS: a Hessian model built from the sequence of gradients - with several advantages: the real Hessian (no positive definite H > 0 approximation), much lower computational cost (mainly EMA updates of averages), more locality (exponentially weakening weights), flexibility (e.g. allowing the considered subspace to rotate slowly), and good control (it works on interpretable means, variances and covariances).
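To make the diagonal "d" variant concrete, here is a rough per-coordinate sketch of the idea - the beta/eps constants and the raw Newton-like target below are illustrative simplifications, not the tuned update:

```
(* per-coordinate linear regression of gradient g against position theta, maintained with EMAs;
   all operations below are coordinate-wise on the parameter vectors *)
dOGRStep[{mT_, mG_, mTG_, mT2_}, theta_, g_, beta_: 0.9, eps_: 10.^-8] := Module[
  {nT, nG, nTG, nT2, lambda},
  nT = beta mT + (1 - beta) theta;              (* EMA of position *)
  nG = beta mG + (1 - beta) g;                  (* EMA of gradient *)
  nTG = beta mTG + (1 - beta) theta g;          (* EMA of theta*g *)
  nT2 = beta mT2 + (1 - beta) theta^2;          (* EMA of theta^2 *)
  lambda = (nTG - nT nG)/(nT2 - nT^2 + eps);    (* regression slope cov/var = curvature estimate *)
  {nT - nG/(Abs[lambda] + eps),                 (* Newton-like target, |lambda| against saddle attraction *)
   {nT, nG, nTG, nT2}}                          (* updated state *)
]
```

The state can be initialized e.g. as {theta0, g0, theta0 g0, theta0^2}; in practice one would approach the returned target with some damping or step-size limit rather than jump to it.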

POSTED BY: Jarek Duda

I'm not aware of second-order methods being successful for standard neural network architectures. Shampoo exists, but it's not clear that it outperforms Adam.

There's a theory that standard network architectures are already close to optimal for SGD, or they wouldn't be popular. It's an architecture/optimizer co-evolution issue.

The issue with non-convexity/saddle points has received a lot of attention - consider the ~1500 citations of "Identifying and attacking the saddle point problem". Nothing of practical value appears to have come out of this line of work. The high-dimensional setting breaks many of the common intuitions we have about optimization. Misha Belkin's "fit without fear" is a good read to get a different perspective - https://arxiv.org/abs/2105.14368 .

Things are more promising for second-order methods in domains where the neural network design is constrained by the problem. KFAC appears to be an important component of getting some chemistry-related networks to train; check out the publications at http://davidpfau.com/ for an example.

I think advanced optimizers may be important for model compression. Wide models are easy to train with SGD, but narrow + deep are hard. For good performance on small devices (phones, hearing aids), you want small models, and may eventually want to train small models from scratch. SGD works badly on tiny models, so that could be a potential way to get impact from more advanced optimizers.

POSTED BY: Yaroslav Bulatov

You have earned the Featured Contributor Badge! Your exceptional post has been selected for our editorial columns Staff Picks http://wolfr.am/StaffPicks and Publication Materials https://wolfr.am/PubMat. Your profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD

Thanks, indeed it might be difficult to get successful second-order methods for training, but maybe it is still worth searching. For example, the basic approach (cdOGR) in the diagram below turned out to be ADAM-like: it also derives a division by the mean squared gradient coordinates. But additionally there is an analogous term for position in the numerator, which is natural when looking at the shown "position vs gradient" plot - to find the g = 0 minimum, we should divide the gradient by its width in g, but also multiply by the width in theta.
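In a schematic 1D form (ignoring the EMA weighting and safeguards), the regression of gradient $g$ against position $\theta$ gives the curvature and step roughly as

$$\lambda = \frac{\operatorname{cov}(\theta,g)}{\operatorname{var}(\theta)} = \operatorname{corr}(\theta,g)\,\frac{\sigma_g}{\sigma_\theta}, \qquad \theta \leftarrow \bar\theta - \frac{\bar g}{\lambda} \approx \bar\theta - \frac{\sigma_\theta}{\sigma_g}\,\bar g \quad \text{for } \operatorname{corr}(\theta,g)\approx 1,$$

i.e. the ADAM-like division by the gradient width $\sigma_g$, with the position width $\sigma_\theta$ appearing in the numerator.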

Regarding other applications, I have thought about continual learning and meta-learning - holding, beside the actual model, also a model of its Hessian - e.g. in these diagonal/subspace approximations, to better exploit new, scarce gradients.

Regarding saddle attraction, sure, I know and cite saddle-free Newton, and use its |lambda|. However, there is this nice paper: https://arxiv.org/pdf/1902.02366 , calculating a 3M x 3M Hessian and suggesting an asymmetry - different behavior for negative curvatures. Also, the first few dominant eigenvectors rather have positive curvature - so e.g. in the subspace "s" OGR variants we can restrict to such directions.
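For illustration, the |lambda| trick in a small subspace might look like this toy sketch (assuming a symmetric d x d subspace Hessian model h and a gradient g already projected to that subspace):

```
(* Newton step with eigenvalues replaced by their absolute values, as in saddle-free Newton *)
saddleFreeStep[h_?MatrixQ, g_?VectorQ, eps_: 10.^-8] := Module[{vals, vecs},
  {vals, vecs} = Eigensystem[h];                       (* eigenvectors are the rows of vecs *)
  -Transpose[vecs] . ((vecs . g)/(Abs[vals] + eps))    (* -|H|^-1 . g in the subspace *)
]
```

Restricting to positive-curvature directions would simply mean keeping only the eigenpairs with vals > 0.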

[Image: cdOGR diagram referenced above]

POSTED BY: Jarek Duda

Remark: while the approximated 'c' variants above have the fastest convergence, that is for the noise-free situation. They assume corr(g, theta) ~ 1, but with high noise in the gradient g this correlation gets close to zero - leading to impractically small steps, e.g. for neural network training.

Therefore, for noisy-gradient situations like neural network training (noise due to the finite batch), use the variants without the 'c' approximation, such as dOGR, sOGR, dsOGR.
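Schematically, in 1D with independent gradient noise $\varepsilon$ from the finite batch:

$$\sigma_{g+\varepsilon} = \sigma_g\sqrt{1 + \sigma_\varepsilon^2/\sigma_g^2}, \qquad \frac{\operatorname{cov}(\theta, g+\varepsilon)}{\operatorname{var}(\theta)} = \frac{\operatorname{cov}(\theta, g)}{\operatorname{var}(\theta)},$$

so for high noise the 'c' curvature estimate $\sigma_{g+\varepsilon}/\sigma_\theta$ is strongly inflated (hence the impractically small steps), while the full regression slope used in dOGR/sOGR/dsOGR is, on average, unaffected by the noise.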

POSTED BY: Jarek Duda