Thanks Yaroslav - visualizations are great in Mathematica, and the performance here is also very nice. It should be ported to Python, but that's above my patience; students were supposed to do it, but it's hard to count on them - in any case I would gladly collaborate ... indeed the good low-dimensional performance might no longer hold for high dimensions and nasty landscapes (?)
Regarding "3D optimizer", the demonstration contains only 2D, for 3D I only use the same type of evaluation: start with a lattice of initial positions, perform some number of steps for each, sort final values (for 0 as global minimum) - getting ~empirical distributions for various methods.
I see you worked on K-FAC, like the Google team - I think they have recently switched to Shampoo. Are there 2nd-order methods used outside research?
Here are my slides gathering various 2nd-order methods: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf . Gauss-Newton, K-FAC, and L-BFGS use a positive-definite H>0 approximation - pretending that a very nonconvex function is convex, which I am not able to trust. Also, K-FAC still uses huge Hessian models.
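To be concrete about where that H>0 comes from (the standard textbook argument, not anything specific to the slides): for a least-squares loss, Gauss-Newton drops exactly the term that can carry negative curvature:

```latex
\[
L(\theta)=\tfrac{1}{2}\sum_i r_i(\theta)^2
\quad\Rightarrow\quad
\nabla^2 L = J^\top J + \sum_i r_i\,\nabla^2 r_i
\;\approx\; J^\top J \succeq 0,
\qquad J_{ij}=\partial r_i/\partial\theta_j
\]
% the discarded sum is the only part that can contribute negative
% eigenvalues, so the quadratic model is convex even in regions
% where the true landscape is genuinely nonconvex
```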
The two Hessian approximations I use seem very practical for high dimensions: "d" - diagonal, handled as a vector (a separate parabola model in each canonical direction), and "s" - full Hessian in a locally dominating, e.g. d=10 dimensional, subspace (found e.g. by online PCA of gradients; see the sketch below).
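For the "s" variant, here is a minimal sketch of one way to maintain such a subspace by online PCA of gradients - an Oja-style update with QR re-orthonormalization (the names and the rate are placeholders, not necessarily what I use in practice):

```python
import numpy as np

def update_subspace(V, g, rate=0.01):
    """Oja-style online PCA step: nudge the d basis vectors (rows of V)
    toward the current gradient direction, then re-orthonormalize."""
    V = V + rate * np.outer(V @ g, g)            # move basis toward g
    Q, _ = np.linalg.qr(V.T)                     # re-orthonormalize columns
    return Q.T                                   # back to d x n row basis

n, d = 1000, 10                                  # parameter dim, subspace dim
V = np.linalg.qr(np.random.randn(n, d))[0].T     # random orthonormal init
g = np.random.randn(n)                           # gradient from current step
V = update_subspace(V, g)
g_sub = V @ g   # in-subspace part for the "s" Hessian model;
                # the orthogonal remainder can be handled e.g. by plain SGD
```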
My OGR approach is closest to quasi-Newton methods like L-BFGS: a Hessian model built from a sequence of gradients - with advantages: the real Hessian (no H>0 positive-definite approximation), much lower computational cost (mainly EMA updates of averages), more local (exponentially weakening weights), flexible (e.g. allowing the considered subspace to rotate slowly), and good control (working on interpretable means, variances, covariances).
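To illustrate on the diagonal "d" case: per coordinate, linear regression of gradient against position gives the curvature, with all averages maintained as EMAs. A simplified sketch (the beta, lr, and the |lambda| saddle handling are illustrative choices, not a definitive implementation):

```python
import numpy as np

class OGRDiagonal:
    """Simplified diagonal OGR sketch: per coordinate, estimate curvature
    lambda = cov(theta, g) / var(theta) from exponentially weighted averages,
    i.e. linear regression of gradient vs. position."""
    def __init__(self, n, beta=0.9, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.m_t = np.zeros(n)    # EMA of theta
        self.m_g = np.zeros(n)    # EMA of gradient
        self.m_tt = np.zeros(n)   # EMA of theta^2
        self.m_tg = np.zeros(n)   # EMA of theta * g

    def step(self, theta, g, lr=0.5):
        b = self.beta
        self.m_t = b*self.m_t + (1 - b)*theta
        self.m_g = b*self.m_g + (1 - b)*g
        self.m_tt = b*self.m_tt + (1 - b)*theta*theta
        self.m_tg = b*self.m_tg + (1 - b)*theta*g
        var = self.m_tt - self.m_t**2         # variance of positions
        cov = self.m_tg - self.m_t*self.m_g   # covariance(position, gradient)
        lam = cov / (var + self.eps)          # curvature estimate per coordinate
        # damped Newton-like step toward each parabola's extremum; dividing by
        # |lam| makes negative-curvature (saddle) directions repel, not attract
        return theta - lr * g / (np.abs(lam) + self.eps)
```

For an exact 1D parabola f = (lambda/2)(theta - p)^2 the regression slope recovers lambda and the undamped step lands exactly at p, which is what makes the means/variances/covariances directly interpretable.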