Monday, September 25, 2017

Second-Order Optimization for Non-Convex Machine Learning, FLAG, GIANT and Non-Convex Optimization Under Inexact Hessian

I like the fact that SGD hyperparameter tuning could be reduced.


Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study by Peng Xu, Farbod Roosta-Khorasani, Michael W. Mahoney
The resurgence of deep learning, as a highly effective machine learning paradigm, has brought back to life the old optimization question of non-convexity. Indeed, the challenges related to the large-scale nature of many modern machine learning applications are severely exacerbated by the inherent non-convexity in the underlying models. In this light, efficient optimization algorithms which can be effectively applied to such large-scale and non-convex learning problems are highly desired. In doing so, however, the bulk of research has been almost completely restricted to the class of 1st-order algorithms. This is despite the fact that employing the curvature information, e.g., in the form of the Hessian, can indeed help with obtaining effective methods with desirable convergence properties for non-convex problems, e.g., avoiding saddle-points and convergence to local minima. The conventional wisdom in the machine learning community is that the application of 2nd-order methods, i.e., those that employ Hessian as well as gradient information, can be highly inefficient. Consequently, 1st-order algorithms, such as stochastic gradient descent (SGD), have been at center-stage for solving such machine learning problems. Here, we aim at addressing this misconception by considering efficient and stochastic variants of Newton's method, namely, sub-sampled trust-region and cubic regularization, whose theoretical convergence properties have recently been established in [Xu 2017]. Using a variety of experiments, we empirically evaluate the performance of these methods for solving non-convex machine learning applications. In doing so, we highlight the shortcomings of 1st-order methods, e.g., high sensitivity to hyper-parameters such as step-size and undesirable behavior near saddle-points, and showcase the advantages of employing curvature information as an effective remedy.
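
To make the flavor of these sub-sampled Newton-type methods concrete, here is a minimal sketch of one sub-sampled cubic-regularization step on a toy logistic-regression finite sum. It is not the authors' code: the sub-sample size, the cubic penalty sigma, and the crude gradient-descent inner solver are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def grad(w, idx):
    # gradient of the average logistic loss over the samples in idx
    z = y[idx] * (X[idx] @ w)
    return X[idx].T @ (-y[idx] / (1.0 + np.exp(z))) / len(idx)

def sub_hessian(w, idx):
    # sub-sampled Hessian H_S built from a random subset of the data
    p = 1.0 / (1.0 + np.exp(-y[idx] * (X[idx] @ w)))
    return (X[idx].T * (p * (1.0 - p))) @ X[idx] / len(idx)

def cubic_step(g, H, sigma, inner_iters=50, lr=0.1):
    # crude inner solver: gradient descent on the cubic model
    #   m(s) = g's + 0.5 s'Hs + (sigma/3) ||s||^3
    s = np.zeros_like(g)
    for _ in range(inner_iters):
        s -= lr * (g + H @ s + sigma * np.linalg.norm(s) * s)
    return s

w, full = np.zeros(d), np.arange(n)
for it in range(10):
    S = rng.choice(n, size=200, replace=False)   # uniform Hessian sub-sample
    w += cubic_step(grad(w, full), sub_hessian(w, S), sigma=1.0)

The point of the sketch is only the structure: the gradient uses all (or many) samples, the Hessian is formed from a small random subset, and the step comes from an approximately solved cubic-regularized model rather than a hand-tuned step-size.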

GIANT: Globally Improved Approximate Newton Method for Distributed Optimization by Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, Michael W. Mahoney

For distributed computing environments, we consider the canonical machine learning problem of empirical risk minimization (ERM) with quadratic regularization, and we propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, and then it sends this direction to the main driver. The driver, then, averages all the ANT directions received from the workers to form a Globally Improved ANT (GIANT) direction. GIANT naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communication. GIANT is highly communication efficient in that, for d-dimensional data uniformly distributed across m workers, it has 4 or 6 rounds of communication and O(d log m) communication complexity per iteration. Theoretically, we show that GIANT's convergence rate is faster than those of first-order methods and existing distributed Newton-type methods. From a practical point of view, a highly beneficial feature of GIANT is that it has only one tuning parameter: the number of iterations of the local solver used to compute an ANT direction. This is in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, which have several tuning parameters and whose performance can be greatly affected by the specific choices of such parameters. In this light, we empirically demonstrate the superior performance of GIANT compared with other competing methods.


Nesterov's celebrated accelerated gradient method offers great speed-ups compared to the classical gradient descent method, as it attains the optimal first-order oracle complexity for smooth convex optimization. On the other hand, the popular AdaGrad algorithm competes with mirror descent under the best regularizer by adaptively scaling the gradient. Recently, it has been shown that accelerated gradient descent can be viewed as a linear combination of gradient descent and mirror descent steps. Here, we draw upon these ideas and present a fast linearly-coupled adaptive gradient method (FLAG) as an accelerated version of AdaGrad, and show that our algorithm can indeed offer the best of both worlds. Like Nesterov's accelerated algorithm and its proximal variant, FISTA, our method has a convergence rate of 1/T² after T iterations. Like AdaGrad, our method adaptively chooses a regularizer, in a way that performs almost as well as the best choice of regularizer in hindsight.
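
As a rough illustration of the linear-coupling idea the abstract builds on, here is a sketch that couples a plain gradient step with an AdaGrad-style adaptively scaled step. The coupling weight and step sizes below are generic choices of mine, not FLAG's actual schedule.

import numpy as np

def coupled_adaptive_gd(grad, x0, L, T=200, eps=1e-8):
    # x couples the gradient-descent iterate y with the adaptively scaled iterate z
    x, y, z = x0.copy(), x0.copy(), x0.copy()
    G = np.zeros_like(x0)                      # accumulated squared gradients (AdaGrad)
    for t in range(1, T + 1):
        tau = 2.0 / (t + 1)                    # coupling weight (illustrative schedule)
        x = tau * z + (1.0 - tau) * y
        g = grad(x)
        y = x - g / L                          # gradient-descent step
        G += g * g
        z = z - (t / (2.0 * L)) * g / (np.sqrt(G) + eps)   # adaptively scaled mirror-type step
    return y

# toy least-squares problem: minimize 0.5 * ||A x - b||^2
rng = np.random.default_rng(2)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient
x_hat = coupled_adaptive_gd(lambda x: A.T @ (A @ x - b), np.zeros(10), L)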


We consider variants of trust-region and cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under mild conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, we provide iteration complexities for achieving ε-approximate second-order optimality, which have been shown to be tight. Our Hessian approximation conditions constitute a major relaxation over the existing ones in the literature. Consequently, we are able to show that such mild conditions allow for the construction of the approximate Hessian through various random sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and cubic regularization methods.
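
To see what such sub-sampled Hessian approximations look like in the finite-sum setting, here is a small sketch comparing uniform sub-sampling with a non-uniform, importance-weighted variant. The sampling probabilities proportional to per-sample curvature times squared row norm are a common heuristic of this kind and an assumption on my part, not necessarily the paper's exact scheme.

import numpy as np

rng = np.random.default_rng(3)
n, d, s = 2000, 15, 100
X = rng.standard_normal((n, d)) * rng.exponential(1.0, size=(n, 1))  # heterogeneous rows
D = rng.uniform(0.05, 0.25, size=n)          # per-sample curvature weights

H_full = (X.T * D) @ X / n                   # exact Hessian of the finite sum

# uniform sub-sampling
idx_u = rng.choice(n, size=s, replace=False)
H_uni = (X[idx_u].T * D[idx_u]) @ X[idx_u] / s

# non-uniform sub-sampling: p_i proportional to D_i * ||x_i||^2, with importance re-weighting
p = D * np.sum(X * X, axis=1)
p /= p.sum()
idx_n = rng.choice(n, size=s, replace=True, p=p)
H_non = (X[idx_n].T * (D[idx_n] / (n * p[idx_n]))) @ X[idx_n] / s

for name, H in [("uniform", H_uni), ("non-uniform", H_non)]:
    print(name, np.linalg.norm(H - H_full, 2))   # spectral-norm approximation error

On data with very uneven row norms, the re-weighted non-uniform sample typically gives a smaller spectral-norm error for the same sample size, which is the kind of gap the paper's uniform versus non-uniform strategies are meant to exploit.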


