It is well known that the performance of large deep neural networks (DNNs) is sensitive to the setting of their hyperparameters. Hyperparameter optimization is thus recognized as a crucial step in applying DNNs to achieve the best performance and drive industrial applications. For many years, the de facto standard for hyperparameter tuning in deep learning has been a simple grid search. Recently, Bayesian optimization has been proposed for automatic hyperparameter tuning, but it can hardly tune more than 20 hyperparameters simultaneously. Furthermore, the elementary- and hyper-parameter optimization tasks are usually solved separately, so the hyperparameter optimization process, defined as the outer loop, does not make full use of the inner elementary optimization process. To address these issues, we propose effective, efficient and scalable gradient-based methods for optimizing elementary- and hyper-parameters of DNNs in a unified manner. The first is a novel approximate method, DrMAD, for obtaining gradients with respect to hyperparameters based on asymmetric reverse-mode automatic differentiation. It is 15 ∼ 45 times faster and consumes 50 ∼ 100 times less memory than the state-of-the-art methods for optimizing hyperparameters on a variety of benchmark datasets, with minimal compromise in effectiveness. Inspired by the approximate nature of DrMAD, we develop an adaptive and approximate gradient-based method for optimizing the elementary parameters of DNNs, which is more effective. We also propose an effective, efficient and scalable neural optimizer that uses a recurrent neural network (RNN) to tune the dynamic, parameter-wise hyperparameters of another DNN. The proposed neural optimizer is trained using the approximate hypergradients obtained from DrMAD.
Extensive experiments show that our approach not only outperforms the state-of-the-art neural optimizer in terms of classification accuracy of the DNN being optimized over long horizons, but also converges at least 20 times faster and consumes about 100 times less memory. To the best of our knowledge, the work described in this thesis represents the first foray into scalable gradient-based methods for elementary- and hyper-parameter optimization in DNNs in a unified manner.
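The parameter-wise neural optimizer described above can be illustrated with a toy coordinate-wise recurrent controller (a hypothetical sketch only: the scalar controller weights, the `tanh` state update, and the exponential learning-rate parameterization are my assumptions for illustration, not the thesis architecture). The same tiny RNN is applied to every coordinate of the gradient and emits a positive, parameter-wise learning rate:

```python
import numpy as np

def rnn_optimizer_step(grad, h, params, theta):
    """One step of a toy coordinate-wise recurrent controller.

    The controller weights `theta` are shared across all parameters;
    each coordinate keeps its own hidden state `h` and receives its
    own dynamic learning rate.
    """
    Wx, Wh, wo, bo = theta              # shared scalar controller weights
    x = np.tanh(grad)                   # squash raw gradients
    h = np.tanh(Wx * x + Wh * h)        # per-coordinate hidden state
    lr = np.exp(wo * h + bo)            # positive, parameter-wise learning rate
    params = params - lr * grad         # apply the update
    return params, h
```

In the thesis setting, the controller weights would themselves be trained with the approximate hypergradients from DrMAD; here they could simply be initialized to small values and a base learning rate `exp(bo)`.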
Here is an attendant preprint: DrMAD: Distilling Reverse-Mode Automatic Differentiation for Optimizing Hyperparameters of Deep Neural Networks by Jie Fu, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, Tat-Seng Chua
The performance of deep neural networks is well known to be sensitive to the setting of their hyperparameters. Recent advances in reverse-mode automatic differentiation allow for optimizing hyperparameters with gradients. The standard way of computing these gradients involves a forward and a backward pass of computations. However, the backward pass usually needs a prohibitive amount of memory to store all the intermediate variables required to exactly reverse the forward training procedure. In this work, we propose a simple but effective method, DrMAD, to distill the knowledge of the forward pass into a shortcut path, through which we approximately reverse the training trajectory. Experiments on several image benchmark datasets show that DrMAD is at least 45 times faster and consumes 100 times less memory compared to state-of-the-art methods for optimizing hyperparameters, with minimal compromise in effectiveness. To the best of our knowledge, DrMAD is the first research attempt to make it practical to automatically tune thousands of hyperparameters of deep neural networks. The code can be downloaded from this https URL
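The shortcut idea can be sketched on a toy problem (a minimal sketch, assuming plain SGD on a simple quadratic training loss with an L2 hyperparameter; the function names are mine and this is not the paper's exact algorithm). The forward pass keeps only the initial and final weights; the backward pass then approximates each intermediate iterate as a linear interpolation between those two endpoints while accumulating the hypergradient:

```python
import numpy as np

def train_sgd(w0, c, lam, lr, steps):
    """Forward pass: plain SGD on L(w) = 0.5*||w - c||^2 + 0.5*lam*||w||^2.
    Only the endpoints w0 and wT are kept (the memory saving)."""
    w = w0.copy()
    for _ in range(steps):
        grad = (w - c) + lam * w
        w = w - lr * grad
    return w  # wT

def drmad_hypergrad(w0, wT, c, lam, lr, steps, w_val):
    """Backward pass: hypergradient dL_val/dlam with L_val(w) = 0.5*||w - w_val||^2,
    using the interpolation shortcut w_t ~ (1 - b)*w0 + b*wT in place of the
    stored training iterates."""
    v = wT - w_val                           # dL_val/dw at wT
    dlam = 0.0
    for t in reversed(range(steps)):
        beta = t / steps                     # position along the shortcut path
        w_t = (1 - beta) * w0 + beta * wT    # approximated iterate
        dlam += v @ (-lr * w_t)              # direct effect of lam on this update
        v = v * (1 - lr * (1 + lam))         # back-prop through the SGD update
    return dlam
```

On this quadratic problem the accumulated hypergradient can be checked against a finite-difference estimate obtained by retraining with perturbed `lam`; the shortcut introduces an approximation error but preserves the sign and rough magnitude, which is what matters for hyperparameter updates.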
The DrMAD GitHub is here: https://github.com/bigaidream-projects/drmad