Notes on: Deep Frank-Wolfe For Neural Network Optimization

01 Dec 2018

Original Paper by: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

High Level Overview

  1. When we do standard gradient descent, we are in fact computing the closed-form solution of a proximal problem built from first-order Taylor approximations (see \ref{eq:taylor_proximal_full}). We can improve the accuracy of this approximation by pushing the Taylor approximation inside the loss function (see \ref{eq:loss_preserve}).
  2. This improved approximation can be made tractable by selecting a piecewise-linear loss (e.g., multi-class hinge, L1), and it can be shown that the cost of such updates is the same as that of SGD.
  3. Furthermore, we can formulate the problem so as to determine the optimal step-size ($\gamma$) for this proximal approximation; we do this by rewriting the primal objective as its dual (strong duality holds).
  4. We can use the Frank-Wolfe (FW) algorithm on the dual to find $\gamma$. Since we are still working with an approximation (see the $\mathcal{T}$ in \ref{eq:loss_preserve}) of the true proximal problem, a single FW iteration suffices: further iterations would only bring us closer to the solution of an approximation, which is arguably not helpful.

Introduction

The paper notes that SGD is the de facto algorithm of choice for training neural networks. The best performance often comes from a hand-tuned learning rate; however, this tuning is time-consuming and computationally expensive (i.e., it requires training and evaluating over and over again).

Adaptive gradient methods (e.g., Adam) offer improved automation, but they often lead to learnt parameters with worse generalisation performance.

Deep Frank-Wolfe (DFW) alleviates this by:

  1. Exploiting more information about the objective (by making fewer approximations) at the same computational cost
  2. Computing an optimal step-size, alleviating the need for a hand-designed schedule

Related Work

Problem Formulation

At each step, SGD solves (in closed form) the following proximal problem:

\[\begin{equation} \mathbf{w}_{t+1} = \underset{\mathbf{w}\in\mathbb{R}^p}{\arg\min} \left\{ \frac{1}{2 \eta_t} \|\mathbf{w}-\mathbf{w}_t\|^2 + \mathcal{T}_{\mathbf{w}_t}\rho(\mathbf{w}) + \mathcal{T}_{\mathbf{w}_t} [ \mathcal{L}_j(\mathbf{f}_j(\mathbf{w}))] \right\}. \label{eq:taylor_proximal_full} \end{equation}\]
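
To see why this is just SGD: both Taylor approximations above are affine in $\mathbf{w}$, so the objective is a simple quadratic, and setting its gradient to zero gives the familiar update (a quick sanity-check derivation in the same notation):

\[ \frac{1}{\eta_t}(\mathbf{w}-\mathbf{w}_t) + \partial\rho(\mathbf{w})|_{\mathbf{w}_t} + \partial\mathcal{L}_j(\mathbf{f}_j(\mathbf{w}))|_{\mathbf{w}_t} = 0 \;\; \Longrightarrow \;\; \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\left[\partial\rho(\mathbf{w})|_{\mathbf{w}_t} + \partial\mathcal{L}_j(\mathbf{f}_j(\mathbf{w}))|_{\mathbf{w}_t}\right], \]

i.e., a plain gradient step with learning rate $\eta_t$.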

We can make the approximation more faithful by moving the last Taylor approximation inside the loss function, i.e., linearising only the network output while keeping the loss exact, resulting in more accurate updates:

\[\begin{equation} \mathbf{w}_{t+1} = \underset{\mathbf{w}\in\mathbb{R}^p}{\arg\min} \left\{ \frac{1}{2 \eta_t} \|\mathbf{w}-\mathbf{w}_t\|^2 + \mathcal{T}_{\mathbf{w}_{t}}\rho(\mathbf{w}) + \mathcal{L}_{j} ( \mathcal{T}_{\mathbf{w}_t}\mathbf{f}_j(\mathbf{w}) ) \right\}. \label{eq:loss_preserve} \end{equation}\]
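
Tractability of this formulation relies on $\mathcal{L}_j$ being piecewise linear, e.g. the multi-class hinge mentioned in the overview. For concreteness, here is a minimal numpy sketch of the standard (Crammer-Singer style) multi-class hinge; I am assuming this standard form here, the paper defines its exact loss:

```python
import numpy as np

def multiclass_hinge(scores, label):
    """Crammer-Singer multi-class hinge: max_j (s_j + 1{j != label}) - s_label.

    Piecewise linear in `scores`, and hence in w once f_j has been linearised.
    """
    augmented = scores + 1.0   # add a margin of 1 to every class score...
    augmented[label] -= 1.0    # ...except the ground-truth class
    return np.max(augmented) - scores[label]

# Example: the correct class (0) wins, but only by 0.5 < 1, so the loss is 0.5.
print(multiclass_hinge(np.array([2.0, 1.5, -1.0]), label=0))
```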

Note that there is still a hyperparameter $\eta_t$; we’d still have to fix this to a certain value.

The Deep Frank-Wolfe Algorithm

It is shown in Lacoste-Julien et al. (2013) [1] that the optimal step-size can be computed directly in the dual, and they show that much of the power of this optimisation strategy comes from exactly this ability to find the optimal step-size.
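
The reason an exact line search is possible is that, for a piecewise-linear loss, the dual of the proximal problem is a quadratic program (as in the SVM case), so along any Frank-Wolfe direction it reduces to a one-dimensional quadratic in $\gamma$ that can be minimised in closed form. A generic sketch (notation mine: $d$ is the dual objective being minimised, $\boldsymbol{\alpha}_t$ the current dual variables, $\mathbf{s}$ the FW corner):

\[ g(\gamma) = d\big(\boldsymbol{\alpha}_t + \gamma(\mathbf{s} - \boldsymbol{\alpha}_t)\big) = g(0) + g'(0)\,\gamma + \tfrac{1}{2}\,g''(0)\,\gamma^2 \;\; \Longrightarrow \;\; \gamma_t = \min\left\{1, \max\left\{0, -\frac{g'(0)}{g''(0)}\right\}\right\}. \]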

This paper takes it a step further and shows that, for deep neural networks, given this optimal step-size $\gamma_t$, the parameter update resulting from \ref{eq:loss_preserve} can be written as:

\[\begin{equation} \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t[\partial\rho(\mathbf{w})|_{\mathbf{w}_t} + \gamma_t \partial\mathcal{L}_j (\mathbf{f}_j(\mathbf{w}))|_{\mathbf{w}_t}]. \end{equation}\]
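
As a minimal sketch of this update rule (assuming the two gradients and $\gamma_t$ have already been computed; the variable names are illustrative, not the paper's):

```python
import numpy as np

def dfw_update(w, grad_reg, grad_loss, eta, gamma):
    """One DFW-style parameter update (cf. the equation above).

    gamma = 1 recovers the SGD step on the full objective;
    gamma = 0 ignores the loss direction and only applies the regulariser.
    """
    return w - eta * (grad_reg + gamma * grad_loss)

# Toy usage with made-up gradients:
w = np.zeros(3)
w = dfw_update(w, grad_reg=np.array([0.0, 0.1, 0.0]),
               grad_loss=np.array([1.0, -1.0, 0.0]), eta=0.1, gamma=0.5)
```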

This requires the same ingredients as the equivalent gradient-descent update in \ref{eq:taylor_proximal_full}. It also gives an interpretation of the optimal step-size $\gamma_t$: when it is 0, we do not move along the direction of the loss' gradient, so we must be at a minimum of the piecewise-linear approximation of the neural network; conversely, when it is 1, we perform the same update as SGD would.

The paper modifies the result from Lacoste-Julien et al. (2013) [1] to show that the optimal step-size is:

\[\begin{equation} \gamma_t = \frac{(\mathbf{w}_t - \mathbf{w}_0 - \mathbf{w}_{\mathbf{s}})^\top(\mathbf{w}_t - \mathbf{w}_0) + \eta(\lambda_{\mathbf{s}}-\lambda_t)}{\|\mathbf{w}_t - \mathbf{w}_0 - \mathbf{w}_{\mathbf{s}}\|^2} \end{equation}\]

which is the solution of finding the $\gamma$ that minimises the dual objective along a 'smoothed' line-search direction $\mathbf{s}$; plugging the primal variables back in gives the expression above. Note that when we only perform a single FW iteration (i.e., $t = 1$), we obtain the following simplification:

\[\begin{equation} \gamma_t = \frac{-\eta\boldsymbol{\delta}_t^\top r_t + \mathbf{s}_t^\top \mathbf{b}}{\eta \|\boldsymbol{\delta}_t\|^2} \end{equation}\]

(This is just plug-and-chug maths, covered in the appendix.)
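
A direct transcription of that simplified formula, assuming the appendix quantities $\boldsymbol{\delta}_t$, $r_t$, $\mathbf{s}_t$ and $\mathbf{b}$ are available as arrays, and assuming the result is clipped to $[0, 1]$ (the feasible range for a FW step-size):

```python
import numpy as np

def optimal_step_size(delta_t, r_t, s_t, b, eta):
    """Closed-form step-size for a single FW iteration (see the equation above)."""
    numerator = -eta * (delta_t @ r_t) + s_t @ b
    denominator = eta * (delta_t @ delta_t)
    return float(np.clip(numerator / denominator, 0.0, 1.0))
```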

To further improve convergence, this paper uses two methods:

  1. Smoothing (the details of the direction are given in A.6; essentially, we exploit the cross-entropy direction in the primal (since the hinge loss yields sparse updates [2]) whenever it offers an improvement in the objective function)
  2. Nesterov momentum (i.e., reusing some of the update direction from the previous step; a rough sketch follows below)
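
As an illustration of the momentum point only (the paper's exact incorporation is described in its appendix), here is one common formulation of Nesterov momentum wrapped around the DFW direction from the earlier sketch; this is an assumption on my part, not the paper's algorithm:

```python
def dfw_step_with_momentum(w, velocity, grad_reg, grad_loss, eta, gamma, mu=0.9):
    """Generic Nesterov-style momentum around the DFW direction (illustrative only)."""
    direction = grad_reg + gamma * grad_loss     # same direction as dfw_update above
    velocity = mu * velocity - eta * direction   # blend in previous update directions
    w = w + mu * velocity - eta * direction      # Nesterov look-ahead step
    return w, velocity
```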

Experiments

The paper first presents results on the following state-of-the-art (SotA) neural network architectures:

  1. Wide Residual Networks (WRN)
  2. Densely Connected Convolutional Networks (DCN)

These are then trained on CIFAR-10 and CIFAR-100. Overall, the DFW algorithm almost always outperforms the adaptive learning-rate algorithms (in some cases by a large margin), and in the one case where it doesn't, the difference in performance is negligible. Furthermore, in 50% of cases it outperforms hand-tuned SGD (i.e., the gold standard).

The next results are on an NLP task involving sentiment classification using RNNs. Again DFW performs best here, albeit by a much tighter margin (±0.2%).

Importance of the Step-Size

Two main points here:

  1. High learning rates, when using approximate methods, offer implicit regularisation, which aids generalisation. This explains why DFW and SGD work well, since their initial learning rates are usually high.
  2. DFW doesn't do as well as SGD with higher values of $\eta$, as its automatically computed step-size may fail to decay when the local approximation is poor. This is alleviated by increasing the batch size, which makes the proximal approximation more accurate.

Conclusion

  1. Lacoste-Julien et al. (2013). Block-Coordinate Frank-Wolfe Optimization for Structural SVMs.

  2. Berrada et al. (2018). Smooth Loss Functions for Deep Top-k Classification.