Why FTRL Is Better than Online Mirror Descent


Many online learning algorithms come in two flavors: you can either choose the online mirror descent (OMD) version or the follow-the-regularized-leader (FTRL) = dual-averaging version. In practice people usually go for OMD, which makes sense from a theoretical point of view because it can easily adapt to the size of the gradients, i.e. it is Lipschitz-adaptive. This comes with a big downside, however, because OMD may perform (very!) poorly when your intermediate iterates happen to wander away from the optimal parameters for a while, as may easily happen with non-stationary data. FTRL, on the other hand, is much more robust to non-stationarity, but it is not Lipschitz-adaptive by default according to its standard analysis. What should we do? It turns out there are actually two easy fixes to get the best of both worlds, which both build off of FTRL: the first fix is to use a refinement of the standard analysis due to Orabona and Pál to show that FTRL can be made Lipschitz-adaptive after all; and the second fix is to combine FTRL with a technique that we have started calling Cutkosky clipping in my group, after its inventor Ashok Cutkosky. Read on for an overview of the two approaches, which are both worth having in your theoretical toolbox.

Setting: Online Convex Optimization

We consider the following online learning setting, which is called \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\argmin}{arg\,min} \DeclareMathOperator{\diameter}{diameter} \DeclareMathOperator{\proj}{Proj} \newcommand{\u}{\boldsymbol u} \newcommand{\v}{\boldsymbol v} \newcommand{\w}{\boldsymbol w} \newcommand{\grad}{\boldsymbol g} \newcommand{\clipgrad}{\bar \grad} \newcommand{\tf}{\tilde f} \newcommand{\tR}{\tilde R} \newcommand{\domain}{\mathcal{W}} \newcommand{\reals}{\mathbb{R}} \newcommand{\Dmax}{D_\text{max}} \newcommand{\Gmax}{G_\text{max}} \newcommand{\Done}{D_1}\) Online Convex Optimization (OCO): Each round \(t=1,\ldots,T\), a learner predicts a point \(\w_t\) from a convex domain \(\domain \subset \reals^d\) with diameter \(\diameter(\domain) \leq D\). The environment then selects a convex loss function \(f_t : \domain \to \reals\) and the learner incurs loss \(f_t(\w_t)\). Most online learning algorithms update their predictions efficiently, using only the gradient \(\grad_t = \nabla f_t(\w_t)\), and the objective is for the learner to control the regret

\[R_T(\u) = \sum_{t=1}^T \big(f_t(\w_t) - f_t(\u)\big)\]

for all \(\u \in \domain\) simultaneously.
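
To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The `predict`/`update` learner interface and the callable gradients are illustrative assumptions on my part, not part of the formal setting:

```python
def run_oco(learner, grads):
    """Play the OCO protocol: each round t the learner predicts w_t and
    then receives only the gradient g_t = grad f_t(w_t) as feedback.

    grads[t] is a callable chosen by the environment (possibly in
    response to the learner's past predictions)."""
    iterates = []
    for grad_f_t in grads:
        w_t = learner.predict()
        iterates.append(w_t)
        learner.update(grad_f_t(w_t))  # gradient-only feedback
    return iterates

def regret(iterates, losses, u):
    """R_T(u) = sum over t of f_t(w_t) - f_t(u), for a fixed comparator u."""
    return sum(f(w) - f(u) for f, w in zip(losses, iterates))
```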

Two Flavors of Online Gradient Descent

For simplicity, I will focus on online gradient descent (OGD), but the same discussion equally applies to other instances of OMD and FTRL. I am also considering here the version of FTRL with linearized losses, which makes it line up well with OMD. Given a starting point \(\w_1 = 0\) and a sequence of stepsizes \(\eta_1,\eta_2,\ldots\), the two versions of OGD are defined as follows:

OMD Version: \(\hspace{3em} \begin{align*} \w_{t+1} &= \argmin_{\w \in \domain} \w^\top \grad_t + \frac{1}{2\eta_t} \|\w - \w_t\|^2\\ &= \proj(\w_t - \eta_t \grad_t) \end{align*}\)

FTRL Version: \(\hspace{5em} \begin{align*} \w_{t+1} &= \argmin_{\w \in \domain} \sum_{s=1}^t \w^\top \grad_s + \frac{1}{2\eta_t} \|\w - \w_1\|^2\\ &= \proj(\w_1 - \eta_t \sum_{s=1}^t \grad_s), \end{align*}\)

where \(\proj(\tilde \w) = \argmin_{\w \in \domain} \|\w - \tilde \w\|\) is the Euclidean projection of \(\tilde \w\) onto the allowed domain. (All norms in this post are standard Euclidean norms.)

The difference between the two versions is that OMD updates from the previous iterate \(\w_t\) using only the latest gradient \(\grad_t\), whereas FTRL updates from the starting point \(\w_1\) using all the gradients. Computationally, both versions are equally efficient, because you end up storing a sum of gradients in both cases.

In fact, if the step size \(\eta_t = \eta\) is constant and there are no projections, then the two methods even coincide! A constant step size is highly suboptimal in practice, however, where we always want to use decreasing step sizes. To illustrate this, suppose we set \(\eta \leq \text{constant}/\sqrt{T}\); then the algorithm would essentially be stuck at its starting point \(\w_1\) for roughly the first \(\sqrt{T}\) rounds. But if we set \(\eta \gg 1/\sqrt{T}\), then the iterates can become too unstable for large \(t\), and suffer large regret as a result.
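
For concreteness, here is a minimal numpy sketch of both flavors over an L2-ball domain. The ball, the class name, and the `stepsize` callable are my own illustrative choices; any of the schedules discussed below can be plugged in:

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto the centered L2 ball of the given
    radius, standing in for a general convex domain W."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

class OGD:
    """Online gradient descent with linearized losses, in two flavors:
      'omd':  w_{t+1} = Proj( w_t - eta_t * g_t )
      'ftrl': w_{t+1} = Proj( w_1 - eta_t * sum_{s<=t} g_s )
    stepsize(g) should return eta_t after observing g = g_t."""

    def __init__(self, dim, radius, stepsize, flavor="ftrl"):
        self.w1 = np.zeros(dim)        # starting point w_1 = 0
        self.w = self.w1.copy()
        self.grad_sum = np.zeros(dim)  # both flavors store one vector
        self.radius = radius
        self.stepsize = stepsize
        self.flavor = flavor

    def predict(self):
        return self.w

    def update(self, g):
        self.grad_sum += g
        eta = self.stepsize(g)
        if self.flavor == "omd":       # step away from the previous iterate
            self.w = project_ball(self.w - eta * g, self.radius)
        else:                          # FTRL: step away from w_1 every round
            self.w = project_ball(self.w1 - eta * self.grad_sum, self.radius)
```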

Regret Bounds

On to regret bounds! The regret of OGD has been extensively analyzed in the literature. For instance, from the proof of Theorem 1 in Zinkevich’s well-known paper it can be seen that the OMD version satisfies the following guarantee:

Theorem 1 (OMD Version): For any non-increasing step sizes \(\eta_1\geq \eta_2 \geq \ldots\), the OMD version of OGD guarantees regret at most

\[\begin{align}\label{eqn:mdbound} R_T(\u) &\leq \frac{\|\w_1 - \u\|^2}{2\eta_1} + \Big(\frac{1}{2\eta_T} - \frac{1}{2\eta_1}\Big) \max_{t \in \{2,\ldots,T\}}\|\w_t - \u\|^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\grad_t\|^2\\ &\leq \frac{1}{2\eta_T} \max_{t \in \{1,\ldots,T\}}\|\w_t - \u\|^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\grad_t\|^2.\label{eqn:mdbound2} \end{align}\]

Consequently, if we set \(\eta_t = \frac{\Dmax}{\sqrt{\sum_{s=1}^t \|\grad_s\|^2}}\) for \(\Dmax \geq \max_{t \in \{1,\ldots,T\}} \|\w_t - \u\|\), then

\[R_T(\u) \leq \frac{3}{2} \Dmax \sqrt{\sum_{t=1}^T \|\grad_t\|^2} \leq \frac{3}{2} \Dmax \Gmax \sqrt{T},\]

where \(\Gmax = \max_{t \in \{1,\ldots,T\}} \|\grad_t\|\).
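
In code, the schedule from Theorem 1 only needs a running sum of squared gradient norms. A sketch, in a plug-in format matching the `stepsize` slot of the OGD sketch above, and treating \(\Dmax\) as known (even though, as discussed below, it depends on the algorithm's own iterates):

```python
import numpy as np

class OMDStepSize:
    """eta_t = D_max / sqrt( sum_{s<=t} ||g_s||^2 ). Computable for OMD
    because eta_t is only applied after g_t has been observed."""

    def __init__(self, D_max):
        self.D_max = D_max
        self.sq_sum = 0.0

    def __call__(self, g):
        self.sq_sum += float(np.dot(g, g))
        if self.sq_sum == 0.0:  # no nonzero gradient observed yet
            return 0.0
        return self.D_max / np.sqrt(self.sq_sum)
```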

A good reference to look up the standard analysis of FTRL is Corollary 7.9 in Orabona’s very nice introduction to online learning, which specializes as follows to OGD:

Theorem 2 (FTRL Version): For any non-increasing step sizes \(\eta_1\geq \eta_2 \geq \ldots\), the FTRL version of OGD guarantees regret at most

\[\begin{equation}\label{eqn:ftrlbound} R_T(\u) \leq \frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T \eta_{t-1} \|\grad_t\|^2. \end{equation}\]

Consequently, if we set \(\eta_t = \frac{\Done}{\sqrt{\Gmax^2 + \sum_{s=1}^t \|\grad_s\|^2}}\) for \(\Done \geq \|\w_1 - \u\|\) with \(\Gmax\) as in Theorem 1, then

\[R_T(\u) \leq \frac{3}{2} \Done \sqrt{\Gmax^2 + \sum_{t=1}^{T-1} \|\grad_t\|^2} \leq \frac{3}{2} \Done \Gmax \sqrt{T}.\]
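
The corresponding schedule for Theorem 2, in the same plug-in format, makes the problem visible in code: the Lipschitz bound \(\Gmax\) enters as a constructor argument that has to be guessed in advance. A sketch:

```python
import numpy as np

class FTRLStepSize:
    """eta_t = D_1 / sqrt( G_max^2 + sum_{s<=t} ||g_s||^2 ). Note that
    G_max, an upper bound on all future gradient norms, must be
    supplied up front: this is the hyper-parameter discussed below."""

    def __init__(self, D_1, G_max):
        self.D_1 = D_1
        self.G_max_sq = float(G_max) ** 2
        self.sq_sum = 0.0

    def __call__(self, g):
        self.sq_sum += float(np.dot(g, g))
        return self.D_1 / np.sqrt(self.G_max_sq + self.sq_sum)
```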

Two Big Differences

Comparing \(\eqref{eqn:mdbound}\) and \(\eqref{eqn:ftrlbound}\), we see that the bounds coincide when \(\eta_t = \eta\) is a constant, but they start to differ significantly in the common case that the step sizes are decreasing. The two main differences between Theorems 1 and 2 are as follows:

  1. Advantage of FTRL over OMD: \(\Done \ll \Dmax\)

Comparing the two theorems, we see that the guarantee for OMD depends on \(\Dmax\) whereas the guarantee for FTRL only depends on \(\Done \leq \Dmax\). This matters, because, after tuning the step sizes, \(\Done\) and \(\Dmax\) end up as multiplicative factors in front of the regret bounds, so they have a big impact on the regret. To see that the difference can be significant, note that, if the domain \(\domain\) is large compared to \(\Done\) (or even unbounded) and the data are non-stationary, then the intermediate iterates can move all over the place and consequently we will have that \(\Dmax \gg \Done\), implying that OMD is much worse than FTRL. Moreover, we can directly influence \(\Done\) by changing the starting position \(\w_1\) of the algorithm, but \(\Dmax\) is not easy to control, because it depends on all the iterates of the algorithm, which are heavily dependent on the data. Now we might still hope that the difference between \(\Done\) and \(\Dmax\) is just an artifact of the bounds and that things won’t be as bad in practice, but this hope turns out to be vain, because Orabona and Pál (Theorem 3) construct a counterexample where the difference is real.

  2. Advantage of OMD over FTRL: Lipschitz-adaptivity

There is another difference between Theorems 1 and 2, which is so subtle that it is easily overlooked when reading the literature: compared to \(\eqref{eqn:mdbound}\), the step sizes in the sum \(\sum_{t=1}^T \eta_{t-1} \|\grad_t\|^2\) in \(\eqref{eqn:ftrlbound}\) are one time step behind.¹ At first sight, this may seem like a minor difference, but it is in fact crucial if we do not know the sizes of our gradients in advance (and in practice we never do!). To see this, note that for OMD the sum can be controlled by setting

\[\eta_t \propto \frac{1}{\sqrt{\sum_{s=1}^t \|\grad_s\|^2}}.\]

But getting the same control in FTRL would require

\[\eta_t \propto \frac{1}{\sqrt{\sum_{s=1}^{t+1} \|\grad_s\|^2}},\]

which is impossible because \(\|\grad_{t+1}\|\) is unknown at the time of choosing \(\eta_t\). The solution shown in Theorem 2 is to fall back on

\[\eta_t \propto \frac{1}{\sqrt{\Gmax^2 + \sum_{s=1}^{t} \|\grad_s\|^2}},\]

which replaces \(\|\grad_{t+1}\|\) by the upper bound \(\Gmax\). But this is not really a solution at all, because it leaves us stuck with \(\Gmax\) as a hyper-parameter that we need to set. Now a seasoned machine learning practitioner may not easily be intimidated by having to tune yet another hyper-parameter, but \(\Gmax\) is particularly nasty. Indeed, \(\Gmax\) is a moving target, because changing its estimate changes the iterates \(\w_t\) of the algorithm, which in turn changes the gradients \(\grad_t\) that it encounters, which changes \(\Gmax\)!

Solutions

Are we then to live with the choice between a nasty dependence on \(\Dmax\) (for OMD) or an awful hyper-parameter \(\Gmax\) (for FTRL)? Luckily not, and there are even two possible ways out! In line with the title of this post, both of these build on FTRL, by resolving the hyper-parameter tuning issue.

Solution 1: Improve the Analysis of FTRL

The first solution, which is perhaps the most comforting, is due to Orabona and Pál. It turns out that there is really nothing wrong with FTRL per se, but the problem arises only in the analysis. As shown in the proof of Orabona and Pál’s Theorem 1, it is possible to refine \(\eqref{eqn:ftrlbound}\) as follows:

\[\begin{align} R_T(\u) &\leq \frac{\|\w_1 - \u\|^2}{2\eta_T} + \frac{1}{2} \sum_{t=1}^T \min\Big\{\eta_{t-1} \|\grad_t\|^2, \eta_t \|\grad_t\|^2 + 2\Dmax^* \|\grad_t\|\Big\},\notag\\ &\leq \frac{\|\w_1 - \u\|^2}{2\eta_T} + \frac{1}{2} \sum_{t=1}^T \eta_t \|\grad_t\|^2 + \frac{1}{2} \sum_{t=1}^T \min\Big\{\eta_{t-1} \|\grad_t\|^2, 2\Dmax^* \|\grad_t\|\Big\}, \label{eqn:opbound} \end{align}\]

where \(\Dmax^* = \max_{\v,\w \in \domain} \|\v - \w\|\) is the diameter of the domain. The effect of the new minimum in the bound is that the cost of not including \(\|\grad_{t+1}\|\) in \(\eta_t\) is mitigated, so we can get away with step sizes

\[\eta_t = \frac{\Done}{\sqrt{\sum_{s=1}^t \|\grad_s\|^2}}.\]

Plugging this into \(\eqref{eqn:opbound}\), a standard summation analogous to the one used in Theorem 1 leads to

\[R_T(\u) \leq \frac{3}{2} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2} + \frac{1}{2} \sum_{t=1}^T \min\Big\{\eta_{t-1} \|\grad_t\|^2, 2\Dmax^* \|\grad_t\|\Big\}.\]

To handle the remaining sum, Orabona and Pál provide the following useful inequality (their Lemma 3):

Lemma 3: Let \(C, a_1, a_2, \ldots, a_T \geq 0\). Then

\[\sum_{t=1}^T \min \Big\{\frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, C a_t\Big\} \leq 3.5 \sqrt{\sum_{t=1}^T a_t^2} + 3.5 C \max_{t \in \{1,\ldots,T\}} a_t.\]
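
The constants in Lemma 3 are easy to sanity-check numerically. A quick, non-rigorous random test, using the convention that the first term of the minimum is \(+\infty\) when its denominator is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
for trial in range(1000):
    T = int(rng.integers(1, 50))
    C = float(rng.uniform(0, 10))
    a = rng.uniform(0, 10, size=T)
    lhs = 0.0
    for t in range(T):
        prev = float(np.sum(a[:t] ** 2))
        # min{ a_t^2 / sqrt(sum_{s<t} a_s^2), C * a_t }; +inf at t = 1
        first = a[t] ** 2 / np.sqrt(prev) if prev > 0 else np.inf
        lhs += min(first, C * a[t])
    rhs = 3.5 * np.sqrt(np.sum(a ** 2)) + 3.5 * C * a.max()
    assert lhs <= rhs + 1e-9, (trial, lhs, rhs)
print("Lemma 3 held on all random trials")
```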

Applying this with \(a_t = \|\grad_t\|\) and \(C = 2\Dmax^*/\Done\), we find that

\[\begin{align} R_T(\u) &\leq \frac{3}{2} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2} + \frac{\Done}{2} \Big( 3.5 \sqrt{\sum_{t=1}^T \|\grad_t\|^2} + 7 \frac{\Dmax^*}{\Done} \Gmax \Big)\notag\\ &= \frac{13}{4} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2} + \frac{7}{2} \Dmax^* \Gmax.\label{eqn:refinedbound} \end{align}\]

This is great, because we tuned the step sizes without prior knowledge of the gradient lengths, so we got Lipschitz-adaptivity, and the leading part of the bound also scales with \(\Done\) as desired! The fact that \(\Dmax^*\) does appear in the constant term at the end is also fine as long as our domain is not exceedingly large.² The only downside is that our leading constant has increased by slightly more than a factor of \(2\), from \(3/2\) to \(13/4\). Are we being too greedy if we complain about such a minor remaining imperfection? Certainly not, because even this increase can be avoided using the second solution presented below!
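
Algorithmically, Solution 1 changes nothing about FTRL itself; only the schedule changes, dropping \(\Gmax\). A sketch in the same plug-in format as before (the class name is mine):

```python
import numpy as np

class OrabonaPalStepSize:
    """eta_t = D_1 / sqrt( sum_{s<=t} ||g_s||^2 ): no G_max needed. It is
    the refined analysis, not a new algorithm, that justifies this."""

    def __init__(self, D_1):
        self.D_1 = D_1
        self.sq_sum = 0.0

    def __call__(self, g):
        self.sq_sum += float(np.dot(g, g))
        if self.sq_sum == 0.0:  # no nonzero gradient observed yet
            return 0.0
        return self.D_1 / np.sqrt(self.sq_sum)
```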

Solution 2: Cutkosky Clipping

The second solution does not require any improvement to the analysis of Theorem 2. Instead, we will combine \(\eqref{eqn:ftrlbound}\) with a technique to clip the observed gradients, which we call Cutkosky clipping. I first learned about it from Ashok Cutkosky on the bus at COLT 2018, and he later published it in a paper about combining Lipschitz adaptivity with unbounded domains. The technique is amazingly simple and at the same time super useful to have in your toolbox, and I remember being blown away that such a simple idea had not been discovered before.

To explain the approach, let \(G_t = \max_{s \leq t} \|\grad_s\|\) with boundary case \(G_0 = 0\), and define the clipped gradients

\[\clipgrad_t = \frac{G_{t-1}}{G_t} \grad_t = \begin{cases} \grad_t & \text{if $\|\grad_t\| \leq G_{t-1}$}\\ G_{t-1} \frac{\grad_t}{\|\grad_t\|} & \text{otherwise.} \end{cases}\]

These are just the original gradients, but with their lengths truncated to the previous largest observed length \(G_{t-1}\). The result is that, for \(\clipgrad_t\), we know one time-step in advance what its maximum length is going to be.
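
In code, the clipping is a one-liner plus bookkeeping for the running maximum. A sketch (the function name is mine):

```python
import numpy as np

def cutkosky_clip(g, G_prev):
    """Truncate g to the largest previously observed norm G_prev.
    Returns gbar_t = (G_{t-1} / G_t) * g_t together with the updated
    running maximum G_t = max(G_{t-1}, ||g_t||)."""
    G_t = max(G_prev, float(np.linalg.norm(g)))
    if G_t == 0.0:  # nothing observed yet and g = 0: avoid 0/0
        return g, G_t
    return (G_prev / G_t) * g, G_t
```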

How does this help us? Well, many online learning algorithms, including the FTRL version of OGD, are based on linear approximations \(\tf_t(\w) = \w^\top \grad_t\) of the losses, which enter the analysis via the inequality

\[f_t(\w_t) - f_t(\u) \leq \tf_t(\w_t) - \tf_t(\u) \qquad \text{for any convex $f_t$}.\]

Summing over \(t\), it follows that

\[R_T(\u) \leq \tR_T(\u) := \sum_{t=1}^T \big(\tf_t(\w_t) - \tf_t(\u)\big),\]

and all the bounds from Theorems 1 and 2 are also upper bounds on \(\tR_T(\u)\). Moreover, these bounds do not make any assumptions about where the vectors \(\grad_t\) are coming from, as long as the same vectors are used consistently in the algorithms and their analysis. But this means that we can also apply the algorithms and bounds to the clipped gradients \(\clipgrad_t\). Keeping track of the approximation error when replacing \(\grad_t\) by \(\clipgrad_t\), inequality \(\eqref{eqn:ftrlbound}\) for the FTRL version of OGD then gives us

\[\begin{align*} R_T(\u) &\leq \tR_T(\u) = \sum_{t=1}^T \Big(\w_t^\top \clipgrad_t - \u^\top \clipgrad_t\Big) + \sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t)\\ &\leq \frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T \eta_{t-1} \|\clipgrad_t\|^2 + \sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t). \end{align*}\]

Since the lengths of the clipped gradients satisfy \(\|\clipgrad_t\| \leq G_{t-1}\), we can tune the step sizes as

\[\begin{equation}\label{eqn:clipstep} \eta_t = \frac{D_1}{\sqrt{G_t^2 + \sum_{s=1}^t \|\clipgrad_s\|^2}} \end{equation}\]

to get

\[\frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T \eta_{t-1} \|\clipgrad_t\|^2 \leq \frac{3}{2} D_1 \sqrt{G_{T-1}^2 + \sum_{t=1}^{T-1} \|\clipgrad_t\|^2} \leq \frac{3}{2} D_1 \Gmax \sqrt{T}.\]

It remains to control the overhead introduced by clipping, which can be done as follows:

\[\begin{align*} \sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t) &= \sum_{t=1}^T (\w_t - \u)^\top \grad_t \Big(1 - \frac{G_{t-1}}{G_t}\Big)\\ &\leq \Dmax \sum_{t=1}^T \|\grad_t\| \Big(1 - \frac{G_{t-1}}{G_t}\Big) \leq \Dmax \sum_{t=1}^T G_t \Big(1 - \frac{G_{t-1}}{G_t}\Big)\\ &= \Dmax \sum_{t=1}^T \Big(G_t - G_{t-1}\Big) = \Dmax \Gmax. \end{align*}\]

Putting everything together, we find that:

Theorem 4: Running the FTRL version of OGD on the clipped gradients \(\clipgrad_t\) with step sizes as in \(\eqref{eqn:clipstep}\) guarantees regret at most

\[R_T(\u) \leq \frac{3}{2} D_1 \Gmax \sqrt{T} + \Dmax \Gmax.\]

This is the same as \(\eqref{eqn:refinedbound}\), but now even the constant factors turn out as nice as we could hope for!
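
Putting everything together in code, here is a minimal end-to-end sketch of the algorithm behind Theorem 4: the FTRL version of OGD run on clipped gradients with the step sizes from \(\eqref{eqn:clipstep}\). The L2-ball domain and the interface are illustrative choices, not part of the theorem:

```python
import numpy as np

class ClippedFTRL:
    """FTRL version of OGD on Cutkosky-clipped gradients, with
    eta_t = D_1 / sqrt( G_t^2 + sum_{s<=t} ||gbar_s||^2 )."""

    def __init__(self, dim, radius, D_1):
        self.w1 = np.zeros(dim)        # starting point w_1 = 0
        self.w = self.w1.copy()
        self.radius = radius           # domain = L2 ball (illustrative)
        self.D_1 = D_1
        self.G = 0.0                   # running max gradient norm G_t
        self.grad_sum = np.zeros(dim)  # sum of clipped gradients
        self.sq_sum = 0.0              # sum of ||gbar_s||^2

    def predict(self):
        return self.w

    def update(self, g):
        G_prev = self.G
        self.G = max(self.G, float(np.linalg.norm(g)))
        if self.G == 0.0:              # all gradients so far are zero
            return
        gbar = (G_prev / self.G) * g   # clip to the previous maximum
        self.grad_sum += gbar
        self.sq_sum += float(np.dot(gbar, gbar))
        eta = self.D_1 / np.sqrt(self.G ** 2 + self.sq_sum)
        w = self.w1 - eta * self.grad_sum
        norm = float(np.linalg.norm(w))  # Euclidean projection onto the ball
        self.w = w if norm <= self.radius else (self.radius / norm) * w
```

This learner drops directly into the `run_oco` protocol loop sketched at the start of the post.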

Conclusion

We have seen that the FTRL version of OGD is better than the OMD version, because its regret depends only on the distance \(\Done\) of the first iterate from the optimal parameters, as opposed to the maximum distance over all the iterates \(\Dmax\) for OMD. Comparing the standard regret bounds for FTRL and OMD suggests that OMD might still have the advantage in adapting to the sizes of the gradients, but it turns out that this can be remedied for FTRL, either by improving the standard analysis (Solution 1) or by combining the standard analysis with Cutkosky clipping of the gradients (Solution 2).

Skeletons in the Closet

There is one topic that I have completely glossed over so far, which is how to adapt to the parameter \(\Done\) that still appears in the tuning of the step sizes. This is actually not a straightforward issue, but through a long series of works it is by now quite well understood. The upshot and a selection of key references are as follows:

That’s it. Thanks for reading!

  1. The difference between \(\eta_t\) and \(\eta_{t-1}\) is extra difficult to spot in the literature, because different notational conventions sometimes shift the subscript \(t\) on \(\eta_t\) by one. 

  2. Orabona and Pál even provide a refined analysis for unbounded domains, which replaces the dependence on \(\Dmax^*\) by a term of order \(\sqrt{T}\).