# Why stochastic gradient descent works

The Robbins-Siegmund theorem on almost supermartingales

In 2003, while I was a PhD student, I spent three months at Stanford, visiting David Siegmund, and it was during these months I made the most important discovery for my PhD. Using the Azuma-Hoeffding inequality for (super)martingales I found a bound on some tail probabilities that ultimately led to the paper Local alignment of Markov chains. It was three wonderful months, and I have since then paid David and his wife a visit and enjoyed their outstanding hospitality whenever possible. This post is dedicated to David, who was a great inspiration, and who patiently listened and took a real interest in my work when I was a young researcher.

Until recently, I was oblivious to David's 1971 contribution, with his doctoral advisor Herbert Robbins, on stochastic learning algorithms. By introducing the notion of an *almost supermartingale*, they unified a number of convergence proofs, including that of the original stochastic approximation algorithm from 1951 by Robbins and Monro.

Stochastic learning has become a *de facto* standard for big data
learning, which can be theoretically justified by a reference to the
“Robbins-Siegmund theorem”. In this post I will explain what on earth an
*almost supermartingale* is, what the theorem says precisely, and how it can be
used to support stochastic gradient descent algorithms for learning.

## Almost supermartingales

The objective is to learn, or estimate, a parameter $\theta$ in a parameter space $\Theta \subseteq \mathbb{R}^p$ from data. Some iterative learning algorithms are deterministic. Given data and an initial parameter value, they produce a deterministic sequence of parameters that (hopefully) converges. Stochastic learning algorithms generate, on the contrary, a sequence of random variables in the parameter space. Learning theory establishes convergence properties of such stochastic algorithms, for instance via the Robbins-Siegmund theorem presented below.

For the sake of exposition, let’s first consider the case where $X_1, X_2, \ldots \in \mathbb{R}^p$ are i.i.d. with mean $\theta^* $ and covariance matrix $I$. The objective is thus to learn the mean value $\theta^* \in \mathbb{R}^p$. With $\theta_0 = 0$ we define the online updates

\begin{align} \theta_n = \frac{n-1}{n} \theta_{n-1} + \frac{1}{n} X_n.\tag{1} \end{align}

Now $\theta_n = \frac{1}{n} \sum_{k=1}^n X_k$ is simply a sample average, and we have that \begin{align} E \| \theta_n - \theta^* \|^2 & = \frac{p}{n} \rightarrow 0. \tag{2} \end{align}
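The recursion (1) and the rate in (2) are easy to check numerically. Below is a minimal NumPy sketch (the Gaussian data and all variable names are my own choices for illustration) that runs the online update and verifies by Monte Carlo that $E \| \theta_n - \theta^* \|^2$ is close to $p/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 5, 1000, 200
theta_star = rng.normal(size=p)

errors = []
for _ in range(reps):
    X = theta_star + rng.normal(size=(n, p))  # i.i.d. with mean theta*, covariance I
    theta = np.zeros(p)                       # theta_0 = 0
    for k in range(1, n + 1):
        theta = (k - 1) / k * theta + X[k - 1] / k  # recursion (1)
    errors.append(np.sum((theta - theta_star) ** 2))

# The online update reproduces the sample average of the last replicate,
# and the Monte Carlo estimate of E||theta_n - theta*||^2 is close to p/n.
print(np.mean(errors), p / n)
```
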

We are interested in strengthening the convergence to be almost sure. Let’s introduce the nonnegative function

$$V(\theta) = \|\theta - \theta^*\|^2,$$

so that the sequence $V_n = V(\theta_n)$ quantifies how far away $\theta_n$ is from $\theta^*$. With $\mathbb{F}_n = \sigma(X_1, \ldots, X_n)$ the $\sigma$-algebra generated by $X_1, \ldots, X_n$,

$$E((\theta_{n-1} - \theta^{*})^\top (X_n - \theta^{*}) \mid \mathbb{F}_{n-1}) = (\theta_{n-1} - \theta^{*})^\top E( X_n - \theta^{*}) = 0$$

since $X_n$ is independent of $X_1, \ldots, X_{n-1}$, and

\begin{align*}
E(V_n \mid \mathbb{F}_{n-1}) & = E\left( \left\| \frac{n-1}{n}(\theta_{n-1} - \theta^{*}) +
\frac{1}{n}(X_n - \theta^{*}) \right\|^2 \ \bigg| \ \mathbb{F}_{n-1} \right) \\
& = \left(\frac{n-1}{n}\right)^2 \|\theta_{n-1} - \theta^{*} \|^2 +
\frac{1}{n^2}E(\|X_n - \theta^{*}\|^2) \\
& = \left(\frac{n-1}{n}\right)^2 V_{n-1} + \frac{p}{n^2} \\
& \leq V_{n-1} + \frac{p}{n^2}.
\end{align*}

The nonnegative process $V_n$ is thus a supermartingale except for the term $p / n^2$. Were it not for this term, Doob's martingale convergence theorem would, together with (2), imply that $V_n \rightarrow 0$ almost surely. The term $p / n^2$ is small and diminishes as $n \to \infty$, so $V_n$ is *almost* a supermartingale, and it is conceivable that the conclusion of the martingale convergence theorem can be maintained.

It was the key insight of Robbins and Siegmund that we can obtain a powerful
convergence theorem for nonnegative processes that are *almost supermartingales*
in the sense that

\begin{align} E(V_n \mid \mathbb{F}_{n-1} ) \leq (1 + \beta_{n-1}) V_{n-1} + \xi_{n-1} - \zeta_{n-1} \tag{3} \end{align}

where $\beta_n, \xi_n, \zeta_n \geq 0$ are $\mathbb{F}_n$-measurable random variables.

**Theorem** (Robbins-Siegmund): Suppose $V_n$ is a nonnegative almost supermartingale. If $$\sum_{n=1}^{\infty} \beta_n < \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \xi_n < \infty \qquad \text{almost surely},$$ then there exists a nonnegative random variable $V_{\infty}$ such that $$\lim_{n \to \infty} V_n = V_{\infty} \quad \text{and} \quad \sum_{n=1}^{\infty} \zeta_n < \infty \qquad \text{almost surely}.$$

We see immediately that for $\theta_n$ defined by the recursion (1), $V_n = V(\theta_n)$ becomes an almost supermartingale with $\beta_n = 0$, $\xi_n = p / (n + 1)^2$ and $\zeta_n = 0$. Since \[ \sum_{n=1}^{\infty} \frac{p}{(n+1)^2} < \infty, \] the Robbins-Siegmund theorem gives the almost sure convergence of $V_n$ toward a limit $V_{\infty} \geq 0$. By (2) and Fatou's lemma, $E(V_{\infty}) \leq \liminf_{n} E(V_n) = 0$, so $V_{\infty} = 0$ almost surely. In other words, $\| \theta_n - \theta^*\| \to 0$ as $n \to \infty$ almost surely, that is

$$\theta_n = \frac{1}{n} \sum_{k=1}^n X_k \rightarrow \theta^*$$

almost surely. Thus the *strong law of large numbers* is a simple consequence of
the Robbins-Siegmund theorem.

## Stochastic gradient algorithms

Let’s turn our attention to the minimization, over an open parameter set $\Theta$, of the objective function $$H(\theta) = E(L(X, \theta))$$ for a loss function $L$. We assume that $\nabla H (\theta) = E( \nabla L(X, \theta))$ and that \[ G(\theta) := E(\| \nabla L(X, \theta) \|^2) \leq A + B \|\theta\|^2. \] We also assume that $\nabla H(\theta^*) = 0$ and that

\[ (\theta - \theta^* )^\top \nabla H (\theta) \geq c \| \theta - \theta^* \|^2, \]

which implies that $\theta^* $ is the unique minimizer of $H$. This last condition holds whenever $H$ is strongly convex, but we only need the inequality to hold at $\theta^* $.

The stochastic gradient algorithm is given by $$ \theta_{n} = \theta_{n - 1} - \gamma_{n-1} \nabla L(X_n, \theta_{n-1})$$ for an i.i.d. sequence $X_1, X_2, \ldots$ of random variables. The learning rate $\gamma_{n-1}$ is allowed to depend on $X_1, \ldots, X_{n-1}$ and $\theta_0, \ldots, \theta_{n-1}$. We find from this that

\begin{align}
V_n & = \|\theta_{n} - \theta^* \|^2 \\
& = \|\theta_{n-1} - \theta^* - \gamma_{n-1} \nabla L(X_n, \theta_{n-1})\|^2 \\
& = V_{n-1} + \gamma_{n-1}^2 \|\nabla L(X_n, \theta_{n-1})\|^2 - 2 \gamma_{n-1} (\theta_{n-1} - \theta^*)^\top \nabla L(X_n, \theta_{n-1}).
\end{align}

Taking conditional expectation gives

\begin{align}
E(V_n \mid \mathbb{F}_{n-1}) & = V_{n-1} + \gamma_{n-1}^2 E(\|\nabla L(X_n, \theta_{n-1})\|^2 \mid \mathbb{F}_{n-1}) \\
& \qquad - 2 \gamma_{n-1} (\theta_{n-1} - \theta^*)^\top E( \nabla L(X_n, \theta_{n-1}) \mid \mathbb{F}_{n-1}) \\
& = V_{n-1} + \gamma_{n-1}^2 G(\theta_{n-1}) - 2 \gamma_{n-1}
(\theta_{n-1} - \theta^{*})^\top \nabla H(\theta_{n-1}) \\
& \leq V_{n-1} + \gamma_{n-1}^2(A + B \|\theta_{n-1}\|^2) - 2 c \gamma_{n-1} \| \theta_{n-1} - \theta^{*}\|^2.
\end{align}

Now observe that

\begin{align*}
\|\theta_{n-1}\|^2 & = \|\theta_{n-1} - \theta^* + \theta^* \|^2 \\
& \leq \left(\|\theta_{n-1} - \theta^* \| + \|\theta^* \| \right)^2 \\
& \leq 2\|\theta_{n-1} - \theta^* \|^2 + 2 \|\theta^* \|^2 \\
& = 2 V_{n-1} + 2\|\theta^* \|^2 ,
\end{align*}

and we get

\begin{align}
E(V_n \mid \mathbb{F}_{n-1})
& \leq V_{n-1} + \gamma_{n-1}^2(A + 2 B V_{n-1} + 2 B \|\theta^* \|^2) - 2 c \gamma_{n-1} V_{n-1} \\
& = (1 + 2 B \gamma_{n-1}^2 ) V_{n-1} + \gamma_{n-1}^2(A + 2 B \|\theta^* \|^2) - 2 c \gamma_{n-1} V_{n-1}.
\end{align}

This shows that $V_n$ is an almost supermartingale with $\beta_n = 2 B \gamma_{n}^2$, $\xi_n = \gamma_{n}^2(A + 2 B \|\theta^* \|^2)$ and $\zeta_n = 2 c \gamma_{n} V_{n}$.

Using the Robbins-Siegmund theorem we can prove the following.

**Theorem** (convergence of SG algorithms): With the regularity assumptions on the loss function $L$ above, the process $V_n$ converges almost surely to a limit $V_{\infty}$ if $$\sum_{n=1}^{\infty} \gamma_n^2 < \infty \quad \text{almost surely}.$$ If also $$\sum_{n=1}^{\infty} \gamma_n = \infty \quad \text{almost surely}$$ the limit is $V_{\infty} = 0$, and $$\lim_{n \to \infty} \theta_n = \theta^* \quad \text{almost surely.}$$

**Proof:** The first part follows directly from the Robbins-Siegmund theorem.
For the second part, assume that $V_{\infty} > 0$ on a set of positive probability.
There is thus a random variable $N$ such that on that set
$V_n \geq V_{\infty}/2$ for $n \geq N$, and
$$\sum_{n=1}^{\infty} \zeta_n = 2 c \sum_{n=1}^{\infty} \gamma_n V_n \geq 2 c \sum_{n=N}^{\infty} \gamma_n V_n \geq c V_{\infty} \sum_{n=N}^{\infty} \gamma_n = \infty$$
with positive probability. This
is a contradiction with the Robbins-Siegmund theorem. We conclude
that $V_{\infty} = 0$ almost surely, thus $V_n = \| \theta_n - \theta^* \|^2 \rightarrow 0$
almost surely, or
$$\theta_n \rightarrow \theta^*$$
almost surely for $n \to \infty$.
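To connect back to the opening example: for the quadratic loss $L(x, \theta) = \tfrac{1}{2}\|x - \theta\|^2$ we have $\nabla L(x, \theta) = \theta - x$, and the choice $\gamma_{n-1} = 1/n$ turns the stochastic gradient recursion into exactly the averaging recursion (1). A minimal NumPy sketch (the helper `sgd` and the simulated Gaussian data are my own, for illustration only):

```python
import numpy as np

def sgd(grad_L, xs, theta0, gammas):
    """Iterate theta_n = theta_{n-1} - gamma_{n-1} * grad_L(x_n, theta_{n-1})."""
    theta = np.asarray(theta0, dtype=float)
    for x, gamma in zip(xs, gammas):
        theta = theta - gamma * grad_L(x, theta)
    return theta

rng = np.random.default_rng(1)
p, n = 3, 50_000
theta_star = rng.normal(size=p)
X = theta_star + rng.normal(size=(n, p))  # i.i.d. with mean theta*

# Quadratic loss: grad_L(x, theta) = theta - x; gamma_{n-1} = 1/n gives the running mean.
theta_n = sgd(lambda x, th: th - x, X, np.zeros(p), 1.0 / np.arange(1, n + 1))
print(np.linalg.norm(theta_n - theta_star))  # small: theta_n is the sample average
```
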

Clearly, a global convexity assumption on $H$ is a strong assumption, but we can expect the algorithm to behave similarly in a neighborhood of a local minimizer $\theta^* $ provided $H$ is strongly convex on that neighborhood.

The most important insight we get from the theoretical result is the conditions on the learning rate:

$$\sum_{n=1}^{\infty} \gamma_n^2 < \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \gamma_n = \infty.$$

The learning rate must tend to zero for the algorithm to converge, and the first of these conditions tells us how fast it has to do so. This condition is fulfilled for, e.g., $\gamma_n = e^{-n}$, which would ensure $$\| \theta_n - \theta^* \|^2 \rightarrow V_{\infty},$$ but since the second condition fails for this learning rate, we cannot conclude that $V_{\infty} = 0$. The second condition tells us that the learning rate must not tend to zero too fast if we want to be sure that the algorithm also reaches $\theta^*$ eventually.
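The two regimes are easy to see in simulation. The sketch below (my own illustration, again using the quadratic loss on simulated Gaussian data) runs the same stochastic gradient recursion with $\gamma_n = 1/n$, which satisfies both conditions, and with $\gamma_n = e^{-n}$, which is square-summable but not summable to infinity; the latter stalls near the initial value instead of reaching $\theta^*$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = np.array([3.0, -2.0])
n = 20_000
X = theta_star + rng.normal(size=(n, 2))  # i.i.d. with mean theta*

def run(gammas):
    theta = np.zeros(2)  # theta_0 = 0
    for x, g in zip(X, gammas):
        theta = theta - g * (theta - x)  # gradient of the quadratic loss
    return theta

steps = np.arange(1, n + 1)
theta_slow = run(1.0 / steps)     # sum gamma = inf, sum gamma^2 < inf: converges to theta*
theta_fast = run(np.exp(-steps))  # sum gamma < inf: V_n converges, but not necessarily to 0

print(np.linalg.norm(theta_slow - theta_star))  # small
print(np.linalg.norm(theta_fast - theta_star))  # stuck far from theta*, near the start 0
```
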

## Martingales everywhere

It is well known that (super)martingales are extremely useful for deriving convergence results or proving probabilistic bounds. Given a concrete probabilistic problem, it is a widely used strategy to search for an appropriate (super)martingale that will allow one to derive the desired result. It is not always easy to find such a (super)martingale. It took me some months of work during my PhD to find the martingale that made it possible to show the probabilistic bound in Theorem 5.7 in Local alignment of Markov chains. But the search for (super)martingales often pays off.

I was delighted to discover the Robbins-Siegmund theorem recently and thereby expand my repertoire of convergence theorems. Not only is the theorem itself a beautiful result proved by David when he was a young researcher, but it also extends the applicability of martingale methods to situations where you just can’t find the (super)martingale – but almost can. I hope to be able to use the theorem in my future research, and I also hope that this post will make the concept of almost supermartingales and the Robbins-Siegmund theorem more widely known.