Introduction
This post contains summaries and notes from the following papers:
Adam: A Method for Stochastic Optimization (link)
An overview of gradient descent optimization algorithms (link)
Gradient descent is a core optimization algorithm in machine learning. It minimizes an objective function f(θ) (usually f(θ) ∈ R) by updating the parameters in the opposite direction of its gradient. There are a few different gradient descent optimization algorithms, such as stochastic gradient descent (SGD), SGD with momentum, RMSProp, and Adam. As of 2025, Adam is generally considered the default choice in ML.
Gradient Descent
Vanilla gradient descent computes the gradient of the objective w.r.t. the entire training dataset and updates the parameters in a single step:
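θ = θ - γ ∇θ f(θ)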
where γ is the learning rate. However, computing the gradient over the entire dataset for every update is impractical for large datasets. In practice, we use a mini-batch of the dataset in each step instead of the entire dataset, so we have:
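θ = θ - γ ∇θ f(θ; x[i:i+n], y[i:i+n])

where x[i:i+n] and y[i:i+n] denote a mini-batch of n training examples.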
The above variation of gradient descent is called mini-batch gradient descent, where the mini-batch size n is chosen depending on the application and hardware constraints. When n equals 1, the variation is called stochastic gradient descent (SGD). In practice, the terms stochastic gradient descent and mini-batch gradient descent are often used interchangeably.
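As a minimal sketch of the loop above, assuming NumPy and a hypothetical grad(params, X_batch, y_batch) function that returns the mini-batch gradient of the objective:

```python
import numpy as np

def minibatch_sgd(params, X, y, grad, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch gradient descent loop (sketch).

    `grad(params, X_batch, y_batch)` is assumed to return the gradient of the
    objective w.r.t. `params` on the given mini-batch; `params` is a NumPy array.
    """
    n_samples = X.shape[0]
    for _ in range(epochs):
        # Shuffle indices each epoch so mini-batches are drawn randomly.
        idx = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            g = grad(params, X[batch], y[batch])
            params = params - lr * g  # step in the opposite direction of the gradient
    return params
```

Setting batch_size to 1 recovers SGD in the strict sense, while setting it to the full dataset size recovers vanilla gradient descent.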
Algorithms
SGD with Momentum
Stochastic gradient descent can get stuck in local minima or spend too much time following noisy local gradients. Stochastic gradient descent with momentum (SGD with momentum) helps escape these noisy local gradients by smoothing the gradients over time:
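m[t] = β m[t-1] + (1-β) ∇θ f(θ)
θ = θ - γ m[t]

Here β (typically around 0.9) controls how quickly the influence of past gradients decays, and m[0] = 0. Some formulations drop the (1-β) factor and fold it into the learning rate; the decaying-average form above matches the first-moment update used later in Adam.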
We update the parameters with an exponentially decaying average of past gradients, which dampens small, noisy gradients and keeps the updates focused on the overall trend.
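A minimal sketch of one momentum step in Python, assuming the gradient g for the current mini-batch has already been computed:

```python
def momentum_step(params, m, g, lr=0.01, beta=0.9):
    """One SGD-with-momentum step (sketch).

    `m` is the exponentially decaying average of past gradients and
    `g` is the current mini-batch gradient.
    """
    m = beta * m + (1.0 - beta) * g  # smooth the gradient over time
    params = params - lr * m         # update along the smoothed direction
    return params, m
```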
RMSProp
Root Mean Square Propagation (RMSProp) is an algorithm that dynamically adapts the learning rate for every parameter θ[i], unlike SGD, where we use the same learning rate γ for all parameters. This helps prevent oscillations, as gradient descent might otherwise make a big move in one direction and a small move in another.
We keep track of an exponentially decaying average of past squared gradients (the uncentered variance), then use this tracked squared gradient to scale the current update:
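v[t] = β v[t-1] + (1-β) g[t]^2
θ[t] = θ[t-1] - γ g[t] / (√v[t] + ε)

where g[t] = ∇θ f(θ[t-1]) is the current gradient, all operations are element-wise, β is typically around 0.9, and ε is a small constant for numerical stability (some write the denominator as √(v[t] + ε); the difference is negligible in practice). Parameters with consistently large gradients get their effective learning rate scaled down, while parameters with small gradients get relatively larger steps.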
Adam
Adam is essentially a combination of SGD with momentum and RMSProp. Adam also improves on both methods by introducing initialization bias correction. Adam without bias correction is:
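g[t] = ∇θ f(θ[t-1])
m[t] = β1 m[t-1] + (1-β1) g[t]
v[t] = β2 v[t-1] + (1-β2) g[t]^2
θ[t] = θ[t-1] - γ m[t] / (√v[t] + ε)

with m[0] = v[0] = 0, element-wise operations, and the paper's suggested defaults β1 = 0.9, β2 = 0.999, and ε = 10^-8.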
Since m[0] and v[0] are initialized to zero, m[t] and v[t] are biased towards zero, especially during the initial steps. The paper explains this by inspecting the expected value of v[t]. But first, we need to unroll v[t]:
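v[t] = (1-β2) · Σ_{i=1..t} β2^(t-i) · g[i]^2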
Then, we have:
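E[v[t]] = E[(1-β2) · Σ_{i=1..t} β2^(t-i) · g[i]^2]
        = E[g[t]^2] · (1-β2) · Σ_{i=1..t} β2^(t-i) + ζ
        = E[g[t]^2] · (1-β2^t) + ζ

where the last step uses the geometric series (1-β2) · Σ_{i=1..t} β2^(t-i) = 1-β2^t, and ζ absorbs the error introduced by pulling E[g[t]^2] out of the sum.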
It is not entirely clear from the paper, but we are assuming that:
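E[g[i]^2] = E[g[t]^2] for all i ≤ t,

i.e. the second moment of the gradient is stationary over the steps so far.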
In that case ζ is close to zero, which clearly exposes the bias term (1-β2^t) that we need to divide v[t] by. We can assume ζ is close to zero because β2 assigns only small weights to gradients far in the past; moreover, the bias correction mostly matters in the initial steps, since (1-β2^t) approaches 1 for large t.
Finally, Adam with bias correction is:
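m[t] = β1 m[t-1] + (1-β1) g[t]
v[t] = β2 v[t-1] + (1-β2) g[t]^2
m̂[t] = m[t] / (1-β1^t)
v̂[t] = v[t] / (1-β2^t)
θ[t] = θ[t-1] - γ m̂[t] / (√v̂[t] + ε)

The same argument applied to m[t] gives the (1-β1^t) correction for the first moment.

Putting it together, here is a minimal Adam loop in Python as a sketch, closely following the paper's pseudocode (grad_fn is a hypothetical function returning the stochastic gradient at the current parameters):

```python
import numpy as np

def adam(params, grad_fn, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal Adam loop (sketch).

    `grad_fn(params)` is assumed to return the stochastic (mini-batch)
    gradient of the objective at the current parameters.
    """
    m = np.zeros_like(params)  # first moment estimate, m[0] = 0
    v = np.zeros_like(params)  # second moment estimate, v[0] = 0
    for t in range(1, steps + 1):
        g = grad_fn(params)
        m = beta1 * m + (1 - beta1) * g      # decaying average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)         # correct the bias towards zero in m[t]
        v_hat = v / (1 - beta2 ** t)         # correct the bias towards zero in v[t]
        params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params
```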
The End
I hope you enjoyed this post.