Hello! This is my first blog post ever!! I’ll write about Adam and AdamW. It’s always good to go back to the basics and brush up on what’s happening under the hood :), so let’s get started.

Background: Adam Optimizer Overview

Adam (Adaptive Moment Estimation) is a popular stochastic optimizer introduced by Kingma and Ba (2014). It combines ideas from momentum and RMSProp to adapt the learning rate for each parameter. Mathematically, Adam maintains an exponentially decaying average of past gradients (first moment) and of past squared gradients (second moment). At each step $t$, for each parameter $\theta$, Adam updates these estimates as:

  • First moment (momentum): $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$,
  • Second moment (RMS): $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$,

where $g_t = \nabla_{\theta} f_t(\theta_{t-1})$ is the current gradient, and $\beta_1,\beta_2$ are decay rates (e.g. $0.9$ and $0.999$ by default). To correct the initialization bias, bias-corrected estimates are computed: $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

The final update rule for Adam is given by: $$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$ where $\alpha$ is the learning rate and $\epsilon$ is a small constant for numerical stability.
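
To see what the bias correction buys us at the very first step ($t=1$), plug the definitions above into the update (a small worked example): $$\hat{m}_1 = \frac{(1-\beta_1)g_1}{1-\beta_1} = g_1, \qquad \hat{v}_1 = \frac{(1-\beta_2)g_1^2}{1-\beta_2} = g_1^2,$$ so the first update is $\theta_1 = \theta_0 - \alpha\, g_1/(|g_1| + \epsilon) \approx \theta_0 - \alpha\,\operatorname{sign}(g_1)$: each coordinate moves by roughly one learning rate, regardless of the gradient’s raw scale.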

This adaptive rule results in per-parameter step sizes inversely proportional to the root mean square of recent gradients, making Adam invariant to diagonal rescaling of the gradients and well suited to problems with noisy or sparse gradients. Adam’s hyperparameters ($\alpha, \beta_1, \beta_2$) have intuitive interpretations and typically require little tuning. Empirically, Adam often converges faster than stochastic gradient descent (SGD) and has been a default choice for training deep networks.

Weight Decay vs. L2 Regularization in Adam

Weight decay is a regularization method that penalizes large weights by multiplying the weights by a factor slightly less than one at each update, effectively “decaying” the weight magnitude over time. In classical SGD, applying weight decay at each step is equivalent to adding an L2 penalty $\frac{\lambda}{2}\|\theta\|^2$ to the loss function (with $\lambda$ appropriately scaled by the learning rate). In other words, for standard (non-adaptive) SGD, weight decay and L2 regularization are mathematically equivalent when using a constant learning rate.
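
As a quick sanity check of this equivalence, here is a minimal NumPy sketch (the function names sgd_weight_decay_step and sgd_l2_step are made up for this post) comparing one SGD step with multiplicative weight decay against one SGD step on the L2-regularized gradient:

import numpy as np

def sgd_weight_decay_step(theta, grad, lr, wd):
    # "true" weight decay: shrink the weights, then take the usual gradient step
    return (1 - lr * wd) * theta - lr * grad

def sgd_l2_step(theta, grad, lr, wd):
    # L2 regularization: add wd * theta to the gradient, then take a plain SGD step
    return theta - lr * (grad + wd * theta)

theta = np.array([1.0, -2.0])
grad = np.array([0.3, 0.1])
print(sgd_weight_decay_step(theta, grad, lr=0.1, wd=0.01))  # [ 0.969 -2.008]
print(sgd_l2_step(theta, grad, lr=0.1, wd=0.01))            # same result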

However, for adaptive optimizers like Adam, L2 regularization and weight decay are not equivalent. In most implementations, “weight decay” has been applied by adding an L2 penalty term to the loss or gradient. This means the Adam update effectively incorporates the regularization gradient into $g_t$. For example, if using L2 regularization, one would add $\lambda\theta_{t-1}$ to the gradient:

$$ g_t^{(\mathrm{L} 2)}=\nabla_\theta f_t\left(\theta_{t-1}\right)+\lambda \theta_{t-1} $$

This approach couples the regularization with Adam’s adaptive update: the penalty term $\lambda \theta$ gets scaled by the same factor $\frac{\alpha}{\sqrt{\hat{v}_t}+\epsilon}$ as the data gradient. Consequently, the adaptive learning rates modulate the effect of weight decay in a complex way. As Loshchilov and Hutter (2017) point out, calling this $L_2$ penalty “weight decay” (as many implementations do) is misleading, because the coupling means it is not equivalent to true weight decay. The practical implication is that the optimal weight decay (L2) strength in Adam is entangled with the learning rate and gradient history, making it harder to tune and sometimes hindering convergence.
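
To make the coupling explicit, substituting $g_t^{(\mathrm{L}2)}$ into Adam’s update means the penalty enters the moment estimates themselves: $$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\big(g_t + \lambda\theta_{t-1}\big), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\big(g_t + \lambda\theta_{t-1}\big)^2, \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}. $$ As a result, parameters with large historical gradient magnitudes (large $\hat{v}_t$) are decayed by a smaller relative amount than parameters with small gradients, even though both use the same $\lambda$.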

AdamW: Decoupled Weight Decay

AdamW (Adam with decoupled weight decay) is a modification of Adam proposed by Loshchilov and Hutter (2017) to address the above issue. The key idea is to decouple the weight decay step from the gradient-based update. Instead of adding $\lambda \theta$ into the gradient (which affects the moment estimates), AdamW applies weight decay directly to the weights after the Adam update. Mathematically, the AdamW update can be written in two steps:

  • Gradient step (Adam part): $\hat{\theta}_t = \theta_{t-1} - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ (using the gradient of the loss only, with no $\lambda\theta$ term inside).
  • Weight decay step: $\theta_t = \hat{\theta}_t - \alpha\lambda\theta_{t-1}$.

Combining these, the one-step update is often expressed as:

$$ \theta_t = \theta_{t-1} - \alpha \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1}\right) $$

where the $\lambda \theta_{t-1}$ term sits outside the gradient-driven part. This decoupled formulation means the weight decay term $\lambda\theta_{t-1}$ is not subject to the adaptive scaling by $\hat{v}_t$. In other words, AdamW applies a fixed proportional shrinkage to the weights at each step, independently of the gradient.

Key differences between Adam and AdamW

  • Regularization term: Adam implementations typically realize “weight decay” as L2 regularization (coupled to the gradients), whereas AdamW treats weight decay as a separate step. This prevents the weight decay from affecting the momentum ($m_t$) and variance ($v_t$) estimates.
  • Hyperparameter decoupling: In AdamW, the optimal weight decay coefficient can be tuned independently of the learning rate. Loshchilov and Hutter (2017) showed that this separation “decouples the optimal choice of weight decay factor from the setting of the learning rate”. In regular Adam, increasing the learning rate also implicitly increases the effective weight decay strength (since the $\lambda \theta$ term gets scaled by a larger step).
  • Training dynamics: Decoupled weight decay yields more consistent regularization. AdamW does not alter the adaptive learning rates, giving a more reliable regularization effect. By contrast, in Adam the adaptive nature could cause the regularization to behave erratically across parameters or over time.
  • Convergence behavior: Empirically, AdamW often converges to a lower loss or higher accuracy than Adam given the same hyperparameters. In our own small-scale experiment (described later), we observed AdamW achieving a lower training loss than Adam for the same weight decay setting, underscoring that decoupling allows better optimization of the loss.

Practical Implementation: Adam and AdamW from Scratch

To solidify understanding, let’s implement simplified versions of Adam and AdamW in Python (NumPy). This will highlight the difference in their update rules.

NumPy Implementation of Adam

Below is a basic implementation of the Adam optimizer. For simplicity, this example assumes we are updating a single parameter vector theta (as if all parameters were flattened into one vector):

import numpy as np

def adam_update(
    theta: np.ndarray,
    grad: np.ndarray,
    m: np.ndarray,
    v: np.ndarray,
    t: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> tuple:
    # Update moments
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad**2)
    # Compute bias corrected estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Update param
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage example (one step):
theta = np.array([0.0, 0.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
t = 1
grad = np.array([0.1, -0.2])       # a random example gradient
theta, m, v = adam_update(theta, grad, m, v, t, lr=0.001)
# Can you guess the output??

This implementation follows the original Adam update rule. If we wanted to include weight decay in the original (coupled) Adam way, we would modify the gradient before the moment updates, e.g. grad += weight_decay * theta, which would mix the regularization into the moment calculations.
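
For completeness, here is a minimal sketch of that coupled variant as a thin wrapper around adam_update (the name adam_l2_update is made up for this post):

def adam_l2_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9,
                   beta2=0.999, weight_decay=0.0, eps=1e-8):
    # coupled "weight decay": fold the L2 term into the gradient *before* the moment updates
    grad = grad + weight_decay * theta
    return adam_update(theta, grad, m, v, t, lr=lr, beta1=beta1, beta2=beta2, eps=eps)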

NumPy Implementation of AdamW

For AdamW, we decouple the weight decay. This means we do not add the $ \lambda \theta $ term into grad for the moment updates. Instead, after computing the Adam step, we apply weight decay directly to theta. For example:


def adamw_update(
    theta: np.ndarray,
    grad: np.ndarray,
    m: np.ndarray,
    v: np.ndarray,
    t: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    weight_decay: float = 0.0,
    eps: float = 1e-8,
) -> tuple:
    # AdamW update: same moment updates using grad of loss only (no weight decay term added)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad**2)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Gradient-based parameter update
    theta_prev = theta.copy()
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # decoupled weight decay step (in a real model, this is typically skipped for bias terms)
    if weight_decay != 0:
        theta = theta - lr * weight_decay * theta_prev
    return theta, m, v

# usage example (one step):
theta = np.array([0.0, 0.0])
m = np.zeros_like(theta); v = np.zeros_like(theta)
grad = np.array([0.1, -0.2])
theta, m, v = adamw_update(theta, grad, m, v, t=1, lr=0.001, weight_decay=0.01)
# Can you guess the output??

Notice the extra step at the end: theta = theta - lr * weight_decay * theta_prev. This corresponds to $\theta_t = \hat{\theta}_t - \alpha \lambda \theta_{t-1}$ as discussed earlier. We keep a copy of the old parameters in theta_prev so the decay is applied to the pre-update weights, exactly as in the formula. Many implementations simply decay the already-updated weights instead (i.e. theta -= lr * weight_decay * theta); the difference between the two is $O(\mathrm{lr}^2)$ per step and is negligible in practice.
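
To see the two variants diverge in practice, here is a tiny toy comparison using the functions defined above on the quadratic loss $f(\theta) = \frac{1}{2}\|\theta\|^2$, whose gradient is simply $\theta$ (the numbers are purely illustrative):

theta_adam = np.array([1.0, -2.0]); m1 = np.zeros(2); v1 = np.zeros(2)
theta_adamw = np.array([1.0, -2.0]); m2 = np.zeros(2); v2 = np.zeros(2)
wd = 0.1
for t in range(1, 101):
    # Adam with coupled L2: the penalty is folded into the gradient
    g = theta_adam + wd * theta_adam
    theta_adam, m1, v1 = adam_update(theta_adam, g, m1, v1, t, lr=0.01)
    # AdamW: gradient of the loss only; the decay happens inside adamw_update
    theta_adamw, m2, v2 = adamw_update(theta_adamw, theta_adamw, m2, v2, t,
                                       lr=0.01, weight_decay=wd)
print(theta_adam, theta_adamw)  # the trajectories differ: the L2 term is rescaled
                                # by Adam's adaptive denominator, the decay is not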

PyTorch Implementation of AdamW

Modern libraries provide AdamW out of the box (e.g., torch.optim.AdamW in PyTorch). However, understanding a manual implementation can still be useful (e.g., when writing a custom optimizer or preparing for an interview!). Below is how one could implement a custom AdamW optimizer in PyTorch by subclassing Optimizer:

from typing import Callable, Optional

import torch
from torch.optim import Optimizer

class CustomAdamW(Optimizer):
    def __init__(self, params, lr: float = 1e-3, betas: tuple = (0.9, 0.999),
                 weight_decay: float = 0.0, eps: float = 1e-8):
        if lr <= 0.0:
            raise ValueError("Learning rate must be positive.")
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure: Optional[Callable] = None) -> Optional[torch.Tensor]:
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            wd = group['weight_decay']
            eps = group['eps']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data

                # state init
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    # init first and second moment
                    state['exp_avg'] = torch.zeros_like(p.data)
                    state['exp_avg_sq'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                state['step'] += 1
                t = state['step']

                # update biased first and second moment estimates
                # (the trailing-underscore methods operate in place for efficiency)
                exp_avg.mul_(beta1).add_(grad, alpha=1-beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1-beta2)

                # compute bias-corrected moments
                bias_corr1 = 1 - beta1**t
                bias_corr2 = 1 - beta2**t

                # compute step size
                denom = (exp_avg_sq / bias_corr2).sqrt_().add_(eps)
                step_size = lr / bias_corr1

                # gradient step
                p.data.addcdiv_(exp_avg, denom, value=-step_size)
                # Decoupled weight decay step
                if wd != 0:
                    p.data.add_(p.data, alpha=-lr * wd)
        return loss

In this implementation, we perform the standard Adam update (the addcdiv_ call applies $-\mathrm{lr} \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon)$) and then apply weight decay via p.data.add_(p.data, alpha=-lr*wd), which is equivalent to multiplying the weights by $(1 - \mathrm{lr}\cdot\lambda)$. Note that, as written, this decays every parameter it is given; in practice, you exclude bias terms by setting weight_decay=0 for them via a separate parameter group. Using this custom optimizer in a training loop would look like:

model = Model()
optimizer = CustomAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

This should behave essentially the same as PyTorch’s built-in AdamW optimizer (the built-in version applies the decay to the pre-update weights, an $O(\mathrm{lr}^2)$ difference per step, as discussed earlier). (Note: PyTorch’s Adam class also has a weight_decay parameter, but internally it implements the L2 regularization approach, not true decoupled weight decay. The AdamW class is the correct decoupled version.)
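
As mentioned above, bias (and normalization) parameters are usually excluded from weight decay by giving them their own parameter group with weight_decay=0. Here is a minimal sketch of that pattern (the rule below, treating all 1-D parameters as “no decay”, is a common heuristic rather than a hard requirement):

decay, no_decay = [], []
for name, p in model.named_parameters():
    if p.ndim <= 1 or name.endswith(".bias"):   # biases, norm scales, etc.
        no_decay.append(p)
    else:
        decay.append(p)
optimizer = CustomAdamW(
    [{"params": decay, "weight_decay": 1e-2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)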

Verifying the Difference in Practice

We can test these implementations on a simple classification task to illustrate the difference between Adam and AdamW optimizers. Consider training a 2-layer neural network on a toy binary classification dataset with nonlinear structure. Both optimizers are configured with the same hyperparameters, including a non-zero weight decay, allowing us to compare how they handle regularization and generalization:

  • Setup: A synthetic 2D “moons” dataset with noise, on which we train a small feedforward neural network (2 → 32 → 2).

  • Hyperparameters: learning_rate = 0.01, weight_decay = 0.01 for both optimizers, 100 training epochs (a minimal sketch of this setup is shown below).
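
For reference, here is a minimal sketch of how such a comparison could be wired up. This is an illustrative reconstruction of the stated setup (full-batch training, scikit-learn’s make_moons, PyTorch’s built-in optimizers), not the exact code from the repository:

import torch
import torch.nn as nn
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

def train(optimizer_cls):
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = optimizer_cls(model.parameters(), lr=0.01, weight_decay=0.01)
    losses = []
    for epoch in range(100):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return model, losses

adam_model, adam_losses = train(torch.optim.Adam)      # coupled L2 "weight decay"
adamw_model, adamw_losses = train(torch.optim.AdamW)   # decoupled weight decay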

We monitor and compare the training loss curves to observe convergence speed and stability, as well as the decision boundaries to assess how each optimizer generalizes over the input space. The results are shown in the following figures:

Figure 1: Training Loss Curves for Adam and AdamW.
Figure 2: Decision Boundaries for Adam and AdamW.

Adam tends to underperform when weight decay is applied, as it couples the weight decay term with its adaptive moment estimates, often leading to suboptimal regularization. In contrast, AdamW decouples weight decay from the gradient update, resulting in more regularized and cleaner decision boundaries. While both optimizers converge in terms of training loss, AdamW typically shows smoother, more stable convergence and superior generalization. This experiment illustrates why AdamW is commonly preferred in modern deep learning frameworks, particularly when regularization through weight decay is needed.

You can find the code used in this experiment in the GitHub repository.

Conclusion

In summary, Adam vs. AdamW comes down to how weight decay is handled. AdamW’s decoupled weight decay has proven to be a simple yet critical improvement over Adam with L2 regularization. By not letting the regularization term interfere with the adaptive learning rates, AdamW provides more reliable hyperparameter tuning, often faster convergence, and better generalization. The theoretical insights and empirical results from recent studies explain why AdamW is now so widely adopted.

For practitioners, the takeaway is clear: if you are using Adam and you need regularization, prefer AdamW (or at least ensure your optimizer separates weight decay from the momentum calculation). Implementations are straightforward, as we demonstrated, and most frameworks have this built in. That said, new optimizers are constantly being proposed, and their performance can vary significantly depending on the task.