Adam vs. AdamW: A Practical Deep Dive into Optimizer Differences

Hello! This is my first blog post ever! I'll be writing about Adam and AdamW. It's always good to go back to the basics and brush up on what's happening under the hood :), so let's get started.

Background: Adam Optimizer Overview

Adam (Adaptive Moment Estimation) is a popular stochastic optimizer introduced by Kingma and Ba (2014). It combines ideas from momentum and RMSProp to adapt the learning rate for each parameter. Mathematically, Adam maintains an exponentially decaying average of past gradients (the first moment) and of past squared gradients (the second moment). At each step $t$, for each parameter $\theta$, Adam updates these estimates as: ...
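To make the moment estimates concrete, here is a minimal NumPy sketch of a single Adam step following the standard form from Kingma and Ba (2014). The function name `adam_step` and its signature are purely illustrative, not taken from any library:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `theta` given its gradient `grad`.

    m, v : running first- and second-moment estimates (same shape as theta)
    t    : 1-based step counter, used for bias correction
    """
    # Exponentially decaying averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both moments start at zero and are biased toward zero early on
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter update, scaled by the square root of the second-moment estimate
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In a training loop you would initialize `m` and `v` to zeros with the same shape as `theta` and increment `t` once per step; the bias correction is what keeps early updates from being vanishingly small.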
