Exploring Distributed Deep Learning with LizarDist
In this post, I’m laying the groundwork for my experiments with distributed training strategies using PyTorch and mpi4py. I want to explore approaches like distributed data parallelism, tensor parallelism, and hybrid strategies, digging into how communication, communication/computation overlap, and scaling trade-offs work under the hood. LizarDist is the playground I’m building to test and learn these concepts. My goal is not just to build something that works, but to truly internalize the theory and practical challenges of distributed deep learning. ...
Part 1: How I Built My Own (tiny) Distributed Data Parallel Engine (LizarDist)
1. What is Distributed Data Parallel?
Distributed Data Parallel (DDP) is a training strategy in which multiple processes (typically one per GPU) each hold a replica of the model. Every process gets a different mini-batch of the data (hence “data parallel”), computes the forward and backward passes locally, and then synchronizes gradients across processes so that the model updates stay consistent globally.
2. How does Distributed Data Parallelism work?
Let’s assume we have: ...
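To make the synchronization step above concrete, here is a minimal sketch of gradient averaging with mpi4py after a local backward pass. The helper name `allreduce_gradients` is mine, just for illustration; this is the core idea, not necessarily how LizarDist implements it.

```python
# Minimal sketch of DDP gradient synchronization: every rank computes gradients
# on its own mini-batch, then gradients are summed across ranks and divided by
# the world size so all replicas apply the same update.
from mpi4py import MPI
import numpy as np
import torch

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all ranks."""
    for param in model.parameters():
        if param.grad is None:
            continue
        send = param.grad.detach().cpu().numpy()
        recv = np.empty_like(send)
        comm.Allreduce(send, recv, op=MPI.SUM)  # sum this gradient over all ranks
        param.grad.copy_(torch.from_numpy(recv / world_size))  # write back the average

# Typical training step (model, criterion, optimizer defined elsewhere):
#   loss = criterion(model(x), y)
#   loss.backward()
#   allreduce_gradients(model)   # gradients are now identical on every rank
#   optimizer.step()
```

Launched with something like `mpirun -np 4 python train.py`, each rank would see a different data shard but end every step with the same averaged gradients.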
Adam vs. AdamW: A Practical Deep Dive into Optimizer Differences
Hello! This is my first blog post ever!! I’ll write about Adam and AdamW. It’s always good to go back to the basics and brush up on what’s happening under the hood :), so let’s get started.
Background: Adam Optimizer Overview
Adam (Adaptive Moment Estimation) is a popular stochastic optimizer introduced by Kingma and Ba (2014). It combines ideas from momentum and RMSProp to adapt the learning rate for each parameter. Mathematically, Adam maintains an exponentially decaying average of past gradients (the first moment) and of past squared gradients (the second moment). At each step $t$, for each parameter $\theta$, Adam updates these estimates as: ...
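For reference, here are the standard Adam updates as given in the original paper (the notation may differ slightly from the rest of this post):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

where $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, and $\beta_1$, $\beta_2$, $\epsilon$ are the usual hyperparameters.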