Exploring Distributed Deep Learning with LizarDist

In this post, I’m laying the groundwork for my experiments with distributed training strategies using PyTorch and mpi4py. I want to explore approaches like distributed data parallelism, tensor parallelism, and hybrid strategies, digging into how communication costs, the overlap of communication with computation, and scaling trade-offs play out under the hood. LizarDist is the playground I’m building to test and learn these concepts. My goal is not just to build something that works, but to truly internalize the theory and the practical challenges of distributed deep learning. ... A minimal setup sketch follows this entry.

June 30, 2025 · 1 min · 129 words
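
For concreteness, here is a minimal sketch of the kind of PyTorch + mpi4py groundwork the announcement refers to: discovering each process’s rank and world size and pinning it to a device. This is illustrative only, under my own assumptions, and says nothing about LizarDist’s actual API.

```python
# Minimal, illustrative sketch of the PyTorch + mpi4py groundwork (not LizarDist's API).
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # index of this process
world_size = comm.Get_size()  # total number of processes launched by mpirun

# Pin each rank to a GPU if one is available, otherwise fall back to CPU.
if torch.cuda.is_available():
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
else:
    device = torch.device("cpu")

print(f"rank {rank}/{world_size} running on {device}")
```

Launched with something like `mpirun -np 4 python setup_sketch.py`, each process prints its own rank and device.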

Part 1: How I Built My Own (tiny) Distributed Data Parallel Engine (LizarDist)

1. What is Distributed Data Parallel?

Distributed Data Parallel (DDP) is a training strategy in which multiple processes (typically one per GPU) each hold a replica of the model. Every process gets a different shard of the data (hence “data parallel”), served as mini-batches; it computes the forward and backward passes locally and then synchronizes gradients across processes so that the model updates stay consistent globally. A minimal sketch of this gradient synchronization follows this entry.

2. How does Distributed Data Parallelism work?

Let’s assume we have: ...

June 30, 2025 · 6 min · 1258 words
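
As a companion to the Part 1 summary above, here is a minimal sketch of the gradient synchronization step it describes: every rank holds the same model replica, runs forward and backward on its own mini-batch, and then averages gradients across ranks with an allreduce before the optimizer step. This is an illustration under my own assumptions, not LizarDist’s implementation; the helper names (broadcast_parameters, allreduce_gradients) and the toy model are made up for the example.

```python
# Illustrative DDP sketch with mpi4py + PyTorch (not LizarDist's actual code).
from mpi4py import MPI
import torch
import torch.nn as nn

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()


def broadcast_parameters(model: nn.Module, root: int = 0) -> None:
    """Copy rank `root`'s weights to every rank so all replicas start identical."""
    for param in model.parameters():
        data = param.data.detach().cpu().numpy()
        comm.Bcast(data, root=root)
        param.data.copy_(torch.from_numpy(data))


def allreduce_gradients(model: nn.Module) -> None:
    """Sum each parameter's gradient across ranks, then divide to get the average."""
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad.detach().cpu().numpy()
        comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
        param.grad.copy_(torch.from_numpy(grad) / world_size)


model = nn.Linear(10, 1)           # each rank builds its own replica
broadcast_parameters(model)        # make the replicas identical before training
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

torch.manual_seed(1234 + rank)     # stand-in for data sharding: a different batch per rank
x = torch.randn(32, 10)            # this rank's mini-batch
loss = model(x).pow(2).mean()
loss.backward()

allreduce_gradients(model)         # gradient sync: the step that makes this DDP
optimizer.step()                   # every rank applies the same averaged update
```

Run with something like `mpirun -np 4 python ddp_sketch.py`. A production DDP engine would bucket parameters and overlap the allreduce with the backward pass rather than syncing parameter by parameter afterwards, which is presumably where the communication/computation overlap experiments mentioned in the announcement come in.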