[D] How could ZeRO-1 be faster than ZeRO-2?
Recently, I have been diving into parallel training: I read the Ultra-Scale Playbook and technical reports from the major players. Most of it made sense intuitively, but one part stood out: the data parallelism (DP) strategy used in practice.

First, in the playbook, they ran an extensive study across several thousand distributed configurations to find the optimal parameters empirically (screenshot below). I see how ZeRO-0 (vanilla DP) could make sense. But why would ZeRO-1 be faster than ZeRO-2?

Next, DeepSeek V3 was trained with the same pattern: ZeRO-1 rather than ZeRO-2 (screenshot below).

ZeRO-1 and ZeRO-2 require the same data to be communicated. The way I see it, the only difference is that with ZeRO-1 we keep storing all gradients on every node for pretty much no reason, since the optimizer state is already sharded.

Why would they use ZeRO-1 over ZeRO-2? Why would anyone?

submitted by /u/fxlrnrpt
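To make the comparison in the question concrete, here is a rough back-of-the-envelope sketch of per-GPU memory and per-step communication for ZeRO stages 0 to 2. It assumes the usual mixed-precision Adam accounting (2 bytes per parameter for half-precision weights, 2 for half-precision gradients, 12 for fp32 optimizer state) and exactly one gradient reduction per optimizer step, i.e. no gradient accumulation. The function names and the example sizes are hypothetical, chosen only for illustration; this is not taken from the playbook or from any framework.

```python
def zero_memory_bytes(num_params: int, dp_size: int, stage: int) -> float:
    """Approximate per-GPU bytes for weights + gradients + optimizer state.

    Assumption: mixed-precision Adam, activations and buffers ignored.
    """
    weights = 2 * num_params   # half-precision weights, replicated in stages 0-2
    grads = 2 * num_params     # half-precision gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + variance

    if stage >= 1:
        optim /= dp_size       # ZeRO-1: shard optimizer state across DP ranks
    if stage >= 2:
        grads /= dp_size       # ZeRO-2: additionally shard gradients
    return weights + grads + optim


def comm_elements_per_step(num_params: int, stage: int) -> int:
    """Approximate elements each rank sends per optimizer step.

    Assumption: one reduction per step (no gradient accumulation).
    Stage 0 all-reduces gradients, which costs about as much as a
    reduce-scatter plus an all-gather. Stages 1 and 2 reduce-scatter
    gradients and all-gather the updated parameters. Either way the
    volume is roughly 2 * num_params.
    """
    if stage not in (0, 1, 2):
        raise ValueError("this sketch only covers ZeRO stages 0, 1 and 2")
    return 2 * num_params


if __name__ == "__main__":
    n, dp = 7_000_000_000, 64  # hypothetical model size and DP degree
    for stage in (0, 1, 2):
        gib = zero_memory_bytes(n, dp, stage) / 2**30
        vol = comm_elements_per_step(n, stage)
        print(f"ZeRO-{stage}: ~{gib:.1f} GiB per GPU, ~{vol:.2e} elements sent per step")
```

Under these assumptions the communication volume comes out identical for all three stages, and the only thing ZeRO-2 changes relative to ZeRO-1 is the gradient term in the memory estimate, which is exactly the premise of the question.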