[D] How could ZeRO-1 be faster than ZeRO-2?

Recently, I have been diving into parallel training. I read the Ultra-Scale Playbook and technical reports from the major players.

Most of it made sense intuitively, but one part stood out: the data parallelism (DP) strategy that gets used in practice.

First, in the book, they ran an extensive study across several thousand distributed configurations to find the optimal parameters empirically (screenshot below).

I see how ZeRO-0 (vanilla DP) could make sense. But why would ZeRO-1 be faster than ZeRO-2?

https://preview.redd.it/xua9g0nls9kg1.png?width=988&format=png&auto=webp&s=3f59b79688ba8425a2951df5bf34fba16096ed85

Next, DeepSeek V3 was trained with the same pattern, ZeRO-1 rather than ZeRO-2 (screenshot below).

https://preview.redd.it/lui7hz98t9kg1.png?width=1576&format=png&auto=webp&s=4a862df722e0cccdb2ed3d9afd927ef7b05031d1

ZeRO-1 and ZeRO-2 require the same data to be communicated. The way I see it, the only difference is that ZeRO-1 keeps a full copy of the gradients on every node for pretty much no reason, since the optimizer state is already sharded.
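To put numbers on that difference, here is a rough sketch of the per-GPU memory accounting from the ZeRO paper. It assumes mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimizer state. The helper name `zero_memory_bytes` and the example sizes are made up for illustration. It only models memory for model states, not activations, communication, or step time.

```python
def zero_memory_bytes(num_params: int, dp_degree: int, stage: int, k: int = 12) -> float:
    """Approximate per-GPU bytes for model states (hypothetical helper).

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k bytes/param for optimizer state.
    """
    params = 2 * num_params   # fp16 weights, replicated in stages 0-2
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # fp32 master weights, momentum, variance

    if stage >= 1:
        optim /= dp_degree    # ZeRO-1: shard optimizer state across DP ranks
    if stage >= 2:
        grads /= dp_degree    # ZeRO-2: also shard gradients
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative sizes only: a 7.5B-parameter model on 64 DP ranks.
    psi, n = 7_500_000_000, 64
    for stage in (0, 1, 2):
        gb = zero_memory_bytes(psi, n, stage) / 1e9
        print(f"ZeRO-{stage}: {gb:.1f} GB per GPU")
```

For these example sizes it gives roughly 120 GB, 31.4 GB, and 16.6 GB per GPU for ZeRO-0, ZeRO-1, and ZeRO-2. So under these assumptions ZeRO-2 nearly halves the model-state memory relative to ZeRO-1, which is what makes the choice of ZeRO-1 puzzling to me.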

Why would they use ZeRO-1 over ZeRO-2? Why would anyone?

submitted by /u/fxlrnrpt