I implemented DPO from the paper and the reward margin hit 599. Here’s what that actually means

DPO (Rafailov et al., NeurIPS 2023) is supposed to be the clean alternative to PPO. No reward model in the training loop, no value function, no rollout collection. Just a binary cross-entropy loss over preference pairs. And the math is elegant: the partition function Z(x) cancels out when you substitute the log-ratio reparameterisation into the Bradley-Terry model.
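For reference, the cancellation works like this. The KL-constrained reward maximisation has the closed-form optimum

```latex
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\left(\tfrac{1}{\beta} r(x, y)\right)
\;\Longrightarrow\;
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).
```

Substituting into the Bradley-Terry model p(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)), the β log Z(x) terms are identical for both responses and cancel, leaving the DPO loss

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right],
```

which is exactly the binary cross-entropy over preference pairs mentioned above.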

I implemented it from scratch as part of a multi-stage RLHF project (same model, same tokenizer, same evaluation suite as my PPO and GRPO implementations). Here’s what actually happened.

The get_logps function

This is where silent failures live. The shift has to be exact:

```python
shift_logits = logits[:, :-1, :]      # predict positions 1..T
shift_labels = input_ids[:, 1:]       # actual tokens 1..T
shift_mask = response_mask[:, 1:]     # only response positions
```

The mask shifts by one to align with shifted labels. Get this wrong and the loss looks normal while the model is supervising prompt tokens instead of response tokens. No obvious error signal.
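To make the alignment concrete, here is a minimal sketch of a get_logps-style function (NumPy stand-in for the torch version; the shapes and the uniform-logits toy check are my assumptions, not the post’s actual tensors):

```python
import numpy as np

def get_logps(logits, input_ids, response_mask):
    """Sum of response-token log-probs per sequence.
    logits: (B, T, V); input_ids, response_mask: (B, T).
    The logits at position t predict the token at position t+1."""
    shift_logits = logits[:, :-1, :]   # predictions for tokens 1..T-1
    shift_labels = input_ids[:, 1:]    # actual tokens 1..T-1
    shift_mask = response_mask[:, 1:]  # mask shifted to align with labels
    # log-softmax over the vocab dimension
    logz = np.log(np.exp(shift_logits).sum(-1, keepdims=True))
    logps = shift_logits - logz
    # pick out the log-prob of each actual next token
    tok_logps = np.take_along_axis(logps, shift_labels[..., None], axis=-1)[..., 0]
    # sum only over response positions
    return (tok_logps * shift_mask).sum(-1)

# toy check: vocab=3, T=4, uniform logits -> every token has log-prob log(1/3)
logits = np.zeros((1, 4, 3))
input_ids = np.array([[0, 1, 2, 1]])
mask = np.array([[0, 0, 1, 1]])  # last two tokens are the "response"
out = get_logps(logits, input_ids, mask)  # 2 * log(1/3) ≈ -2.197
```

Note that the shifted mask is [[0, 1, 1]], so exactly two tokens are scored; an unshifted mask would silently score the wrong positions while producing a perfectly plausible-looking number.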

What reward hacking looks like in a loss curve

By step 30, loss = 0.0 and accuracy = 1.0. This looks like fast convergence. It isn’t.

The reward margin tells the real story:

Step   Margin
  30     56.9
  70    240.7
 150    599.2

A healthy margin is 1–10. At 599 the policy has drifted so far from the reference that it assigns near-zero probability to the rejected response for every pair. The model memorised the preference signal rather than learning a generalizable preference.

Root cause: batch size of 1 with no averaging. Each update can completely overfit one (chosen, rejected) pair before moving to the next.
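To see why a single pair per step can do this: the per-pair loss is -log σ(margin), which the optimiser can drive to ~0 by inflating that one pair’s margin without bound. A pure-Python sketch of the relationship (margin values illustrative, not from the run):

```python
import math

def dpo_pair_loss(margin):
    """Per-pair DPO loss as a function of the implied reward margin,
    margin = beta * [(logp_c - ref_logp_c) - (logp_r - ref_logp_r)]."""
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

for m in [0.0, 1.0, 10.0, 100.0]:
    print(f"margin={m:6.1f}  loss={dpo_pair_loss(m):.6f}")
```

At margin 0 the loss is log 2 ≈ 0.693; by margin 10 it is already below 1e-4. Past that point the gradient signal is essentially gone, so margins of 50–600 buy nothing except drift from the reference.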

What the step 20 behaviour tells you

At step 20: loss = 0.693, accuracy = 0.0, margin = 0.0.

0.693 = log(2) = -log(σ(0)). This is the degenerate case the theory predicts: when the policy exactly mirrors the reference, all log-ratios are zero, the DPO margin is zero, and the loss equals log 2. The model is assigning equal probability to chosen and rejected. Seeing this in a real training run is a nice confirmation that the implementation is correct.
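The degenerate case is easy to reproduce numerically. This sketch assumes accuracy is computed with a strict margin > 0 comparison, which would also explain the 0.0 accuracy (ties count as wrong):

```python
import math

# policy == reference at initialisation, so every per-pair log-ratio is zero
pi_logratio_chosen, pi_logratio_rejected = 0.0, 0.0
margin = pi_logratio_chosen - pi_logratio_rejected  # 0.0

loss = -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(0)) = log(2)
accuracy = float(margin > 0)                        # strict > yields 0.0 on a tie

print(loss, accuracy)  # 0.6931... 0.0
```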

The verdict

The architecture is sound. The loss, the frozen reference model, the get_logps masking, the RM-free training loop: all correct. What broke was the training configuration, not the algorithm. After these Phase 1 results (avg reward: 2.40), I tuned β from 0.1 to 0.3, added proper batching, and compared head-to-head against PPO and GRPO on the same 16 prompts.

The full comparison is in a separate write-up. The ranking completely reversed after tuning. DPO went from 3rd to 1st.

Full DPO implementation post: brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog.html

Full comparison study: brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html

Happy to answer questions on any of the implementation details.

submitted by /u/Public_Expression_92