[R] LLaDA2.1 vs Qwen3 30B A3B: Benchmarking discrete diffusion LLMs against autoregressive MoE models

Been digging into the LLaDA2.1 paper (arXiv:2602.08676) and ran some comparisons that I think are worth discussing. The core claim is that discrete diffusion language models can now compete with AR models on quality while offering substantially higher throughput. The numbers are interesting but the tradeoffs are more nuanced than the headline results suggest.

The paper introduces a T2T (Token to Token) editing mechanism on top of the standard M2T (Mask to Token) scheme, controlled by dual thresholds τmask and τedit. This lets the model retroactively correct errors during parallel decoding, which addresses the local inconsistency issues Kang et al. pointed out earlier this year. They also present EBPO (ELBO based Block level Policy Optimization) which they claim is the first large scale RL framework for dLLMs, noting that prior work like SPG, TraceRL, and ESPO struggled with variance and compute costs. The training stack uses dFactory for CPT/SFT and extends the AReaL framework for RL, which seems purpose built for this architecture.

Here’s what caught my attention in the benchmarks across 33 tasks:

Qwen3 30B A3B Inst 2507: 73.09 avg Ling flash 2.0: 71.52 avg LLaDA2.1 flash S Mode: 72.34 avg LLaDA2.1 flash Q Mode: 73.54 avg

So Q Mode slightly edges out Qwen3, but S Mode actually underperforms LLaDA2.0 (72.43). The throughput story is where it gets compelling: LLaDA2.1 flash with quantization hits 674.3 TPS average in S Mode versus Qwen3 30B A3B at 240.2 TPS. The mini model peaks at 1586.93 TPS on HumanEval+.

The Multi Block Editing results show consistent gains (ZebraLogic 84.20→88.20, AIME 2025 63.33→70.00) but at the cost of TPF dropping from 5.82 to 5.14.

I pulled the repo and ran the mini model on some coding tasks using their customized SGLang setup with per block FP8 quantization on a pair of A100s. The speed difference is immediately noticeable and roughly in line with their reported numbers, though I did observe the stuttering artifacts they mention when pushing τmask too low. The ngram repetition issue is real and shows up faster than I expected on open ended prompts. What I find most honest about the paper is the limitations section. They explicitly state that aggressive threshold settings produce rough drafts with these artifacts, and that S Mode can cause undesirable output in general chat scenarios even though it works well for code and math. The threshold parameters also need domain specific tuning.

A few things I’m curious about after spending time with this. The speed versus quality tradeoff seems heavily dependent on task domain. Has anyone tested the S/Q mode split on tasks outside their benchmark suite? The EBPO approach uses ELBO as a proxy for exact likelihood with vectorized estimation, and for those familiar with dLLM training, I’m wondering how this compares to the variance issues in prior RL attempts. Also, the paper positions the dual threshold system as a user configurable continuum but in practice, how sensitive is performance to threshold selection across different use cases?

Paper: https://arxiv.org/abs/2602.08676 Code: https://github.com/inclusionAI/LLaDA2.X

Models available: LLaDA2.1 Mini (16B) and LLaDA2.1 Flash (100B)

submitted by /u/Inevitable_Wear_9107
[link] [comments]

Liked Liked