Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
arXiv:2601.14304v1 Announce Type: new Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight […]