Mapping Gemma3 onto an Edge Dataflow Architecture

arXiv:2602.06063v1 Announce Type: new
Abstract: We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which re-structures attention to sustain high memory bandwidth utilization. Together with a compact Q4NX 4-bit quantization format, these methods yield up to $5.2times$ faster prefill and $4.8times$ faster decoding versus the iGPU, and $33.5times$ and $2.2times$ over the CPU, respectively. Power efficiency improves by as much as $67.2times$ and $222.9times$ compared to the iGPU and CPU. The proposed approach demonstrates that modern NPUs can deliver practical, low-power LLM and VLM inference at the edge, and provides a generalizable blueprint for mapping transformer-based models onto tiled dataflow accelerators.

Liked Liked