TurboSparse Mobile: 22x Faster Mixtral Inference on PowerInfer-2

digitado ⋅ March 4, 2026

7.4 Deploy LLMs on mobile phones

Table 9: Decoding Speed on Mobile Phones (tokens/s)

We also serve TurboSparse-Mixtral-47B with PowerInfer-2, which supports LLM inference on mobile phones. PowerInfer-2 exploits sparse activation during LLM inference and introduces a computational engine for heterogeneous XPUs. It can sustain high-speed inference even when the model parameters exceed DRAM capacity. As shown in Table 9, PowerInfer-2 running TurboSparse-Mixtral-47B achieves a 22.2× speedup over llama.cpp running the original Mixtral-47B. This gain comes primarily from PowerInfer-2 fully exploiting the extremely high sparsity that TurboSparse exhibits during inference.
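
To make the sparse-activation idea concrete, here is a minimal sketch in plain Python. It is not PowerInfer-2's engine and does not reproduce any number in Table 9. It assumes a simple ReLU feed-forward layer, and every name in it (`dense_ffn`, `sparse_ffn`, `oracle_active_neurons`) is hypothetical. The point is only that neurons whose ReLU output is zero contribute nothing to the result, so their rows of the down-projection weights never need to be read or multiplied.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def dense_ffn(x, w_up, w_down):
    """Reference FFN: every hidden neuron is computed and projected down."""
    d_out = len(w_down[0])
    y = [0.0] * d_out
    for up_row, down_row in zip(w_up, w_down):
        h = relu(dot(up_row, x))
        for j in range(d_out):
            y[j] += h * down_row[j]
    return y


def oracle_active_neurons(x, w_up):
    """Hypothetical stand-in for an activation predictor.

    It returns the exact set of neurons with a positive pre-activation.
    A real system would estimate this set cheaply instead of computing it.
    """
    return [i for i, up_row in enumerate(w_up) if dot(up_row, x) > 0.0]


def sparse_ffn(x, w_up, w_down, active):
    """Same FFN, but only the neurons listed in `active` are touched."""
    d_out = len(w_down[0])
    y = [0.0] * d_out
    for i in active:
        h = relu(dot(w_up[i], x))
        for j in range(d_out):
            y[j] += h * w_down[i][j]
    return y


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden = 8, 64
x = [rng.uniform(-1, 1) for _ in range(d_model)]
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]

active = oracle_active_neurons(x, w_up)
y_dense = dense_ffn(x, w_up, w_down)
y_sparse = sparse_ffn(x, w_up, w_down, active)

print(f"active neurons: {len(active)}/{d_hidden}")
print("max abs diff:", max(abs(a - b) for a, b in zip(y_dense, y_sparse)))
```

With random symmetric weights, roughly half of the toy neurons fire, so the saving here is modest. The higher the sparsity, the fewer weight rows the sparse path touches, which is the property the paragraph above credits for the speedup. Note that the oracle in this sketch still evaluates every up-projection, so it only illustrates correctness of skipping inactive neurons, not an end-to-end cost model.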

:::info
Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::

:::info
This paper is available on arXiv under the CC BY 4.0 license.

:::
