TurboSparse Mobile: 22x Faster Mixtral Inference on PowerInfer-2

Table of Links

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations about Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Expert still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploy LLMs on mobile phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

7.4 Deploy LLMs on mobile phones

We also serve TurboSparse-Mixtral-47B with PowerInfer-2, which supports LLM inference on mobile phones. PowerInfer-2 leverages sparse activation during LLM inference and introduces a computational engine for heterogeneous XPUs. It can perform high-speed inference even when the model parameters exceed DRAM capacity. As shown in Table 9, PowerInfer-2 running TurboSparse-Mixtral-47B achieves a 22.2× speedup over llama.cpp running the original Mixtral-47B. This gain comes primarily from PowerInfer-2 fully exploiting the extremely high sparsity that TurboSparse exhibits during inference.

Table 9: Decoding Speed on Mobile Phones (tokens/s)
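To make the link between activation sparsity and decoding speed concrete, the following is a minimal sketch of a gated feed-forward layer that skips every neuron whose activation is exactly zero. It is not PowerInfer-2's engine and does not model its XPU scheduling or its handling of weights that do not fit in DRAM. The function names (`sparse_ffn`, `dense_ffn`), the toy dimensions, and the random weights are all hypothetical, and the gating form (ReLU applied to both the gate and the up projection) is assumed from the dReLU design named in the table of links.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_ffn(x, w_gate, w_up, w_down):
    """Reference gated FFN: every neuron is computed, active or not."""
    out = [0.0] * len(w_down[0])
    for g_row, u_row, d_row in zip(w_gate, w_up, w_down):
        h = relu(dot(x, g_row)) * relu(dot(x, u_row))
        for j, d in enumerate(d_row):
            out[j] += h * d
    return out


def sparse_ffn(x, w_gate, w_up, w_down):
    """Same layer, but neurons with a zero activation are skipped.

    Returns the output vector and the number of neurons actually used.
    """
    out = [0.0] * len(w_down[0])
    active = 0
    for g_row, u_row, d_row in zip(w_gate, w_up, w_down):
        g = relu(dot(x, g_row))
        if g == 0.0:
            continue  # gate is zero: skip the up projection and the down row
        u = relu(dot(x, u_row))
        if u == 0.0:
            continue  # up projection is zero: skip the down row
        active += 1
        h = g * u
        for j, d in enumerate(d_row):
            out[j] += h * d
    return out, active


if __name__ == "__main__":
    # Toy sizes and random weights, chosen only for illustration.
    random.seed(0)
    dim, neurons = 8, 64
    x = [random.gauss(0, 1) for _ in range(dim)]
    w_gate = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(neurons)]
    w_up = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(neurons)]
    w_down = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(neurons)]
    _, active = sparse_ffn(x, w_gate, w_up, w_down)
    print(f"active neurons: {active} of {neurons}")
```

Both functions return the same output, because a neuron with zero activation contributes nothing to the sum. The sparse version simply never reads the down-projection rows of inactive neurons, so the higher the sparsity, the less weight data has to be fetched and multiplied per token.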

:::info
Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::


:::info
This paper is available on arXiv under the CC BY 4.0 license.

:::
