TurboSparse-LLM Performance: Outperforming Mixtral and Gemma with Extreme Sparsity
Table of Links

Abstract and 1. Introduction
2. Related Work and Background
3. Analysis
    3.1 Limitations of Existing ReLUfication
    3.2 dReLU
4. Are Neurons in Expert still Sparsely Activated?
5. dReLU Sparsification
6. Experiments Results
    6.1 Downstream Tasks Performance
    6.2 Sparsity of Sparsified Models
7. Practical Inference Speedup Evaluation
    7.1 Experiments Setting
    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference
    7.4 Deploy LLMs on mobile phones
8. Conclusion and References
A. Appendix / supplemental material
B. Limitation
C. Broader Impact

6 Experiments Results

6.1 […]