Running MoE on Mobile Phones: Meta Proposes MobileMoE, With Up to 3.8 Times Faster Inference on iPhone 16 Pro
https://eu.36kr.com/en/p/3831266999887490
Publish Date: 2026-06-01 02:30:00
Source Domain: eu.36kr.com
In recent years, Mixture of Experts (MoE) models have become common in large cloud-hosted models. On mobile phones, however, large language models (LLMs) still mostly use dense architectures. Mobile devices have historically faced tighter constraints on memory, computing power, and latency, and edge-side MoE with fewer than one billion active parameters had not been studied systematically. Now that the DRAM capacity of mobile devices is growing, MoE has a chance to be deployed on smartphones as well.
MobileMoE, proposed by the Meta team, achieves efficient MoE inference on commercial smartphones for the first time. Across 14 benchmarks, at similar memory usage, MobileMoE-S/M uses only 1/2 to 1/4 of the inference computation of the dense baselines while reaching comparable or even higher average accuracy. In on-device tests, MobileMoE-S shows its largest speedup on the GPU/MLX backend of the iPhone 16 Pro, up to 3.8 times in the input (prefill) stage.
Paper link: https://arxiv.org/abs/2605.27358
The research team also proposed a set of edge-side MoE scaling rules for choosing model structures better suited to mobile phone deployment. MobileMoE establishes a new Pareto frontier for edge-side large language models, offering a better trade-off between accuracy and inference computation overhead.
Figure | MobileMoE has established a new Pareto frontier for edge-side large language models.
How is MobileMoE designed?
MobileMoE can be understood as an MoE language model designed for edge-side deployment. The overall structure is still a decoder-only Transformer, but the dense feed-forward layer is replaced with an MoE layer. For each token, the router selects the few experts with the highest scores to take part in the computation, alongside a shared expert that always participates. The entire training process…
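To make the routing idea concrete, here is a minimal sketch in plain Python of an MoE feed-forward layer with top-k routing and an always-active shared expert. It illustrates the general technique only and is not the paper's implementation. The class names (Expert, MoELayer), the layer sizes, the ReLU activation, and the choice to renormalize the selected router scores with a softmax are all assumptions made for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng):
    """Random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical two-layer feed-forward expert with ReLU (an assumption)."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = make_matrix(d_hidden, d_model, rng)
        self.w_out = make_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class MoELayer:
    """Top-k routed experts plus one shared expert that always runs."""

    def __init__(self, d_model, d_hidden, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_matrix(num_experts, d_model, rng)
        self.experts = [Expert(d_model, d_hidden, rng) for _ in range(num_experts)]
        self.shared = Expert(d_model, d_hidden, rng)

    def route(self, x):
        """Return indices of the top-k experts and their mixing weights."""
        scores = matvec(self.router, x)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = order[: self.top_k]
        # Assumption: softmax over the selected scores only.
        peak = max(scores[i] for i in chosen)
        exps = [math.exp(scores[i] - peak) for i in chosen]
        total = sum(exps)
        return chosen, [e / total for e in exps]

    def __call__(self, x):
        chosen, weights = self.route(x)
        out = self.shared(x)  # the shared expert processes every token
        for idx, weight in zip(chosen, weights):
            expert_out = self.experts[idx](x)  # only selected experts are evaluated
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen


layer = MoELayer(d_model=8, d_hidden=16, num_experts=4, top_k=2)
token = [0.1 * i for i in range(8)]
output, chosen = layer(token)
```

With these illustrative settings, each token runs through 2 of the 4 routed experts plus the shared expert. All expert weights still have to sit in memory, but only a fraction of them is computed per token, which is the property the article describes.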