fMoE: Fine-Grained Expert Offloading for MoE Serving

13/08/2025 · 12 min

Listen "fMoE: Fine-Grained Expert Offloading for MoE Serving"

Episode Synopsis

This February 2025 paper introduces fMoE, a fine-grained expert offloading system designed to improve the serving efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The paper highlights the memory inefficiency of MoE-based LLMs during inference, where inactive experts sit idle in GPU memory, and the limitations of existing coarse-grained offloading solutions, which struggle with the latency-memory trade-off. fMoE addresses these challenges by tracking iteration-level expert probability distributions in "expert maps" and using the semantic embeddings of inputs to guide expert prefetching, caching, and offloading decisions. Experiments show that fMoE significantly reduces inference latency and improves expert hit rates compared to state-of-the-art methods.

Source: https://arxiv.org/html/2502.05370v1
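To make the core idea concrete, here is a minimal Python sketch of expert-map-guided prefetching. This is not fMoE's actual implementation: the layer/expert counts, the cosine-similarity lookup, the `ExpertMapStore` class, and the top-k prefetch policy are all illustrative assumptions standing in for the paper's mechanism of matching a new input's semantic embedding against stored expert maps to decide which experts to load into GPU memory.

```python
import numpy as np

# Hypothetical sizes; the real system tracks these per MoE layer.
NUM_LAYERS = 4    # number of MoE layers (assumption for this sketch)
NUM_EXPERTS = 8   # experts per layer (assumption)
CACHE_SLOTS = 3   # GPU cache slots per layer (assumption)

class ExpertMapStore:
    """Stores (semantic embedding, expert map) pairs from past requests.

    An "expert map" here is a [NUM_LAYERS x NUM_EXPERTS] matrix of router
    probabilities observed for one request, as described in the synopsis.
    """
    def __init__(self):
        self.embeddings = []  # unit-normalized embedding vectors
        self.maps = []        # matching expert maps

    def add(self, embedding, expert_map):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.maps.append(expert_map)

    def nearest_map(self, embedding):
        """Return the expert map of the most similar past input (cosine)."""
        q = embedding / np.linalg.norm(embedding)
        sims = [float(q @ e) for e in self.embeddings]
        return self.maps[int(np.argmax(sims))]

def prefetch_plan(expert_map, k=CACHE_SLOTS):
    """Pick the k most probable experts per layer to keep on the GPU;
    everything else stays offloaded in host memory."""
    return {layer: np.argsort(expert_map[layer])[::-1][:k].tolist()
            for layer in range(NUM_LAYERS)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    store = ExpertMapStore()
    # Record a few past requests: embedding + observed routing distribution.
    for _ in range(5):
        emb = rng.normal(size=16)
        probs = rng.dirichlet(np.ones(NUM_EXPERTS), size=NUM_LAYERS)
        store.add(emb, probs)
    # New request: reuse the expert map of the semantically closest input.
    new_emb = rng.normal(size=16)
    plan = prefetch_plan(store.nearest_map(new_emb))
    print(plan)  # e.g. {0: [expert ids], 1: [...], ...}
```

The sketch captures why fine granularity matters: decisions are made per layer from full probability distributions rather than treating the whole model's expert set as one coarse cache, so a small GPU budget can be spent on the experts most likely to be activated for this particular input.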