MoE Projects

MoE

MoE scales model capacity by activating only a sparse subset of specialized parameters for each input token.

Mixture of Experts (MoE) replaces dense feed-forward layers with a collection of specialized sub-networks (experts) selected by a learned gating network, or router. This lets models such as Mixtral 8x7B (and, reportedly, GPT-4) hold far more total parameters than they activate for any single token, so inference cost stays close to that of a much smaller dense model. Because each token is routed to only its top-scoring experts, typically the top 1 or 2, total parameter count can grow several-fold without a matching increase in per-token FLOPs (floating point operations); the router adds only a small overhead. This sparse activation strategy is one of the main techniques behind efficient scaling in current state-of-the-art language models.

Shazeer et al., 2017, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer": https://arxiv.org/abs/1701.06538
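
As a concrete illustration of the routing described above, here is a minimal sketch of a top-k gated MoE feed-forward layer, assuming PyTorch. The class names (Expert, MoELayer), the dimensions, and the per-expert dispatch loop are illustrative choices for readability, not the implementation used by Mixtral or the paper above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One feed-forward expert: the sub-network a token may be routed to."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Sparse MoE layer: a router scores every expert for every token, but only
    the top-k experts are evaluated, so per-token compute stays close to a single
    dense feed-forward block while total parameters grow with num_experts."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # the gating network
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch and sequence dims before calling.
        logits = self.router(x)                               # (tokens, experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)   # keep the best k experts
        weights = F.softmax(top_vals, dim=-1)                 # renormalize over the k picked

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                            # chosen expert per token
            w = weights[:, slot].unsqueeze(-1)                # its gate weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                # run only experts that received tokens
                    out[mask] += w[mask] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer(d_model=64, d_hidden=256, num_experts=8, top_k=2)
    tokens = torch.randn(10, 64)      # 10 tokens of width 64
    print(layer(tokens).shape)        # torch.Size([10, 64])
```

The looped dispatch keeps the sketch readable; production implementations instead batch tokens per expert and add an auxiliary load-balancing loss so the router does not collapse onto a few experts.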

No public projects found for this technology yet.