arXiv 2401.04088
Mixtral of Experts
By Albert Q. Jiang, Alexandre Sablayrolles, et al.
Published 2024-01-08
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be diffe…
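The routing step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`top2_route`, `moe_layer`), the toy experts, and the fixed router scores are all hypothetical, and the choice to weight the two expert outputs by a softmax over only the two selected router scores is an assumption here, since the abstract says only that the outputs are combined.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and return [(index, weight), ...].

    Assumption: weights come from a softmax over the two selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Subtract the max before exponentiating for numerical stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router, experts):
    """Apply a sparse mixture-of-experts feedforward step to one token vector x.

    Only the two selected experts are evaluated; the rest are skipped.
    """
    routes = top2_route(router(x))
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): 8 experts, expert k scales its input by k + 1.
experts = [lambda x, k=k: [(k + 1) * xi for xi in x] for k in range(8)]

# Hypothetical router that always prefers experts 2 and 5 equally.
router = lambda x: [0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0]

result = moe_layer([1.0, -1.0], router, experts)
print(result)
```

In a real model the router would be a learned linear layer and each expert a feedforward block; the sketch only shows that, per token, two of the eight experts run and their outputs are mixed.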