arXiv 2401.04088

Mixtral of Experts

By Albert Q. Jiang, Alexandre Sablayrolles, et al.

Published 2024-01-08

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be diffe…
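The routing step described in the abstract can be illustrated with a minimal sketch, which is not the released Mixtral implementation. It assumes the two selected experts are weighted by a softmax over their router logits; the excerpt above says only that their outputs are combined. The names `top2_route`, `moe_layer`, and `make_expert` are hypothetical, and the toy experts simply scale their input rather than acting as real feedforward blocks.

```python
import math


def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their scores.

    Assumption: weights come from a softmax over the two selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def moe_layer(x, router_weight, experts):
    """Route one token vector through two of the available experts.

    Only the two selected experts are evaluated; the rest are skipped.
    """
    logits = matvec(router_weight, x)  # one router score per expert
    chosen, gates = top2_route(logits)
    out = [0.0] * len(x)
    for idx, gate in zip(chosen, gates):
        expert_out = experts[idx](x)
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out, chosen


def make_expert(scale):
    """Hypothetical toy expert: scales its input by a constant."""
    return lambda x: [scale * v for v in x]


# Toy setup: 3 experts, 2-dimensional token state.
router_weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
output, chosen = moe_layer([1.0, 1.0], router_weight, experts)
```

In this toy setup the router scores are 1, 1, and -2, so the first two experts are chosen with equal weight and the third is never evaluated. That is the sense in which each token sees only two experts per layer.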
