arXiv 2006.16668
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
By Dmitry Lepikhin, HyoukJoong Lee, et al.
Published 2020-06-30
Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although scaling is widely regarded as a reliable route to better model quality, it raises challenges such as computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module…