arXiv 2208.07339

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

By Tim Dettmers, Mike Lewis, et al.

Published 2022-08-15

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediatel…
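To make the idea of an Int8 matrix multiplication concrete, here is a minimal pure-Python sketch. The excerpt above does not say which quantization scheme the paper uses, so this assumes plain symmetric absmax scaling (one scale per activation row and per weight column) and may differ from the authors' actual procedure. The helper names `quantize_absmax`, `int8_matmul`, `float_matmul`, and `weight_bytes` are hypothetical and chosen only for illustration. The last helper checks the memory claim, assuming a 16-bit baseline and ignoring the small overhead of storing the scales.

```python
def quantize_absmax(vec):
    """Map floats to integers in [-127, 127] plus one float scale (assumed scheme)."""
    absmax = max(abs(v) for v in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in vec]
    return q, scale


def int8_matmul(X, W):
    """Approximate X @ W with integer products, then rescale to floats.

    X is quantized per row and W per column, so every output element
    needs exactly one pair of scales.
    """
    qx = [quantize_absmax(row) for row in X]
    qw = [quantize_absmax(list(col)) for col in zip(*W)]
    out = []
    for x_row, sx in qx:
        out_row = []
        for w_col, sw in qw:
            acc = sum(a * b for a, b in zip(x_row, w_col))  # integer accumulation
            out_row.append(acc * sx * sw)                   # dequantize
        out.append(out_row)
    return out


def float_matmul(X, W):
    """Reference floating-point product used to measure quantization error."""
    cols = list(zip(*W))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def weight_bytes(n_params, bits):
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits // 8


X = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
W = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.9]]

approx = int8_matmul(X, W)
exact = float_matmul(X, W)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))

print("max abs error:", max_err)
print("175B weights at 16-bit:", weight_bytes(175_000_000_000, 16), "bytes")
print("175B weights at 8-bit: ", weight_bytes(175_000_000_000, 8), "bytes")
```

On this toy input the quantized product stays close to the floating-point reference, and the byte counts show the halving relative to 16-bit storage. Whether accuracy is retained at scale is the paper's claim and is not something this sketch demonstrates.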
