arXiv 2208.07339
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
By Tim Dettmers, Mike Lewis, et al.
Published 2022-08-15
Wiki summary
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediatel…
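
To make the idea of an Int8 matrix multiplication concrete, here is a minimal NumPy sketch. It is not the paper's exact procedure, which the truncated abstract above does not spell out. It assumes plain symmetric absmax quantization with one scale per row of the input and one per column of the weight. The names quantize_absmax, int8_matmul, and weight_gigabytes are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def quantize_absmax(x, axis):
    """Symmetric int8 quantization with one scale per slice along `axis`.

    Assumption: plain absmax scaling, used here only as an illustration.
    """
    absmax = np.max(np.abs(x), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero slices.
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_matmul(x, w):
    """Approximate x @ w using int8 operands and int32 accumulation."""
    xq, sx = quantize_absmax(x, axis=1)  # one scale per row of x, shape (m, 1)
    wq, sw = quantize_absmax(w, axis=0)  # one scale per column of w, shape (1, n)
    # Accumulate in int32 so the sums of int8 products cannot overflow.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    # Undo both scalings to return to floating point.
    return acc.astype(np.float32) * sx * sw


def weight_gigabytes(n_params, bytes_per_param):
    """Weight storage only; ignores activations and runtime overhead."""
    return n_params * bytes_per_param / 1e9
```

Under these assumptions, weight_gigabytes(175e9, 2) gives 350 GB for a 16-bit checkpoint and weight_gigabytes(175e9, 1) gives 175 GB in Int8, which is the halving the abstract refers to. The per-row and per-column scales are a design choice for this sketch: they keep one large value from degrading the precision of the whole matrix.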