arXiv 2208.07339
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
By Tim Dettmers, Mike Lewis, et al.
Published 2022-08-15
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediatel…
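To make the idea of an Int8 matrix multiplication and the halved memory footprint concrete, here is a minimal NumPy sketch. It assumes a generic absmax scheme with one scale per row of the input and one per column of the weight; the abstract excerpt above does not spell out the quantization details, so this should not be read as the authors' exact procedure. The function names (`absmax_quantize_rows`, `int8_matmul`, `weight_memory_gb`) are hypothetical, and the memory estimate counts weights only, ignoring activations and the stored scales.

```python
import numpy as np

def absmax_quantize_rows(x):
    """Quantize each row of x to int8 using that row's own absmax scale (assumed scheme)."""
    scale = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Approximate x @ w with int8 operands and int32 accumulation."""
    xq, sx = absmax_quantize_rows(x)      # one scale per row of x, shape (m, 1)
    wq, sw = absmax_quantize_rows(w.T)    # one scale per column of w, shape (n, 1)
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)  # integer accumulation, shape (m, n)
    return acc * sx * sw.T                # rescale back to floating point

def weight_memory_gb(n_params, bytes_per_param):
    """Weight storage in GB (10^9 bytes); ignores activations and scale factors."""
    return n_params * bytes_per_param / 1e9

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 8)).astype(np.float32)

approx = int8_matmul(x, w)
exact = x @ w
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error of int8 matmul: {rel_err:.4f}")

# 175B parameters at 2 bytes (16-bit) versus 1 byte (Int8) per weight
print(weight_memory_gb(175e9, 2), "GB ->", weight_memory_gb(175e9, 1), "GB")
```

The last line reproduces the arithmetic behind the "half the memory" claim for a 175B-parameter model stored in 16-bit: 350 GB of weights become 175 GB in Int8.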