arXiv 2212.09720
The case for 4-bit precision: k-bit Inference Scaling Laws
By Tim Dettmers and Luke Zettlemoyer
Published 2022-12-19
Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for a smaller memory footprint and lower inference latency. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same total number of bits but may have very different zero-shot accuracie…
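The arithmetic behind the 30B/60B comparison can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names `total_model_bits` and `bits_to_gigabytes` are hypothetical, and the calculation assumes every parameter is stored at exactly the stated precision, ignoring any overhead a real quantization scheme adds (for example, per-block scaling constants).

```python
def total_model_bits(num_params: int, bits_per_param: int) -> int:
    """Total bits needed to store the weights (hypothetical helper).

    Assumes a uniform precision for all parameters and no quantization
    metadata, which is a simplification.
    """
    return num_params * bits_per_param


def bits_to_gigabytes(total_bits: int) -> float:
    """Convert a bit count to gigabytes (1 GB = 10**9 bytes)."""
    return total_bits / 8 / 1e9


# The two configurations from the abstract's example.
bits_30b_8bit = total_model_bits(30_000_000_000, 8)
bits_60b_4bit = total_model_bits(60_000_000_000, 4)

print(bits_30b_8bit == bits_60b_4bit)    # same bit budget
print(bits_to_gigabytes(bits_30b_8bit))  # weight storage in GB
```

Under these assumptions both configurations occupy the same weight storage, so the bit budget alone cannot tell which one is more accurate; that is the question the comparison raises.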