arXiv 2212.09720

The case for 4-bit precision: k-bit Inference Scaling Laws

By Tim Dettmers and Luke Zettlemoyer

Published 2022-12-19

Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for a smaller memory footprint and lower inference latency. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same total number of bits but may have very different zero-shot accuracie…
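
The size comparison in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `total_model_bits` and `bits_to_gigabytes` are hypothetical, and it assumes the weight storage is simply parameters times bits per parameter, ignoring any extra quantization metadata a real method might store.

```python
def total_model_bits(num_params: int, bits_per_param: int) -> int:
    """Bits needed to store the weights (assumption: no quantization overhead)."""
    return num_params * bits_per_param


def bits_to_gigabytes(total_bits: int) -> float:
    """Convert a bit count to decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_bits / 8 / 1e9


# The two configurations named in the abstract.
bits_30b_8bit = total_model_bits(30_000_000_000, 8)
bits_60b_4bit = total_model_bits(60_000_000_000, 4)

print(bits_30b_8bit == bits_60b_4bit)    # True: both are 240 billion bits
print(bits_to_gigabytes(bits_30b_8bit))  # 30.0
```

Under this simplification both models occupy roughly 30 GB, which is why total bits, rather than parameter count alone, is the natural axis for comparing them.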
