arXiv 2212.09720

The case for 4-bit precision: k-bit Inference Scaling Laws

By Tim Dettmers and Luke Zettlemoyer

Published 2022-12-19

Wiki summary


Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for a smaller memory footprint and lower inference latency. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracie…
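
The bit-count comparison in the example above can be checked with simple arithmetic. The following is a minimal sketch, assuming total model bits are just the parameter count times the bits per parameter; it ignores any storage overhead a real quantization scheme may add (for example, scaling constants), and the function names are hypothetical, chosen for illustration only.

```python
def total_model_bits(num_params: int, bits_per_param: int) -> int:
    """Total bits to store the weights (assumption: no quantization overhead)."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: int) -> float:
    """Convert bits to decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits / 8 / 1e9


# The two configurations named in the summary.
bits_30b_8bit = total_model_bits(30_000_000_000, 8)
bits_60b_4bit = total_model_bits(60_000_000_000, 4)

print(bits_30b_8bit == bits_60b_4bit)    # True: both are 240e9 bits
print(bits_to_gigabytes(bits_30b_8bit))  # 30.0
```

Under this simplification both models occupy the same 30 GB of weight storage, which is why the comparison between them comes down to accuracy rather than size.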
