arXiv 2403.10444
Block Verification Accelerates Speculative Decoding
By Ziteng Sun, Uri Mendlovic, et al.
Published 2024-03-15
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens, which are then verified in parallel by the target model, and it guarantees that the output is distributed identically to a sample from the target model. In prior work, draft verification is performed independently, token by token. Surprisingly, we sho…
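The token-by-token verification that the abstract attributes to prior work can be sketched as follows. This is a minimal illustration of the standard rejection rule from the speculative decoding literature, not the block verification scheme this paper proposes. The function name `token_verify`, the list-of-lists representation of the distributions, and the `rng` parameter are assumptions made for the example.

```python
import random


def token_verify(draft_tokens, q_dists, p_dists, rng=random):
    """Token-level verification of a drafted block (illustrative sketch).

    draft_tokens: token ids proposed by the draft model, one per position.
    q_dists[i]:   draft-model distribution (list of probabilities) at position i.
    p_dists[i]:   target-model distribution at position i; it holds one more
                  entry than q_dists, used when every draft token is accepted.
    Returns the accepted prefix followed by exactly one extra token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft token with probability min(1, p(x) / q(x)).
        # q[x] > 0 because x was sampled from q.
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # then discard the rest of the draft.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: sample one more token from the target.
    bonus = p_dists[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

Each position is accepted or rejected using only its own pair of probabilities, which is the independence the abstract refers to. In practice the target distributions would come from a single parallel forward pass of the target model over the drafted block.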