arXiv 2403.10444
Block Verification Accelerates Speculative Decoding
By Ziteng Sun, Uri Mendlovic, et al.
Published 2024-03-15
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we sho…
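For context, the token-by-token verification that the abstract attributes to prior work can be sketched as below. This is a minimal illustration of the standard accept/reject rule from earlier speculative decoding work, not the block verification method this paper proposes. The function names (`sample_index`, `verify_token_by_token`) and the list-based distribution format are assumptions made for the example; a real system would take these probabilities from the draft and target models evaluated along the drafted prefix.

```python
import random


def sample_index(weights, rng):
    """Draw an index with probability proportional to its (unnormalized) weight."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def verify_token_by_token(draft_tokens, p_draft, p_target, rng):
    """Token-level verification of a drafted block (hypothetical helper).

    draft_tokens: k token ids proposed by the draft model.
    p_draft[i]:   draft distribution used to sample draft_tokens[i].
    p_target[i]:  target distribution at the same position; p_target has
                  k + 1 entries so a bonus token can follow a full accept.
    Returns the accepted prefix plus one token from the target side.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        q, p = p_draft[i], p_target[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0), which is
        # renormalized implicitly by the weighted draw, then stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(sample_index(residual, rng))
        return out
    # Every drafted token was accepted: add a bonus token from the target.
    out.append(sample_index(p_target[len(draft_tokens)], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    q = [[0.5, 0.5], [0.5, 0.5]]                # toy draft distributions
    p = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]    # toy target distributions
    draft = [sample_index(q[0], rng), sample_index(q[1], rng)]
    print(draft, verify_token_by_token(draft, q, p, rng))
```

Each position is accepted or rejected on its own, and the first rejection discards the rest of the block. The residual resampling step is what keeps the output distributed as a sample from the target model, which is the lossless guarantee described above.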