arXiv 2403.10444
Block Verification Accelerates Speculative Decoding
By Ziteng Sun, Uri Mendlovic, et al.
Published 2024-03-15
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens, which are then verified in parallel by the target model, and it guarantees that the output is distributed identically to a sample from the target model. In prior work, draft verification is performed independently, token by token. Surprisingly, we sho…
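
For reference, the token-by-token verification attributed above to prior work can be sketched as follows. This is a minimal sketch of the standard speculative sampling acceptance rule, not the block verification this paper proposes. The function names `sample_from` and `token_verify`, and the layout of the probabilities as plain lists indexed by token id, are assumptions made for illustration only.

```python
import random


def sample_from(probs, rng):
    """Draw a token id from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def token_verify(draft_tokens, p_draft, p_target, rng=None):
    """Token-by-token verification of a drafted block (baseline sketch).

    draft_tokens: token ids proposed by the draft model.
    p_draft[i]:   draft distribution at position i (list over the vocabulary).
    p_target[i]:  target distribution at position i; one extra entry at the
                  end supplies the distribution for the bonus token.
    Returns the tokens actually emitted for this block.
    """
    rng = rng or random
    out = []
    for i, x in enumerate(draft_tokens):
        q, p = p_draft[i], p_target[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # On rejection, resample from the normalized residual max(p - q, 0)
        # and stop: later draft tokens are discarded.
        residual = [max(pv - qv, 0.0) for pv, qv in zip(p, q)]
        out.append(sample_from(residual, rng))
        return out
    # Every draft token was accepted: emit one bonus token from the target.
    out.append(sample_from(p_target[len(draft_tokens)], rng))
    return out
```

Each position is accepted or rejected on its own, which is the independence the abstract refers to. The accept-or-resample step is what keeps each emitted token distributed according to the target model rather than the draft model.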