arXiv 2403.10444

Block Verification Accelerates Speculative Decoding

By Ziteng Sun, Uri Mendlovic, et al.

Published 2024-03-15

Wiki summary


Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens, which are then verified in parallel by the target model, and it guarantees that the output is distributed identically to a sample from the target model. In prior work, draft verification is performed independently token-by-token. Surprisingly, we sho…
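To make the token-by-token baseline concrete, here is a minimal sketch of the standard speculative sampling rule used in prior work: accept each drafted token x with probability min(1, p(x)/q(x)), and at the first rejection resample from the normalized residual max(0, p - q). This illustrates the baseline only, not the paper's block verification. The function names (`sample`, `token_verify`) and the plain-list representation of distributions are assumptions made for illustration; a real system would take p and q from the target and draft models' outputs.

```python
import random


def sample(dist, rng):
    """Draw an index from a list of probabilities that sums to 1."""
    r = rng.random()
    acc = 0.0
    for i, prob in enumerate(dist):
        acc += prob
        if r < acc:
            return i
    return len(dist) - 1  # guard against floating-point rounding


def token_verify(draft_tokens, q_dists, p_dists, rng):
    """Token-by-token verification of a drafted block (hypothetical helper).

    draft_tokens: token ids x_1..x_k proposed by the draft model
    q_dists[i]:   draft distribution at position i (k entries)
    p_dists[i]:   target distribution at position i (k + 1 entries;
                  the last one is used only if every draft is accepted)
    Returns the accepted prefix plus exactly one extra token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # then stop, since later drafts were conditioned on x.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out
    # All k drafts accepted: draw one bonus token from the target.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out
```

For example, with `rng = random.Random(0)`, calling `token_verify([0, 1], [[0.5, 0.5]] * 2, [[0.5, 0.5]] * 3, rng)` accepts both drafts (the two models agree) and appends one bonus token. Each position is decided on its own here, which is the independence that the abstract contrasts with verifying the block jointly.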
