arXiv 2502.15470

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

By Yintao He, Haiyu Mao, et al.

Published 2025-02-21

Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Som…
