arXiv 2511.02132

Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

By Mansi Choudhary, Karthik Sangaiah, et al.

Published 2025-11-03

The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA ef…
