arXiv 2511.02132
Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
By Mansi Choudhary, Karthik Sangaiah, et al.
Published 2025-11-03
The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA ef…