arXiv 2511.02132

Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

By Mansi Choudhary, Karthik Sangaiah, et al.

Published 2025-11-03

The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA ef…
