arXiv 2601.22156
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
By Yingfa Chen, Zhen Leng Thai, et al.
Published 2026-01-29
Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through pa…