arXiv 2601.22156

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

By Yingfa Chen, Zhen Leng Thai, et al.

Published 2026-01-29

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through pa…
