arXiv 2407.02646
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
By Daking Rai, Yilun Zhou, et al.
Published 2024-07-02
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these ins…