arXiv 2604.03436

MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

By Matthew Levinson

Published 2026-04-03

Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across sema…
