arXiv 2507.14805

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

By Alex Cloud, Minh Le, et al.

Published 2025-07-20

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove refe…
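
The filtering step mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a valid teacher completion is a comma-separated list of non-negative integers; the regular expression, the function names, and the sample strings are hypothetical and are not taken from the paper. The idea is only that any completion containing anything other than numbers is discarded before the student is trained.

```python
import re

# Hypothetical format rule (an assumption, not the paper's exact filter):
# one or more non-negative integers separated by commas, optional whitespace.
NUMBER_SEQUENCE = re.compile(r"\d+(?:\s*,\s*\d+)*")


def is_number_sequence(completion: str) -> bool:
    """Return True if the completion consists solely of a number sequence."""
    return NUMBER_SEQUENCE.fullmatch(completion.strip()) is not None


def filter_completions(completions):
    """Keep only completions that pass the format rule, preserving order."""
    return [c for c in completions if is_number_sequence(c)]


# Made-up teacher outputs, for illustration only.
teacher_outputs = ["693, 738, 556", "I love owls! 12, 14", "7,8,9", ""]
student_dataset = filter_completions(teacher_outputs)
```

With these made-up inputs, only the two purely numeric strings survive. A strict whitelist such as this rejects anything that is not a number sequence, rather than searching for specific trait-related words.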
