arXiv 2602.12237
Olmix: A Framework for Data Mixing Throughout LM Development
By Mayee F. Chen, Tyler Murray, et al.
Published 2026-02-12
Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across exi…