Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

10/05/2025 13 min

Listen "Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework"

Episode Synopsis

This paper proposes a new method for optimizing the data mixtures used to train large language models (LLMs). Traditional approaches often rely on costly trial and error or on deterministic extrapolations that ignore uncertainty, which limits their effectiveness and transferability. The authors introduce a multi-fidelity, multi-scale Bayesian optimization framework that treats data curation as a sequential decision-making process, in which the data mixture, model scale, and training duration are chosen adaptively to balance training cost against expected performance gains. The framework uses a probabilistic surrogate to capture performance uncertainty explicitly, so that cheaper, smaller-scale experiments can inform decisions about larger, more costly training runs. Empirical results show that this approach, even with simple instantiations, significantly accelerates the search for optimal data mixtures compared to existing methods.
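To make the idea concrete, here is a minimal sketch of cost-aware, multi-fidelity Bayesian optimization over mixture weights, assuming a Gaussian process surrogate over (mixture, fidelity) features and an expected-improvement-per-cost acquisition rule. The fidelity levels, the cost table, and the toy_loss stand-in for a real training run are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a simple cost-aware multi-fidelity Bayesian
# optimization loop over data mixture weights. All names and numbers
# (FIDELITIES, COSTS, toy_loss) are hypothetical stand-ins.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
FIDELITIES = [0.1, 0.5, 1.0]               # relative model scale / training budget (assumed)
COSTS = {0.1: 1.0, 0.5: 10.0, 1.0: 100.0}  # assumed relative cost of one training run

def toy_loss(weights, fidelity):
    """Hypothetical stand-in for the validation loss of a model trained on this mixture."""
    optimum = np.array([0.5, 0.3, 0.2])    # pretend the best mixture is known
    bias = 0.05 * (1.0 - fidelity)         # small-scale runs are cheap but slightly biased
    return float(np.sum((weights - optimum) ** 2) + bias + 0.01 * rng.standard_normal())

def sample_mixtures(n, k=3):
    """Draw n random mixture weight vectors from the probability simplex."""
    return rng.dirichlet(np.ones(k), size=n)

# Warm-start the surrogate with cheap, low-fidelity experiments only.
X, y = [], []
for w in sample_mixtures(8):
    X.append(np.append(w, FIDELITIES[0]))
    y.append(toy_loss(w, FIDELITIES[0]))

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for step in range(20):
    gp.fit(np.array(X), np.array(y))
    best = min(y)

    # Score candidate (mixture, fidelity) pairs by expected improvement at that
    # fidelity, discounted by the cost of running the experiment.
    candidates = sample_mixtures(256)
    best_score, best_choice = -np.inf, None
    for f in FIDELITIES:
        feats = np.hstack([candidates, np.full((len(candidates), 1), f)])
        mu, sigma = gp.predict(feats, return_std=True)
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        score = ei / COSTS[f]
        i = int(np.argmax(score))
        if score[i] > best_score:
            best_score, best_choice = score[i], (candidates[i], f)

    w, f = best_choice
    X.append(np.append(w, f))
    y.append(toy_loss(w, f))

print("best mixture found:", X[int(np.argmin(y))][:3])
```

In this sketch the optimizer spends most of its budget on cheap low-fidelity runs and only escalates to expensive ones when the cost-discounted expected improvement justifies it, which mirrors the paper's intuition of letting small-scale experiments guide larger training runs.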
