A Survey on Efficient LLM Training: From Data-centric Perspectives
Jul 31, 2025
Junyu Luo
Bohan Wu
Xiao Luo
Zhiping Xiao
Yiqiao Jin
Rong-Cheng Tu
Nan Yin
Yifan Wang
Jingyang Yuan
Wei Ju
Ming Zhang
Abstract
Efficient training of large language models has become a central concern as model and data scales grow. This survey reviews efficient LLM training from a data-centric perspective, organizing techniques around data selection, mixing, ordering, and synthesis. We discuss trade-offs between compute, data quality, and downstream performance, and identify open challenges in scaling data-centric efficiency to frontier LLMs.
Type
Publication
Annual Meeting of the Association for Computational Linguistics (ACL) 2025, Main Conference