A Survey on Efficient LLM Training: From Data-centric Perspectives

Jul 31, 2025 · Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin, Rong-Cheng Tu, Nan Yin, Yifan Wang, Jingyang Yuan, Wei Ju, Ming Zhang
Abstract
Efficient training of large language models has become a central concern as model and data scales grow. This survey reviews efficient LLM training from a data-centric perspective, organizing techniques around data selection, mixing, ordering, and synthesis. We discuss trade-offs between compute, data quality, and downstream performance, and identify open challenges in scaling data-centric efficiency to frontier LLMs.
Publication
Annual Meeting of the Association for Computational Linguistics (ACL) 2025, Main Conference

Yiqiao Jin
Authors
Ph.D. Candidate in Computer Science
My research focuses on adaptive and efficient AI systems, with emphasis on LLM agents, agent memory, self-distillation, multimodal LLMs, and structured multi-agent intelligence.