[R] I'm looking for good research papers on compute optimization during model training: ways to reduce FLOPs, memory usage, and wall-clock training time without hurting convergence.
I'm interested in topics like mixed precision, gradient checkpointing, optimizer efficiency, sparsity, distributed training (ZeRO, tensor/pipeline parallelism), and compute-optimal scaling laws (e.g., Chinchilla-style work). Practical papers that apply to real multi-GPU setups would be especially helpful.
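For concreteness, here's roughly the kind of setup I have in mind: a minimal PyTorch sketch combining mixed precision (autocast + GradScaler) with activation checkpointing. The model, sizes, and hyperparameters are placeholders, not a real training recipe:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Placeholder model: a stack of linear layers standing in for transformer blocks.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so fp16 grads don't underflow

def forward_with_checkpointing(x):
    # Recompute each block's activations during backward instead of storing them:
    # roughly one extra forward pass of compute in exchange for large
    # activation-memory savings.
    for block in model:
        x = checkpoint(block, x, use_reentrant=False)
    return x

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in fp16 where safe; accumulations stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = forward_with_checkpointing(x).pow(2).mean()  # dummy loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales grads, skips the step on inf/nan
    scaler.update()
```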
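And by Chinchilla-style work I mean the compute-optimal allocation question. As a back-of-envelope sketch, using the standard C ≈ 6·N·D FLOP approximation and Chinchilla's ~20 tokens-per-parameter rule of thumb (numbers are illustrative):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal params/tokens for a FLOP budget, assuming C ~ 6*N*D
    and the Chinchilla rule of thumb D ~ 20*N, hence C ~ 120*N**2."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check: Chinchilla's budget (~5.9e23 FLOPs) gives ~70B params, ~1.4T tokens.
print(chinchilla_optimal(5.9e23))
```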
Any solid recommendations?