GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
arXiv:2602.17817v1 Announce Type: new

Abstract: Collocating deep learning training tasks improves GPU utilization but causes drastic slowdowns due to resource contention and risks Out-of-Memory (OOM) failures. Accurate memory estimation is essential for robust collocation, while GPU utilization, a key proxy for resource contention, enables interference-aware scheduling to reduce slowdowns and improve throughput. Existing GPU memory estimators span three paradigms (analytical models, CPU-side libraries, and ML-based estimators), each with distinct limitations: dependence on detailed model […]
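To illustrate the first paradigm mentioned above, the following is a minimal sketch of an analytical training-memory estimator. It is not the paper's method; it only encodes the common back-of-the-envelope accounting (weights + gradients + optimizer states + activations), and the function name, parameters, and the Adam-style default of two fp32 optimizer states per parameter are all assumptions for illustration:

```python
def estimate_training_memory_bytes(
    n_params: int,
    param_bytes: int = 4,        # fp32 weights; 2 for fp16/bf16
    grad_bytes: int = 4,         # gradient precision
    n_optimizer_states: int = 2, # e.g. Adam keeps two fp32 moments per parameter
    optimizer_state_bytes: int = 4,
    activation_bytes: int = 0,   # activations depend on batch size and architecture;
                                 # a real estimator must model them per layer
) -> int:
    """Rough analytical lower bound on per-GPU training memory.

    Ignores framework overhead, fragmentation, and temporary buffers,
    which is one reason purely analytical estimators under-predict
    real usage and risk OOM under collocation.
    """
    per_param = param_bytes + grad_bytes + n_optimizer_states * optimizer_state_bytes
    return n_params * per_param + activation_bytes


# Example: 1M fp32 parameters with Adam -> 16 bytes/param = 16 MB (excl. activations)
print(estimate_training_memory_bytes(1_000_000))
```

A sketch like this highlights the abstract's point: the static term is easy, but activation memory and allocator overhead require detailed model information, which is exactly the dependence that limits analytical estimators.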