[D] During long training sessions, how do you manage to get your code to work in the first couple of tries?

I’ve tried doing sanity checks and they work great for the most part, but what if the model only fails on one part of the data, or on a single instance? How do you watch out for something like that so that hours of GPU compute don’t just go down the drain? I’ve also heard about saving weights/progress at certain checkpoints, but how would that work for other tasks such as model evals?
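
To make the eval part of the question concrete, here is one way the checkpoint idea might carry over: write each example's result to disk as soon as it is computed, record failures instead of crashing, and skip anything already recorded when the job restarts. This is only a minimal sketch using the Python standard library. The names `run_eval`, `load_done`, and `evaluate_one`, the `"id"` key on each example, and the JSONL results file are all assumptions made up for illustration, not part of any framework.

```python
import json
import os


def load_done(path):
    """Return the ids already recorded in the JSONL results file (hypothetical format)."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["id"])
    return done


def run_eval(examples, evaluate_one, path):
    """Evaluate examples one at a time, appending each result to `path`.

    `evaluate_one` is a user-supplied callable (an assumption here) that
    takes one example dict and returns a JSON-serializable score.
    Ids already present in the file are skipped, so a restarted job
    resumes where the previous one stopped. An exception on a single
    example is stored as an error row instead of aborting the run.
    Note that error rows count as done and are not retried on resume.
    """
    done = load_done(path)
    with open(path, "a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue
            try:
                row = {"id": ex["id"], "score": evaluate_one(ex), "error": None}
            except Exception as e:
                row = {"id": ex["id"], "score": None, "error": repr(e)}
            f.write(json.dumps(row) + "\n")
            f.flush()  # push each row out so a crash loses at most the current example
```

Calling `run_eval(examples, evaluate_one, "results.jsonl")` a second time after a crash would only process the ids missing from the file, and the rows with a non-null `error` show which instances the model or pipeline choked on.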

submitted by /u/Specialist-Pool-6962