Finetuning Large Language Models On A Single GPU Using Gradient Accumulation
Previously, I shared an article on multi-GPU training strategies for speeding up the finetuning of large language models. Several of these strategies rely on mechanisms such as model or tensor sharding, which distribute the model weights and computations across devices to work around GPU memory limitations. However, many of us don't have access to multiple GPUs. So, this article illustrates a simple technique for training models with larger (effective) batch sizes when GPU memory is a constraint: gradient accumulation.
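To make the idea concrete before we dive in, here is a minimal sketch of gradient accumulation in PyTorch. The tiny linear model, random data, and the `accumulation_steps` value are placeholders for illustration, not the setup used later in this article. The key point: gradients from several small micro-batches are accumulated via repeated `backward()` calls, and the optimizer steps only once per accumulation window, which simulates a larger batch size without the corresponding memory footprint.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data standing in for an LLM and its training set
model = nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=4)  # micro-batch small enough to fit in memory

accumulation_steps = 4  # effective batch size = 4 micro-batches * 4 samples = 16

optimizer.zero_grad()
for step, (features, targets) in enumerate(loader):
    loss = criterion(model(features), targets)
    # Scale the loss so the accumulated gradient is an average over the
    # effective batch rather than a sum over micro-batches
    (loss / accumulation_steps).backward()

    # Update the weights only after accumulating gradients across
    # `accumulation_steps` micro-batches, then reset for the next window
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Because `backward()` adds to the `.grad` buffers rather than overwriting them, the only changes compared to a standard training loop are the loss scaling and the deferred `optimizer.step()`/`zero_grad()` calls.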