[D] Why is focal loss not used in LLM training?

I have recently been using focal loss for heavily imbalanced image and text classification tasks and have seen a very large boost in a production environment.

For those who don't know how focal loss works: it reduces the weight of "easy" examples so that the model focuses its learning on "hard" examples.
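Concretely, focal loss just scales the per-example cross entropy by (1 − p_t)^γ, where p_t is the model's probability for the true class. A minimal PyTorch sketch (the scaling is the standard formulation from the RetinaNet paper; the function name and γ default are my choices):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Per-example cross entropy: ce = -log p_t
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # recover p_t from the CE term
    # Scale by (1 - p_t)^gamma: confident ("easy") examples shrink toward 0
    return ((1.0 - p_t) ** gamma * ce).mean()
```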

Now I have been thinking that transformer-based LLMs are essentially an overglorified classifier during training (teacher-forced prediction of the next token at every step). With massive vocabularies (e.g. 256k tokens), isn't this an extremely imbalanced classification task, especially since some tokens are also very easy to predict?
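In that framing, focal loss would be a drop-in replacement for the usual next-token cross entropy. A hypothetical sketch of what I mean (the reshaping and ignore_index handling mirror the standard LM loss setup; the function name and shapes are my assumptions):

```python
import torch
import torch.nn.functional as F

def next_token_focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len),
    # with targets already shifted so position t predicts token t+1
    vocab = logits.size(-1)
    ce = F.cross_entropy(
        logits.reshape(-1, vocab),
        targets.reshape(-1),
        reduction="none",
        ignore_index=ignore_index,
    )
    p_t = torch.exp(-ce)            # per-token prob of the true token
    fl = (1.0 - p_t) ** gamma * ce  # down-weight the easy tokens
    mask = targets.reshape(-1) != ignore_index
    return fl[mask].mean()
```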

For example, in the DeepSeek paper the team trained distillations on teacher-forced reasoning traces, and those traces are full of easy token sequences that push the loss down a lot early on (e.g. "But wait! I need to consider that…"). From my perspective it doesn't make sense to try to improve performance on all tokens equally, as cross entropy does, so why is no one using focal loss to focus only on the hard tokens?
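To put numbers on how hard those easy tokens get squashed, here is what γ = 2 does across a few confidence levels:

```python
import math

for p in (0.99, 0.5, 0.1):  # model prob of the true token
    ce = -math.log(p)       # plain cross entropy
    fl = (1 - p) ** 2 * ce  # focal loss, gamma = 2
    print(f"p={p:.2f}  CE={ce:.4f}  focal={fl:.6f}")
# p=0.99: CE ~0.0101, focal ~1e-6  -- basically removed from the gradient
# p=0.10: CE ~2.3026, focal ~1.87  -- keeps nearly full weight
```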

It would also be interesting to know how an LLM pretrained with focal loss would perform.

Is there anything that I haven’t thought about that would make this not work, or is this simply untested?
