Loss Landscapes: Part 2

And why they matter in machine learning!

Read Loss Landscapes: Part 1

In the previous part, we learned what loss landscapes are, how they are formed, and the difference between simple, convex bowls and rugged, non-convex mountains. But knowing what the landscape looks like is only half the battle. How do we actually find the bottom of those valleys to train our machine learning model?

6. Gradient Descent

Gradient descent is the core optimization algorithm in machine learning. Its entire job is to help the model navigate this complex terrain and find the region with the lowest possible loss.

The goal, you ask? → To reach the global minimum!

The Intuition: Hiking in the Fog

To understand gradient descent, imagine you are hiking down a rugged mountain, but there is a thick, heavy fog. You cannot see the bottom of the valley, and you definitely cannot see the overall shape of the mountain range. You can only feel the slope of the ground directly beneath your feet.

How do you get to the bottom? You use your feet to feel which direction slopes downward, and you take a step in that direction. You repeat this process over and over until the ground finally levels out.

This is exactly how an AI model trains!

  • The Mountain: The loss landscape.
  • Your Current Position: The model’s current weights (parameters).
  • The Slope: The “gradient” (a mathematical calculation of steepness).
  • Taking a Step: Updating the weights to move closer to the minimum.

How the Process Works

When you train a model, you are putting gradient descent to work through a repeating loop:

  1. Start Somewhere: The model begins with random weights, dropping you at a random spot on the mountain.
  2. Compute the Gradient (Feel the slope): The algorithm calculates the gradient at your exact current position. Mathematically, a gradient points in the direction of the steepest uphill climb.
  3. Go Downhill (Update the weights): Since our goal is to minimize loss, the algorithm multiplies the gradient by a negative number to flip the direction, forcing the model to take a step downhill.
  4. Repeat: The model takes step after step, recalculating the slope each time, until it reaches a valley where the loss is low and the ground is flat (a minimum).

Through this continuous process of feeling the slope and stepping downward, gradient descent successfully guides the model to a state where it can make accurate predictions.
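The loop above can be sketched in a few lines of code. Here is a minimal example on a toy one-dimensional loss (the bowl-shaped function and the learning rate are illustrative choices, not from the article):

```python
# Gradient descent on a toy loss: f(w) = (w - 3)^2, a convex bowl
# whose minimum sits at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Analytic slope of the loss: df/dw = 2(w - 3)
    return 2.0 * (w - 3.0)

w = 10.0            # 1. Start somewhere (a "random" spot on the mountain)
learning_rate = 0.1

for step in range(100):
    g = gradient(w)             # 2. Feel the slope
    w = w - learning_rate * g   # 3. Step downhill (note the minus sign)
                                # 4. Repeat

print(round(w, 4))  # → 3.0, the bottom of the bowl
```

The minus sign in the update line is the whole trick: the gradient points uphill, so subtracting it moves the weights downhill.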

Let's take a look at an example path our model follows to reach the global minimum.

Left: gradient descent in motion. Right: the loss landscape.

As you might have noticed in the above diagram, our model followed the algorithm perfectly, but it didn’t end up at the lowest possible point. It got stuck in a smaller, higher valley (a local minimum)!

This is a perfect example of why training AI isn’t always a smooth journey. Because real-world loss landscapes are highly non-convex (like rugged mountain ranges), they are filled with geographical traps that make optimization, well… less than optimal 😅

7. Challenges in Navigating Loss Landscapes

Here are the four most common challenges an AI model faces while exploring a loss landscape:

  • Local Minima: These are deceptive, shallow valleys. To the gradient descent algorithm, a local minimum looks and feels exactly like the true bottom (the global minimum) because every direction points uphill. Once a model rolls into one, it is very hard for it to climb back out.
  • Saddle Points: Imagine the leather saddle on a horse’s back. If you look side-to-side, it slopes down. But if you look front-to-back, it curves up. At the exact center, the ground is completely flat. Algorithms often get temporarily stuck here because the slope drops to zero, confusing the model about which way is actually downhill.
  • Flat Plateaus: These are vast, flat regions of the landscape where the slope is practically zero. If our hiker reaches a plateau, they can’t feel any downward tilt at all. The model’s learning slows to a crawl, and it wanders aimlessly without improving its loss.
  • Sharp Valleys: Think of a steep, narrow canyon. Instead of walking smoothly down the center of the path, the algorithm’s steps might be too wide, causing it to violently bounce back and forth between the steep canyon walls. This makes it incredibly difficult to make forward progress toward the bottom.

8. Methods That Help Optimization

We’ve seen that loss landscapes can be treacherous, filled with traps like local minima and flat plateaus. Fortunately, smart researchers have developed a toolkit of techniques to smooth the path and help our models find their way to a good solution.

Here are five of the most important methods you need to know.

1. Weight Decay (Regularization)

  • The Problem: Sometimes, a model tries too hard to perfectly fit the training data, leading to a complex, wiggly loss landscape. This is called overfitting. The model learns the “noise” in the data instead of the actual pattern.
  • The Solution: Think of weight decay as a tax on complexity. It adds a small penalty to the loss function for having large weights. This forces the model to keep its weights small and simple, preventing it from relying too much on any single feature.
  • The Effect: This effectively smooths out the rugged bumps in the loss landscape, making it easier for gradient descent to roll down into a nice, wide valley that generalizes well to new data.
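In code, the "tax on complexity" is just an extra term in the gradient. Here is a sketch comparing plain gradient descent against the same descent with an L2 weight-decay penalty (the toy loss, learning rate, and decay strength are illustrative assumptions):

```python
# Toy loss (w - 5)^2, optionally with an L2 penalty wd * w^2 added on top.
def step(w, lr=0.1, wd=0.0):
    grad_loss = 2.0 * (w - 5.0)    # gradient of the data-fitting loss
    grad_penalty = 2.0 * wd * w    # gradient of the weight-decay "tax"
    return w - lr * (grad_loss + grad_penalty)

w_plain, w_decay = 10.0, 10.0
for _ in range(200):
    w_plain = step(w_plain, wd=0.0)
    w_decay = step(w_decay, wd=0.1)

print(round(w_plain, 3))  # → 5.0: fits the toy data exactly
print(round(w_decay, 3))  # → 4.545: pulled toward zero, a smaller, simpler weight
```

The decayed weight settles below the data-fitting optimum because the penalty constantly tugs it toward zero, exactly the "keep weights small" effect described above.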

Visualizing the Effect of Weight Decay:

Left: shows a highly rugged, chaotic loss landscape with many sharp peaks and narrow valleys. Right: shows the same general shape but significantly smoothed out, with wider valleys and fewer sharp spikes.

2. Dropout

  • The Concept: This is another powerful technique to prevent overfitting. Imagine training a sports team where, during every practice session, you randomly bench half the players. The remaining players can’t rely on a single “star player” to win; they have to learn to cooperate and develop robust strategies that work regardless of who is on the field.
  • How it Works: During training, dropout randomly “turns off” a percentage of neurons in each layer. This forces the network to learn redundant representations and prevents neurons from co-adapting too closely to each other.
  • The Result: A more robust model that generalizes better to unseen data.

A Neural Network with Dropout:

“X” indicates neurons that are dropped out during training.

NOTE: Dropout only happens during training, when a random subset of neurons is “turned off.” During inference, all neurons are active (turned on).
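Here is a minimal sketch of (inverted) dropout, assuming a drop rate of p = 0.5. Kept activations are rescaled by 1/(1 − p) during training so that the expected output stays the same, which is why inference can simply use all neurons unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # inference: every neuron stays on
    mask = rng.random(activations.shape) > p     # randomly "bench" a fraction p of neurons
    return activations * mask / (1.0 - p)        # rescale survivors so the expected sum is unchanged

a = np.ones(1000)
train_out = dropout(a, p=0.5, training=True)
eval_out = dropout(a, training=False)

print(np.array_equal(eval_out, a))              # True: all neurons active at inference
print(set(np.unique(train_out)) <= {0.0, 2.0})  # True: survivors are scaled by 1/(1 - p)
```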

3. Residual Connections (Skip Connections)

  • The Problem: In very deep networks (with dozens or hundreds of layers), the gradient signal can get weaker and weaker as it travels backward during training, like a message getting garbled in a long game of “telephone.” This is called the vanishing gradient problem.
  • The Solution: Residual connections act like a shortcut. They allow the gradient signal to skip over one or more layers and flow directly to deeper parts of the network. This preserves the strength of the signal, making it possible to train incredibly deep networks like ResNet.

A Residual Block:

Visualizing the Effect of Skip Connections:
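Under the hood, a residual block simply adds its input back to the branch's output: output = x + F(x). A minimal sketch (the tiny linear-plus-ReLU branch and the near-zero initial weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.01   # near-zero weights, as at initialization

def f(x):
    return np.maximum(0.0, W @ x)        # the "residual" branch: linear layer + ReLU

def residual_block(x):
    return x + f(x)                      # the skip connection adds the input back

x = np.ones(4)
y = residual_block(x)
# Even when F's output is tiny, the input (and the gradient flowing back
# along the skip path) passes through the block essentially intact.
print(np.allclose(y, x, atol=0.1))  # True with these near-zero weights
```

The identity path is what keeps the gradient signal from vanishing: during backpropagation, the "+ x" term contributes a gradient of exactly 1, no matter what the layers in F do.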

4. Batch Normalization

  • The Problem: As data flows through a deep network, the distribution of inputs to each layer keeps changing as the weights in previous layers are updated. This is like trying to build a tower on shifting sands — it makes training unstable and slow.
  • The Solution: Batch normalization is a layer that re-centers and re-scales the data before passing it to the next layer. It stabilizes the learning process by ensuring that each layer sees inputs with a consistent mean and variance.
  • The Effect: This has a surprisingly powerful effect on the loss landscape. It drastically smooths out the terrain, turning a chaotic, bumpy ride into a much smoother and faster journey to the minimum. This allows us to use higher learning rates and train models much more quickly.
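The re-centering and re-scaling step is short enough to write out directly. This sketch normalizes each feature across the batch dimension, then applies the learnable scale (gamma) and shift (beta); the example numbers are illustrative:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed PER FEATURE, across the batch (axis 0).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # re-center and re-scale
    return gamma * x_hat + beta              # learnable scale and shift

batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])   # 3 examples, 2 features on wildly different scales

out = batch_norm(batch)
print(np.allclose(out.mean(axis=0), 0.0, atol=1e-6))  # True: each feature re-centered
print(np.allclose(out.std(axis=0), 1.0, atol=1e-2))   # True: each feature re-scaled
```

Note that this simplified version only covers training-time statistics; real implementations also keep running averages of the mean and variance for use at inference.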

5. Layer Normalization (The Transformer’s Best Friend)

  • The Problem: Batch Normalization is amazing for image processing (like CNNs), but it has a major weakness: it relies on having a large “batch” of data at once to calculate its averages. If your batch size is too small, or if you are working with text sequences of different lengths (like sentences in a book), Batch Norm gets confused and unstable.
  • The Solution: Instead of calculating the average across a whole batch of different examples, Layer Normalization calculates the average across all the features of a single data point. Imagine grading a test: Batch Norm grades one specific question across all students, while Layer Norm looks at a single student’s overall performance across all questions.
  • The Effect: It provides the exact same landscape-smoothing, gradient-stabilizing benefits as Batch Normalization, but it works flawlessly for sequence data. It keeps the optimization path stable and prevents the loss landscape from becoming incredibly spiky.
  • Where you see it: If you have heard of Transformers or Large Language Models (like ChatGPT or Gemini), they rely heavily on Layer Normalization to successfully train their billions of parameters without getting lost!

Batch Norm vs Layer Norm
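The only real difference from batch norm is which axis the statistics run over. This sketch normalizes across the features of each single example (axis 1) instead of across the batch (axis 0); the example "token vectors" are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics are computed PER EXAMPLE, across its features (axis 1).
    mean = x.mean(axis=1, keepdims=True)   # one mean per row (per data point)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])         # a "batch" of two token vectors

out = layer_norm(x)
# Each ROW is normalized independently, so the batch size (even 1) never matters.
print(np.allclose(out.mean(axis=1), 0.0, atol=1e-6))  # True: each example re-centered
print(np.allclose(out[0], out[1], atol=1e-3))         # True: same pattern once scale is removed
```

Because each row is self-contained, layer norm behaves identically whether the batch holds one sentence or a thousand, which is exactly why Transformers prefer it.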

9. Conclusion

To wrap things up, by putting the loss landscape at the center of our focus, we haven’t just learned about a single mathematical concept — we’ve actually explored the core engine of machine learning as a whole!


Loss Landscapes: Part 2 was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
