A tutorial on getting right one of the most misunderstood strategies in RL: Exploration vs Exploitation

In this tutorial:

  • You will understand that exploration vs. exploitation is not a button and not just “epsilon”: it is a data-collection strategy that determines what the agent can learn and how good it can become.
  • You will see why the training reward can lie to you: an agent without exploration can look “better” on the graph while actually being weaker in reality.
  • You will learn where exploration actually occurs in a Markov Decision Process (MDP): not only in actions, but also in states and in the agent’s policy, and why this matters enormously.
  • You will understand what it means to exploit a wrong policy, how lock-in occurs, why exploiting too early can destroy learning, and what this looks like in practice.
  • You will learn the different types of exploration in modern RL (epsilon-greedy, entropy bonuses, optimism, uncertainty, curiosity), what each solves, and where each falls short; a minimal epsilon-greedy sketch follows this list.
  • You will learn to interpret the data correctly: when reward means something and when it does not, and what policy entropy, action diversity, state distribution, and seed sensitivity tell you; see the small diagnostics sketch below.
  • You will see everything in practice in a FrozenLake + DQN case study with three exploration regimes: no exploration, heavy exploration, and controlled exploration; you will understand what is really happening and why. A rough sketch of the three schedules appears after the link below.
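To make the first of those mechanisms concrete, here is a minimal illustrative sketch (not the article’s code) of epsilon-greedy action selection with a linear decay schedule; the start, end, and decay_steps values are placeholders, not numbers from the tutorial:

```python
import random

import numpy as np


def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon take a random action (explore), otherwise the greedy one (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


def linear_epsilon(step: int, start: float = 1.0, end: float = 0.05, decay_steps: int = 10_000) -> float:
    """Anneal epsilon linearly from start to end over decay_steps environment steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```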

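For the diagnostics bullet, two simple quantities you can compute from logged rollouts are the entropy of the empirical action distribution and the fraction of states ever visited. This is a hedged sketch of what such checks might look like, not the article’s implementation:

```python
from collections import Counter

import numpy as np


def action_entropy(actions: list[int], n_actions: int) -> float:
    """Shannon entropy (in nats) of the empirical action distribution.

    Very low entropy early in training suggests the agent is already
    exploiting before it has gathered enough data.
    """
    if not actions:
        return 0.0
    counts = Counter(actions)
    probs = np.array([counts.get(a, 0) for a in range(n_actions)], dtype=float)
    probs /= probs.sum()
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def state_coverage(visited_states: list[int], n_states: int) -> float:
    """Fraction of distinct states ever visited: a crude proxy for how broadly the agent explored."""
    return len(set(visited_states)) / n_states
```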
Link: Exploration vs Exploitation in Reinforcement Learning
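As a rough illustration of the case study’s three regimes (under assumed settings, not the article’s code), the sketch below writes them as epsilon schedules on Gymnasium’s FrozenLake-v1, together with a greedy evaluation loop that measures what the policy actually learned, separately from the noisy training reward curve. The tabular q_table is a simplified stand-in for the DQN used in the article, and all hyperparameters are illustrative:

```python
import gymnasium as gym
import numpy as np


# Illustrative epsilon schedules for the three regimes; the article's exact settings may differ.
def no_exploration(step: int) -> float:
    return 0.0                                          # always greedy: risks locking in early


def heavy_exploration(step: int) -> float:
    return 1.0                                          # always random: never exploits what it learned


def controlled_exploration(step: int, decay_steps: int = 20_000) -> float:
    return max(0.05, 1.0 - 0.95 * step / decay_steps)   # anneal from 1.0 down to 0.05


def evaluate_greedy(env: gym.Env, q_table: np.ndarray, episodes: int = 100) -> float:
    """Run the greedy (epsilon = 0) policy to see what was actually learned,
    independently of the training-time reward curve."""
    successes = 0
    for _ in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action = int(np.argmax(q_table[obs]))
            obs, reward, terminated, truncated, _ = env.step(action)
        successes += int(reward > 0)
    return successes / episodes


env = gym.make("FrozenLake-v1", is_slippery=True)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
print("greedy success rate:", evaluate_greedy(env, q_table))
```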

submitted by /u/Capable-Carpenter443