A tutorial on Actor–Critic pitfalls: an unstable Critic, bad reward scaling, missing normalization, a wrong entropy coefficient, and a stuck policy
What you will learn from this tutorial:
- Why Actor–Critic exists, and why Q-learning/DQN and pure policy gradients are not enough for real problems.
- The real limitations of value-based and policy-gradient methods: high variance, instability, delayed feedback, weak exploration, and difficulties with continuous action spaces.
- How Actor–Critic addresses these problems by clearly separating roles (actor = decision, critic = evaluation) and by introducing stable feedback through TD learning.
- How the Actor–Critic cycle works in practice, step by step: observation -> action -> reward -> evaluation -> policy and value updates.
- Why stability in RL is not a matter of luck, how the Critic reduces gradient variance, and the trade-off between stability (low variance) and bias.
- What a Critic that is "too weak" or "too strong" means in practice, how it looks in TensorBoard, and why the Actor sometimes seems "crazy" when the Critic is actually the problem.
- How to choose between V(s), Q(s,a), and the Advantage, what each variant changes in the learning dynamics, and why Advantage Actor–Critic is the modern "sweet spot".
- How the theory connects to real algorithms: how textbook Actor–Critic becomes A2C, A3C, PPO, DDPG, TD3, and SAC.
- The clear difference between on-policy and off-policy, what it means in terms of sample efficiency and stability, and when to use each approach.
- Why PPO is the “workhorse” of modern RL, and in which situations SAC outperforms it, especially in robotics and continuous control.
- In which real-world scenarios Actor–Critic really matters, from robotics and locomotion to finance, energy, and industrial systems where data efficiency and stability are critical.
- How to use Gymnasium intelligently, not as a game: which problems CartPole, Acrobot, and Pendulum actually test, and which insights transfer directly to real robots.
- What a working Actor–Critic looks like in practice, without long code: the logical structure for discrete and continuous action spaces (see the minimal sketch after this list).
- The hyperparameters that really matter (actor vs. critic learning rates, the discount factor, PPO clipping, SAC temperature) and how they influence stability and performance.
- Which plots to watch as a professional, not as a beginner: value loss, policy loss, entropy, reward, TD error, and what they tell you about the agent's health.
- The real pitfalls that few people mention: an unstable Critic, bad reward scaling, missing normalization, a wrong entropy coefficient, and a stuck policy.
- Why Actor–Critic isn’t just theory, but has become the foundation of modern RL — and why, if you understand Actor–Critic, you understand virtually all of RL that matters in the real world.
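To give a concrete flavor of that "logical structure without long code", here is a minimal one-step Advantage Actor–Critic sketch for the discrete case (CartPole), using the TD error as the advantage estimate. This is a sketch under assumptions, not the tutorial's implementation: it assumes gymnasium and PyTorch are installed, the network sizes and learning rates are illustrative rather than tuned, and a real run would add reward normalization, entropy regularization, and batched updates.

```python
# Minimal one-step Advantage Actor-Critic on CartPole (illustrative, untuned).
# Assumes `pip install gymnasium torch`; all hyperparameters are placeholders.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)    # actor learning rate
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # critic usually learns faster
gamma = 0.99                                                  # discount factor

for episode in range(200):
    obs, _ = env.reset()
    done, ep_return = False, 0.0
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)

        # Actor: sample an action from the current stochastic policy.
        dist = torch.distributions.Categorical(logits=actor(s))
        action = dist.sample()

        obs_next, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        ep_return += reward

        # Critic: TD target and TD error (used here as the advantage estimate).
        s_next = torch.as_tensor(obs_next, dtype=torch.float32)
        with torch.no_grad():
            bootstrap = 0.0 if terminated else critic(s_next).item()
        td_error = reward + gamma * bootstrap - critic(s).squeeze()

        # Critic update: regress V(s) toward the TD target.
        critic_opt.zero_grad()
        td_error.pow(2).backward()
        critic_opt.step()

        # Actor update: policy gradient weighted by the detached TD error,
        # so the critic's evaluation steers the actor without mixing gradients.
        actor_opt.zero_grad()
        (-dist.log_prob(action) * td_error.detach()).backward()
        actor_opt.step()

        obs = obs_next
    print(f"episode {episode:3d}  return {ep_return:.0f}")
```

The two learning rates echo the hyperparameter point above: the critic usually gets the larger one so the advantage signal it feeds the actor stays reliable; if the critic lags, the actor looks "crazy" for exactly the reasons the tutorial discusses.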
submitted by /u/Capable-Carpenter443