Why Is the Optimal Policy Deterministic in Standard MDPs?

Something that confused me for a long time:

If policies are probability distributions

π(a | s) 

why is the optimal policy in a standard MDP deterministic?

Step 1 — Bellman Optimality

For any state s:

V*(s) = max over π of Σ_a π(a | s) * q*(s, a) 

where

q*(s, a) = r(s, a) + γ * Σ_{s'} P(s' | s, a) * V*(s') 

So at each state, we are solving:

max over π E_{a ~ π}[ q*(s, a) ] 
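To make the two equations concrete, here is a minimal NumPy sketch that computes q*(s, a) from V* via the backup q*(s, a) = r(s, a) + γ Σ_{s'} P(s'|s,a) V*(s'). The 2-state, 2-action MDP and the value vector V are made-up numbers, purely for illustration:

```python
import numpy as np

gamma = 0.9
# r[s, a]: immediate reward (hypothetical values)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# P[s, a, s']: transition probabilities (each row over s' sums to 1)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
# A stand-in value vector V*(s) (illustrative, not solved for here)
V = np.array([3.0, 5.0])

# q*(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s,a) * V*(s')
q = r + gamma * (P @ V)   # P @ V contracts over the s' axis
print(q)                  # one q-value per (state, action) pair
```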

Step 2 — This Is Just a Weighted Average

Σ_a π(a | s) * q*(s, a) 

is a weighted average:

  • weights ≥ 0
  • weights sum to 1

And a weighted average is always ≤ the maximum element.

Equality holds only if all weight is placed on maximizing elements (if several elements tie for the maximum, any split of the weight among them also achieves equality).
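You can sanity-check the weighted-average bound numerically. The sketch below draws random q-values and random distributions (arbitrary seed and sizes, chosen only for illustration) and verifies that no distribution ever beats the greedy one:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)                 # arbitrary q-values for one state

for _ in range(1000):
    w = rng.random(5)
    w /= w.sum()                       # a valid distribution pi(.|s): w >= 0, sums to 1
    assert w @ q <= q.max() + 1e-12    # weighted average never exceeds the max

# Equality when all mass sits on an argmax action:
w_greedy = np.zeros(5)
w_greedy[q.argmax()] = 1.0
assert np.isclose(w_greedy @ q, q.max())
```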

Step 3 — Conclusion

Therefore, the optimal policy can be written as:

π*(a | s) = 1  if a = argmax_{a'} q*(s, a')
          = 0  otherwise

The optimal policy can be chosen as a deterministic greedy policy.
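Extracting that deterministic greedy policy from a q-table is a one-liner with argmax. A small sketch (the q-values here are hypothetical):

```python
import numpy as np

def greedy_policy(q):
    """Deterministic greedy policy from a q-table:
    pi[s, a] = 1 iff a = argmax_{a'} q[s, a']."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

q = np.array([[4.06, 4.32],
              [3.60, 6.14]])   # hypothetical q*-values
pi = greedy_policy(q)
# Each row of pi is a degenerate distribution putting all mass on the best action.
```

Note: argmax breaks ties by picking the first maximizer, which is one valid choice among the (possibly many) optimal deterministic policies.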

So if the optimal policy in a standard MDP can always be chosen as deterministic and greedy…

why do most modern RL algorithms (PPO, SAC, policy gradients, etc.) explicitly learn stochastic policies?

Is it purely for exploration during training?
Is it an optimization trick to make gradients work?

————————————————————

Proof (Why the optimum is deterministic)

Suppose we want to solve:

max over c1, c2, c3 of c1 q1 + c2 q2 + c3 q3 

subject to:

c1 + c2 + c3 = 1
c1, c2, c3 ≥ 0

This is exactly the same structure as:

max over π Σ_a π(a|s) q(s,a) 

Assume without loss of generality that:

q3 ≥ q1 and q3 ≥ q2 

Then for any valid (c1, c2, c3):

c1 q1 + c2 q2 + c3 q3 ≤ c1 q3 + c2 q3 + c3 q3 = (c1 + c2 + c3) q3 = q3 

So the objective is always ≤ q3.

If q3 is strictly the largest, equality is achieved only when:

c3 = 1
c1 = c2 = 0

(If some q-values tie with q3, any distribution over the tied maximizers also achieves equality.)

Therefore the maximum is obtained by putting all probability mass on the largest q-value.
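The same simplex argument can be checked numerically: sampling many valid (c1, c2, c3) and confirming none exceeds q3, while the vertex (0, 0, 1) attains it. The specific q-values and seed below are arbitrary:

```python
import numpy as np

q1, q2, q3 = 1.0, 2.5, 4.0          # q3 is strictly the largest
qs = np.array([q1, q2, q3])
rng = np.random.default_rng(1)

best = -np.inf
for _ in range(10_000):
    c = rng.dirichlet(np.ones(3))   # uniform random point on the probability simplex
    val = c @ qs
    assert val <= q3 + 1e-12        # the objective never exceeds q3
    best = max(best, val)

# The maximum q3 is attained at the vertex c = (0, 0, 1):
assert np.isclose(np.array([0.0, 0.0, 1.0]) @ qs, q3)
```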

submitted by /u/New-Yogurtcloset1818