Why Is the Optimal Policy Deterministic in Standard MDPs?

Something that confused me for a long time:

If policies are probability distributions

π(a | s) 

why is the optimal policy in a standard MDP deterministic?

Step 1 — Bellman Optimality

For any state s:

V*(s) = max over π of Σ_a π(a | s) * q*(s, a) 

where

q*(s, a) = r(s, a) + γ * Σ_{s'} P(s' | s, a) * V*(s') 

So at each state, we are solving:

max over π E_{a ~ π}[ q*(s, a) ] 
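To make the two equations concrete, here is a minimal NumPy sketch that computes q*(s, a) from V* via the backup q*(s, a) = r(s, a) + γ Σ_{s'} P(s'|s,a) V*(s'). The 2-state, 2-action MDP and the value vector V are made-up numbers, purely for illustration:

```python
import numpy as np

gamma = 0.9
# r[s, a]: immediate reward (hypothetical values)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# P[s, a, s']: transition probabilities (each row over s' sums to 1)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
# A stand-in value vector V*(s) (illustrative, not solved for here)
V = np.array([3.0, 5.0])

# q*(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s,a) * V*(s')
q = r + gamma * (P @ V)   # P @ V contracts over the s' axis
print(q)                  # one q-value per (state, action) pair
```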

Step 2 — This Is Just a Weighted Average

Σ_a π(a | s) * q*(s, a) 

is a weighted average:

  • weights ≥ 0
  • weights sum to 1

And a weighted average is always ≤ the maximum element.

Equality holds only if all weight is placed on maximizing elements (if several elements tie for the maximum, any split of the weight among them also achieves equality).
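You can sanity-check the weighted-average bound numerically. The sketch below draws random q-values and random distributions (arbitrary seed and sizes, chosen only for illustration) and verifies that no distribution ever beats the greedy one:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)                 # arbitrary q-values for one state

for _ in range(1000):
    w = rng.random(5)
    w /= w.sum()                       # a valid distribution pi(.|s): w >= 0, sums to 1
    assert w @ q <= q.max() + 1e-12    # weighted average never exceeds the max

# Equality when all mass sits on an argmax action:
w_greedy = np.zeros(5)
w_greedy[q.argmax()] = 1.0
assert np.isclose(w_greedy @ q, q.max())
```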

Step 3 — Conclusion

Therefore, the optimal policy can be written as:

π*(a | s) = 1  if a = argmax_{a'} q*(s, a')
          = 0  otherwise

The optimal policy can be chosen as a deterministic greedy policy.
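Extracting that deterministic greedy policy from a q-table is a one-liner with argmax. A small sketch (the q-values here are hypothetical):

```python
import numpy as np

def greedy_policy(q):
    """Deterministic greedy policy from a q-table:
    pi[s, a] = 1 iff a = argmax_{a'} q[s, a']."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

q = np.array([[4.06, 4.32],
              [3.60, 6.14]])   # hypothetical q*-values
pi = greedy_policy(q)
# Each row of pi is a degenerate distribution putting all mass on the best action.
```

Note: argmax breaks ties by picking the first maximizer, which is one valid choice among the (possibly many) optimal deterministic policies.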

So if the optimal policy in a standard MDP can always be chosen as deterministic and greedy…

why do most modern RL algorithms (PPO, SAC, policy gradients, etc.) explicitly learn stochastic policies?

Is it purely for exploration during training?
Is it an optimization trick to make gradients work?

————————————————————

Proof (Why the optimum is deterministic)

Suppose we want to solve:

max over c1, c2, c3 of c1 q1 + c2 q2 + c3 q3 

subject to:

c1 + c2 + c3 = 1
c1, c2, c3 ≥ 0

This is exactly the same structure as:

max over π Σ_a π(a|s) q(s,a) 

Assume without loss of generality that:

q3 ≥ q1 and q3 ≥ q2 

Then for any valid (c1, c2, c3):

c1 q1 + c2 q2 + c3 q3 ≤ c1 q3 + c2 q3 + c3 q3 = (c1 + c2 + c3) q3 = q3 

So the objective is always ≤ q3.

If q3 is strictly the largest, equality is achieved only when:

c3 = 1
c1 = c2 = 0

(If some q-values tie with q3, any distribution over the tied maximizers also achieves equality.)

Therefore the maximum is obtained by putting all probability mass on the largest q-value.
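The same simplex argument can be checked numerically: sampling many valid (c1, c2, c3) and confirming none exceeds q3, while the vertex (0, 0, 1) attains it. The specific q-values and seed below are arbitrary:

```python
import numpy as np

q1, q2, q3 = 1.0, 2.5, 4.0          # q3 is strictly the largest
qs = np.array([q1, q2, q3])
rng = np.random.default_rng(1)

best = -np.inf
for _ in range(10_000):
    c = rng.dirichlet(np.ones(3))   # uniform random point on the probability simplex
    val = c @ qs
    assert val <= q3 + 1e-12        # the objective never exceeds q3
    best = max(best, val)

# The maximum q3 is attained at the vertex c = (0, 0, 1):
assert np.isclose(np.array([0.0, 0.0, 1.0]) @ qs, q3)
```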

submitted by /u/New-Yogurtcloset1818