Why does the greedy policy w.r.t. V* satisfy V* = V_{π*}?
I’m trying to understand the exact logic behind this key step in dynamic programming.
We know that V* satisfies the Bellman optimality equation:
V*(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
Now define the greedy policy with respect to V*:
a*(s) = argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
and define the deterministic policy:
π*(a|s) = 1 if a = a*(s), 0 otherwise
Step 1: Plug greedy action into Bellman optimality
Because π* selects the maximizing action:
V*(s) = r(s, a*(s)) + γ Σ_{s'} P(s'|s, a*(s)) V*(s')
This can be written compactly as:
V* = r_{π*} + γ P_{π*} V*
Step 2: Compare with policy evaluation equation
For any fixed policy π, its value function satisfies:
V_π = r_π + γ P_π V_π
This linear equation has a unique solution: for 0 ≤ γ < 1 the policy evaluation
operator T_π v = r_π + γ P_π v is itself a γ-contraction (equivalently, the
matrix I - γ P_π is invertible).
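To make this concrete, here is a minimal numerical sketch that solves V_π = r_π + γ P_π V_π as a linear system. The 2-state chain, P_pi, r_pi, and gamma are all made-up illustrative values, not from any particular source:

```python
import numpy as np

# Hypothetical 2-state MDP under some fixed policy pi (illustrative numbers):
# P_pi is the induced transition matrix, r_pi the induced reward vector.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([1.0, 0.0])

# V_pi = r_pi + gamma * P_pi V_pi  <=>  (I - gamma * P_pi) V_pi = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: V_pi satisfies the policy evaluation equation.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

Because I - γ P_π is invertible for γ < 1, `np.linalg.solve` returns the one and only solution, which is exactly the uniqueness fact the proof relies on.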
Step 3: Conclude equality
We just showed that V* satisfies the Bellman equation for π*:
V* = r_{π*} + γ P_{π*} V*
Since that equation has a unique solution, it follows that:
V* = V_{π*}
Intuition
- Bellman optimality gives V*
- Greedy extraction gives π*
- V* satisfies the Bellman equation for π*
- Uniqueness implies V* = V_{π*}
Therefore, the greedy policy w.r.t. V* is indeed optimal.
———————————————
Proof (contraction → existence/uniqueness → value iteration) for the Bellman optimality equation
Let the Bellman optimality operator T be:
(Tv)(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) v(s') ]
Equivalently (as in some slides):
v = f(v) = max_π ( r_π + γ P_π v )
where f = T (the same operator, written as a maximization over policies).
Assume the standard discounted MDP setting (finite state/action or bounded rewards) and 0 ≤ γ < 1.
Use the sup norm:
||v||_∞ = max_s |v(s)|
1) Contraction property: ||Tv - Tw||_∞ ≤ γ ||v - w||_∞
Fix any two value functions v, w. For each state s, define:
g_a(v;s) = r(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
Then:
(Tv)(s) = max_a g_a(v;s)
(Tw)(s) = max_a g_a(w;s)
Use the inequality:
|max_i x_i - max_i y_i| ≤ max_i |x_i - y_i|
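This max inequality is easy to sanity-check numerically. A spot check on random vectors (not a proof, just a quick confidence builder):

```python
import random

# Spot-check |max_i x_i - max_i y_i| <= max_i |x_i - y_i| on random vectors.
random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(4)]
    y = [random.uniform(-5, 5) for _ in range(4)]
    assert abs(max(x) - max(y)) <= max(abs(a - b) for a, b in zip(x, y)) + 1e-12
```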
So:
|(Tv)(s) - (Tw)(s)| = |max_a g_a(v;s) - max_a g_a(w;s)| ≤ max_a |g_a(v;s) - g_a(w;s)|
Now compute the difference inside:
|g_a(v;s) - g_a(w;s)| = |γ Σ_{s'} P(s'|s,a) (v(s') - w(s'))|
                      ≤ γ Σ_{s'} P(s'|s,a) |v(s') - w(s')|
                      ≤ γ ||v - w||_∞ Σ_{s'} P(s'|s,a)
                      = γ ||v - w||_∞
Therefore, for each s:
|(Tv)(s) - (Tw)(s)| ≤ γ ||v - w||_∞
Taking the max over s:
||Tv - Tw||_∞ ≤ γ ||v - w||_∞
So T is a contraction mapping with modulus γ.
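Here is a quick empirical check of the contraction bound on a randomly generated MDP (the sizes, seed, and distributions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Random illustrative MDP: 3 states, 2 actions.
P = rng.dirichlet(np.ones(3), size=(2, 3))   # P[a, s, :] is a distribution
r = rng.uniform(0, 1, size=(3, 2))           # r[s, a]

def T(v):
    """Bellman optimality operator: (Tv)(s) = max_a [r(s,a) + gamma * sum_s' P v]."""
    return np.max(r + gamma * np.einsum('ast,t->sa', P, v), axis=1)

# Empirically confirm ||Tv - Tw||_inf <= gamma * ||v - w||_inf.
for _ in range(1000):
    v, w = rng.normal(size=3), rng.normal(size=3)
    assert np.max(np.abs(T(v) - T(w))) <= gamma * np.max(np.abs(v - w)) + 1e-12
```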
2) Existence + uniqueness of V* (fixed point)
Since T is a contraction on the complete metric space (R^{|S|}, ||·||_∞), the Banach fixed-point theorem implies:
- There exists a fixed point V* such that:
V* = TV*
- The fixed point is unique.
This is exactly the statement “the BOE has a unique solution v*”.
3) Algorithm: Value Iteration converges exponentially fast
Define the iteration:
v_{k+1} = T v_k
By contraction:
||v_{k+1} - V*||_∞ = ||T v_k - T V*||_∞ ≤ γ ||v_k - V*||_∞
Apply repeatedly:
||v_k - V*||_∞ ≤ γ^k ||v_0 - V*||_∞
So convergence is geometric (“exponentially fast”), and the rate is determined by γ.
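A short sketch of value iteration on a random toy MDP (arbitrary sizes and seed, purely illustrative) that also checks the geometric error bound above:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(1)

# Random illustrative MDP: 3 states, 2 actions.
P = rng.dirichlet(np.ones(3), size=(2, 3))   # P[a, s, :] is a distribution
r = rng.uniform(0, 1, size=(3, 2))           # r[s, a]

def T(v):
    """Bellman optimality operator."""
    return np.max(r + gamma * np.einsum('ast,t->sa', P, v), axis=1)

# Run value iteration far enough that v is (numerically) the fixed point V*.
V_star = np.zeros(3)
for _ in range(2000):
    V_star = T(V_star)

# Check the bound ||v_k - V*||_inf <= gamma^k ||v_0 - V*||_inf along the way.
v = np.zeros(3)
err0 = np.max(np.abs(v - V_star))
for k in range(1, 50):
    v = T(v)
    assert np.max(np.abs(v - V_star)) <= gamma**k * err0 + 1e-10
```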
Once you have V*, a greedy policy is:
π*(s) ∈ argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
and it satisfies V_{π*} = V*.