Why does the greedy policy w.r.t. V* satisfy V* = V_{π*}?

I’m trying to understand the exact logic behind this key step in dynamic programming.

We know that V* satisfies the Bellman optimality equation:

V*(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ] 

Now define the greedy policy with respect to V*:

a*(s) = argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ] 

and define the deterministic policy:

π*(a|s) = 1  if a = a*(s)
π*(a|s) = 0  otherwise

Step 1: Plug greedy action into Bellman optimality

Because π* selects the maximizing action:

V*(s) = r(s, a*(s)) + γ Σ_{s'} P(s'|s, a*(s)) V*(s') 

This can be written compactly as:

V* = r_{π*} + γ P_{π*} V* 

Step 2: Compare with policy evaluation equation

For any fixed policy π, its value function satisfies:

V_π = r_π + γ P_π V_π 

This linear equation has a unique solution, since the policy Bellman operator
T_π(v) = r_π + γ P_π v is a contraction mapping for 0 ≤ γ < 1
(equivalently, I - γ P_π is invertible).
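As a sanity check, policy evaluation can be done as a direct linear solve. A minimal NumPy sketch on a hypothetical random MDP with an arbitrary fixed deterministic policy (all sizes and numbers here are illustrative assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, s'] = P(s'|s,a)
r = rng.random((nS, nA))                  # r[s, a] = r(s,a)

pi = rng.integers(nA, size=nS)            # some fixed deterministic policy
P_pi = P[np.arange(nS), pi]               # P_pi[s, s'] = P(s'|s, pi(s))
r_pi = r[np.arange(nS), pi]               # r_pi[s] = r(s, pi(s))

# V_pi = r_pi + gamma P_pi V_pi  <=>  (I - gamma P_pi) V_pi = r_pi
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Check that V_pi is indeed a fixed point of the policy Bellman equation.
print(np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi))  # prints True
```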

Step 3: Conclude equality

We just showed that V* satisfies the Bellman equation for π*:

V* = r_{π*} + γ P_{π*} V* 

Since that equation has a unique solution, it follows that:

V* = V_{π*} 

Intuition

  • Bellman optimality gives V*
  • Greedy extraction gives π*
  • V* satisfies the Bellman equation for π*
  • Uniqueness implies V* = V_{π*}

Therefore, the greedy policy w.r.t. V* is indeed optimal.
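The whole argument can be checked numerically: solve the BOE for V*, extract the greedy policy, evaluate that policy by a linear solve, and compare. A minimal NumPy sketch on a hypothetical random MDP (all sizes and numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, s'] = P(s'|s,a)
r = rng.random((nS, nA))                  # r[s, a] = r(s,a)

# Solve the Bellman optimality equation by value iteration.
V = np.zeros(nS)
for _ in range(2000):
    V = (r + gamma * P @ V).max(axis=1)   # (Tv)(s) = max_a [...]

# Greedy policy w.r.t. V*.
pi = (r + gamma * P @ V).argmax(axis=1)   # a*(s)

# Policy evaluation: unique solution of V_pi = r_pi + gamma P_pi V_pi.
P_pi = P[np.arange(nS), pi]               # row s is P(.|s, a*(s))
r_pi = r[np.arange(nS), pi]
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

print(np.allclose(V, V_pi, atol=1e-6))    # prints True: V* = V_{pi*}
```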

---

Proof (contraction → existence/uniqueness → value iteration) for the Bellman optimality equation

Let the Bellman optimality operator T be:

(Tv)(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) v(s') ] 

Equivalently (as in some slides):

v = f(v) = max_π ( r_π + γ P_π v ) 

where f = T.

Assume the standard discounted MDP setting (finite state/action or bounded rewards) and 0 ≤ γ < 1.
Use the sup norm:

||v||_∞ = max_s |v(s)| 

1) Contraction property: ||Tv - Tw||_∞ ≤ γ ||v - w||_∞

Fix any two value functions v, w. For each state s, define:

g_a(v;s) = r(s,a) + γ Σ_{s'} P(s'|s,a) v(s') 

Then:

(Tv)(s) = max_a g_a(v;s)
(Tw)(s) = max_a g_a(w;s)

Use the inequality:

|max_i x_i - max_i y_i| ≤ max_i |x_i - y_i| 

(for each i, x_i ≤ y_i + |x_i - y_i| ≤ max_i y_i + max_i |x_i - y_i|, so
max_i x_i - max_i y_i ≤ max_i |x_i - y_i|; swapping x and y gives the other direction).

So:

|(Tv)(s) - (Tw)(s)| = |max_a g_a(v;s) - max_a g_a(w;s)| ≤ max_a |g_a(v;s) - g_a(w;s)| 

Now compute the difference inside:

|g_a(v;s) - g_a(w;s)| = |γ Σ_{s'} P(s'|s,a) (v(s') - w(s'))|
                      ≤ γ Σ_{s'} P(s'|s,a) |v(s') - w(s')|
                      ≤ γ ||v - w||_∞ Σ_{s'} P(s'|s,a)
                      = γ ||v - w||_∞

Therefore, for each s:

|(Tv)(s) - (Tw)(s)| ≤ γ ||v - w||_∞ 

Taking the max over s:

||Tv - Tw||_∞ ≤ γ ||v - w||_∞ 

So T is a contraction mapping with modulus γ.
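A quick numerical check of this bound on random value-function pairs (hypothetical random MDP; the sizes and γ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.8
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, s'] = P(s'|s,a)
r = rng.random((nS, nA))

def T(v):
    # (Tv)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
    return (r + gamma * P @ v).max(axis=1)

for _ in range(100):
    v, w = rng.normal(size=nS), rng.normal(size=nS)
    lhs = np.abs(T(v) - T(w)).max()       # ||Tv - Tw||_inf
    rhs = gamma * np.abs(v - w).max()     # gamma * ||v - w||_inf
    assert lhs <= rhs + 1e-12
print("contraction bound holds on 100 random pairs")
```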

2) Existence + uniqueness of V* (fixed point)

Since T is a contraction on the complete metric space (R^{|S|}, ||·||_∞), the Banach fixed-point theorem implies:

  • There exists a fixed point V* such that:

    V* = TV*

  • The fixed point is unique.

This is exactly the statement: “the BOE has a unique solution v*”.

3) Algorithm: Value Iteration converges exponentially fast

Define the iteration:

v_{k+1} = T v_k 

By contraction:

||v_{k+1} - V*||_∞ = ||T v_k - T V*||_∞ ≤ γ ||v_k - V*||_∞ 

Apply repeatedly:

||v_k - V*||_∞ ≤ γ^k ||v_0 - V*||_∞ 

So convergence is geometric (“exponentially fast”), and the rate is determined by γ.
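A short sketch of value iteration together with the bound ||v_k - V*||_∞ ≤ γ^k ||v_0 - V*||_∞ (hypothetical random MDP; all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 5, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, s'] = P(s'|s,a)
r = rng.random((nS, nA))

def T(v):
    # Bellman optimality operator.
    return (r + gamma * P @ v).max(axis=1)

# Reference V* from many iterations (error ~ gamma^5000, negligible).
V_star = np.zeros(nS)
for _ in range(5000):
    V_star = T(V_star)

# Value iteration from a random start, checking the geometric bound.
v = rng.normal(size=nS)
err0 = np.abs(v - V_star).max()           # ||v_0 - V*||_inf
for k in range(1, 31):
    v = T(v)
    err = np.abs(v - V_star).max()
    assert err <= gamma**k * err0 + 1e-9  # ||v_k - V*|| <= gamma^k ||v_0 - V*||
print("geometric convergence verified for 30 iterations")
```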

Once you have V*, a greedy policy is:

π*(s) ∈ argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ] 

and it satisfies V_{π*} = V*.

submitted by /u/New-Yogurtcloset1818