Why does the greedy policy w.r.t. V* satisfy V* = V_{π*}?
I’m trying to understand the exact logic behind this key step in dynamic programming.
We know that V* satisfies the Bellman optimality equation:
V*(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
Now define the greedy policy with respect to V*:
a*(s) = argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
and define the deterministic policy:
π*(a|s) = 1 if a = a*(s), 0 otherwise
Step 1: Plug greedy action into Bellman optimality
Because π* selects the maximizing action:
V*(s) = r(s, a*(s)) + γ Σ_{s'} P(s'|s, a*(s)) V*(s')
This can be written compactly as:
V* = r_{π*} + γ P_{π*} V*
Step 2: Compare with policy evaluation equation
For any fixed policy π, its value function satisfies:
V_π = r_π + γ P_π V_π
This linear equation has a unique solution: for 0 ≤ γ < 1 the policy evaluation
operator T_π v = r_π + γ P_π v is itself a γ-contraction (equivalently, the
matrix I - γ P_π is invertible).
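To make this concrete, here is a minimal numerical sketch that solves V_π = r_π + γ P_π V_π as a linear system. The 2-state chain, P_pi, r_pi, and gamma are all made-up illustrative values, not from any particular source:

```python
import numpy as np

# Hypothetical 2-state MDP under some fixed policy pi (illustrative numbers):
# P_pi is the induced transition matrix, r_pi the induced reward vector.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([1.0, 0.0])

# V_pi = r_pi + gamma * P_pi V_pi  <=>  (I - gamma * P_pi) V_pi = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: V_pi satisfies the policy evaluation equation.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

Because I - γ P_π is invertible for γ < 1, `np.linalg.solve` returns the one and only solution, which is exactly the uniqueness fact the proof relies on.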
Step 3: Conclude equality
We just showed that V* satisfies the Bellman equation for π*:
V* = r_{π*} + γ P_{π*} V*
Since that equation has a unique solution, it follows that:
V* = V_{π*}
Intuition
- Bellman optimality gives V*
- Greedy extraction gives π*
- V* satisfies the Bellman equation for π*
- Uniqueness implies V* = V_{π*}
Therefore, the greedy policy w.r.t. V* is indeed optimal.
———————————————
Proof (contraction → existence/uniqueness → value iteration) for the Bellman optimality equation
Let the Bellman optimality operator T be:
(Tv)(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) v(s') ]
Equivalently (as in some slides):
v = f(v) = max_π ( r_π + γ P_π v )
where f = T (the same operator, written as a maximization over policies).
Assume the standard discounted MDP setting (finite state/action or bounded rewards) and 0 ≤ γ < 1.
Use the sup norm:
||v||_∞ = max_s |v(s)|
1) Contraction property: ||Tv - Tw||_∞ ≤ γ ||v - w||_∞
Fix any two value functions v, w. For each state s, define:
g_a(v;s) = r(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
Then:
(Tv)(s) = max_a g_a(v;s)
(Tw)(s) = max_a g_a(w;s)
Use the inequality:
|max_i x_i - max_i y_i| ≤ max_i |x_i - y_i|
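This max inequality is easy to sanity-check numerically. A spot check on random vectors (not a proof, just a quick confidence builder):

```python
import random

# Spot-check |max_i x_i - max_i y_i| <= max_i |x_i - y_i| on random vectors.
random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(4)]
    y = [random.uniform(-5, 5) for _ in range(4)]
    assert abs(max(x) - max(y)) <= max(abs(a - b) for a, b in zip(x, y)) + 1e-12
```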
So:
|(Tv)(s) - (Tw)(s)| = |max_a g_a(v;s) - max_a g_a(w;s)| ≤ max_a |g_a(v;s) - g_a(w;s)|
Now compute the difference inside:
|g_a(v;s) - g_a(w;s)| = |γ Σ_{s'} P(s'|s,a) (v(s') - w(s'))|
                      ≤ γ Σ_{s'} P(s'|s,a) |v(s') - w(s')|
                      ≤ γ ||v - w||_∞ Σ_{s'} P(s'|s,a)
                      = γ ||v - w||_∞
Therefore, for each s:
|(Tv)(s) - (Tw)(s)| ≤ γ ||v - w||_∞
Taking the max over s:
||Tv - Tw||_∞ ≤ γ ||v - w||_∞
So T is a contraction mapping with modulus γ.
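Here is a quick empirical check of the contraction bound on a randomly generated MDP (the sizes, seed, and distributions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Random illustrative MDP: 3 states, 2 actions.
P = rng.dirichlet(np.ones(3), size=(2, 3))   # P[a, s, :] is a distribution
r = rng.uniform(0, 1, size=(3, 2))           # r[s, a]

def T(v):
    """Bellman optimality operator: (Tv)(s) = max_a [r(s,a) + gamma * sum_s' P v]."""
    return np.max(r + gamma * np.einsum('ast,t->sa', P, v), axis=1)

# Empirically confirm ||Tv - Tw||_inf <= gamma * ||v - w||_inf.
for _ in range(1000):
    v, w = rng.normal(size=3), rng.normal(size=3)
    assert np.max(np.abs(T(v) - T(w))) <= gamma * np.max(np.abs(v - w)) + 1e-12
```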
2) Existence + uniqueness of V* (fixed point)
Since T is a contraction on the complete metric space (R^{|S|}, ||·||_∞), the Banach fixed-point theorem implies:
- There exists a fixed point V* such that:
V* = TV*
- The fixed point is unique.
This is exactly the statement “the BOE has a unique solution v*”.
3) Algorithm: Value Iteration converges exponentially fast
Define the iteration:
v_{k+1} = T v_k
By contraction:
||v_{k+1} - V*||_∞ = ||T v_k - T V*||_∞ ≤ γ ||v_k - V*||_∞
Apply repeatedly:
||v_k - V*||_∞ ≤ γ^k ||v_0 - V*||_∞
So convergence is geometric (“exponentially fast”), and the rate is determined by γ.
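A short sketch of value iteration on a random toy MDP (arbitrary sizes and seed, purely illustrative) that also checks the geometric error bound above:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(1)

# Random illustrative MDP: 3 states, 2 actions.
P = rng.dirichlet(np.ones(3), size=(2, 3))   # P[a, s, :] is a distribution
r = rng.uniform(0, 1, size=(3, 2))           # r[s, a]

def T(v):
    """Bellman optimality operator."""
    return np.max(r + gamma * np.einsum('ast,t->sa', P, v), axis=1)

# Run value iteration far enough that v is (numerically) the fixed point V*.
V_star = np.zeros(3)
for _ in range(2000):
    V_star = T(V_star)

# Check the bound ||v_k - V*||_inf <= gamma^k ||v_0 - V*||_inf along the way.
v = np.zeros(3)
err0 = np.max(np.abs(v - V_star))
for k in range(1, 50):
    v = T(v)
    assert np.max(np.abs(v - V_star)) <= gamma**k * err0 + 1e-10
```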
Once you have V*, a greedy policy is:
π*(s) ∈ argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
and it satisfies V_{π*} = V*.