How Does the Discount Factor γ Change the Optimal Policy?
In a simple gridworld example, everything stays the same except the discount factor γ.
- Reward for boundary/forbidden: -1
- Reward for target: +1
- Only γ changes
Case 1: γ = 0.9
The agent is long-term oriented.
Future rewards are discounted slowly:
γ⁵ ≈ 0.59
So even if the agent takes a -1 penalty now (stepping into a forbidden cell), the discounted value of reaching the target sooner can still outweigh that penalty.
Result:
The optimal policy is willing to take short-term losses to reach the goal faster.
Case 2: γ = 0.5
The agent becomes short-sighted.
Future rewards shrink very quickly:
γ⁵ = 0.5⁵ = 0.03125
Now immediate rewards dominate the decision.
The -1 penalty is now too costly compared to the heavily discounted future benefit of a shortcut.
Result:
The optimal policy avoids all forbidden areas and chooses safer but longer paths.
In short: A larger γ makes the agent more willing to accept short-term losses for long-term gains.
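The trade-off above can be checked numerically. The sketch below is a minimal illustration, not the exact gridworld from the post: it assumes a "short" path that takes one -1 forbidden step and then reaches the target, a "long" path that takes three safe (reward 0) steps first, and that the agent keeps collecting +1 every step once it sits at the target. The path lengths and the stay-at-target convention are assumptions for illustration.

```python
def discounted_return(prefix_rewards, gamma, horizon=200):
    """Discounted return of a reward prefix followed by +1 per step
    at the target (truncated at `horizon` steps as a stand-in for infinity)."""
    g = sum(gamma**t * r for t, r in enumerate(prefix_rewards))
    # After the prefix, the agent stays at the target earning +1 each step.
    g += sum(gamma**t * 1.0 for t in range(len(prefix_rewards), horizon))
    return g

# Short path: one forbidden step (-1), then the target.
# Long path: three safe steps (0, 0, 0), then the target.
short_path = [-1.0]
long_path = [0.0, 0.0, 0.0]

for gamma in (0.9, 0.5):
    s = discounted_return(short_path, gamma)
    l = discounted_return(long_path, gamma)
    best = "short (through forbidden)" if s > l else "long (safe detour)"
    print(f"gamma={gamma}: short={s:.3f}, long={l:.3f} -> prefer {best}")
```

With γ = 0.9 the shortcut wins (roughly 8.0 vs 7.29), while with γ = 0.5 the penalty is no longer worth it (0.0 vs 0.25), matching the two cases above.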
submitted by /u/New-Yogurtcloset1818