How Does the Discount Factor γ Change the Optimal Policy?

Take a simple gridworld and hold everything fixed except the discount factor γ.

  • Reward for hitting the boundary or entering a forbidden area: -1
  • Reward for reaching the target: +1
  • Only γ changes

Case 1: γ = 0.9

The agent is long-term oriented.

Future rewards are discounted slowly:

γ⁵ ≈ 0.59

So even if the agent takes a -1 penalty now by entering a forbidden area, the future reward it reaches sooner is still valuable enough to justify that penalty.
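To see how slowly the weight on future rewards fades, here is a minimal sketch that lists γ^k for the first few steps. The helper name `discount_weights` is made up for illustration and is not part of any library.

```python
def discount_weights(gamma, steps):
    """Return [gamma**0, gamma**1, ..., gamma**steps]."""
    return [gamma ** k for k in range(steps + 1)]

# Weight applied to a reward received k steps in the future, k = 0..5.
weights = discount_weights(0.9, 5)
for k, w in enumerate(weights):
    print(f"k={k}: weight={w:.4f}")
```

The last entry is the γ⁵ ≈ 0.59 figure quoted above. Passing a different γ to the same helper shows how quickly the weights shrink for other settings.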

Result:

The optimal policy accepts short-term losses, cutting through forbidden areas, to reach the target faster.

Case 2: γ = 0.5

The agent becomes short-sighted.

Future rewards shrink very quickly:

γ⁵ = 0.03125

Now immediate rewards dominate the decision.

The -1 penalty becomes too costly compared to the discounted future benefit.

Result:

The optimal policy avoids all forbidden areas and chooses safer but longer paths.
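The trade-off in both cases can be checked with a little arithmetic. The sketch below compares two hypothetical paths: a shortcut that enters one forbidden cell and then the target, and a detour that takes three free steps first. It assumes something the post does not state: once on the target, the agent keeps collecting +1 on every later step (a continuing task). The path lengths and the helper `continuing_return` are likewise assumptions for illustration.

```python
def continuing_return(rewards_before_target, gamma):
    """Discounted return of a path that afterwards earns +1 on every step.

    rewards_before_target: rewards on the steps taken before the first +1
    (hypothetical path description, not from the original post).
    """
    k = len(rewards_before_target)
    head = sum(r * gamma ** t for t, r in enumerate(rewards_before_target))
    # ASSUMPTION: +1 at steps k, k+1, ... forever, summed as a geometric series.
    tail = gamma ** k / (1.0 - gamma)
    return head + tail

shortcut = [-1]      # step into a forbidden cell, then onto the target
detour = [0, 0, 0]   # three free steps around it, then onto the target

for gamma in (0.9, 0.5):
    s = continuing_return(shortcut, gamma)
    d = continuing_return(detour, gamma)
    print(f"gamma={gamma}: shortcut={s:.2f}, detour={d:.2f}")
```

Under these assumptions the shortcut has the higher return at γ = 0.9 and the detour has the higher return at γ = 0.5. The continuing-reward assumption matters: if the +1 were paid only once, the shortcut's return would be -1 + γ, which is never positive, so it could not beat the detour.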

In short: A larger γ makes the agent more willing to accept short-term losses for long-term gains.
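For anyone who wants to reproduce the effect, here is a rough value-iteration sketch on a tiny grid. Everything about the setup is an assumption for illustration rather than the grid from the post: a 2x3 layout, deterministic moves, a "stay" action, -1 for bumping into the boundary, and +1 paid on every step the agent enters or stays on the target.

```python
# Hypothetical layout: S = start, F = forbidden, T = target, . = free cell.
GRID = ["SFT",
        "..."]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1),
           "right": (0, 1), "stay": (0, 0)}

def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
        return state, -1.0          # hit the boundary: stay put, pay -1
    cell = GRID[nr][nc]
    if cell == "F":
        return (nr, nc), -1.0       # entered (or stayed in) a forbidden cell
    if cell == "T":
        return (nr, nc), 1.0        # ASSUMPTION: also paid when staying on T
    return (nr, nc), 0.0

def greedy_policy(gamma, sweeps=500):
    """Run value iteration for a fixed number of sweeps, then act greedily."""
    states = [(r, c) for r in range(len(GRID)) for c in range(len(GRID[0]))]
    values = {s: 0.0 for s in states}

    def q(s, a):
        nxt, reward = step(s, a)
        return reward + gamma * values[nxt]

    for _ in range(sweeps):
        values = {s: max(q(s, a) for a in ACTIONS) for s in states}
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in states}

for gamma in (0.9, 0.5):
    print(gamma, greedy_policy(gamma)[(0, 0)])
```

On this layout the greedy action in the start cell is "right" (straight into the forbidden cell) for γ = 0.9 and "down" (the detour along the bottom row) for γ = 0.5, which matches the two cases above. A fixed sweep count is used instead of a convergence check just to keep the sketch short.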

submitted by /u/New-Yogurtcloset1818