How Does the Discount Factor γ Change the Optimal Policy?
In a simple gridworld example, everything stays the same except the discount factor γ.
- Reward for boundary/forbidden: -1
- Reward for target: +1
- Only γ changes
Case 1: γ = 0.9
The agent is long-term oriented.
Future rewards are discounted slowly:
γ⁵ ≈ 0.59
So even if the agent takes a -1 penalty now (stepping into a forbidden cell), the discounted value of reaching the target sooner can still outweigh that penalty.
Result:
The optimal policy is willing to take short-term losses to reach the goal faster.
Case 2: γ = 0.5
The agent becomes short-sighted.
Future rewards shrink very quickly:
γ⁵ = 0.5⁵ = 0.03125
Now immediate rewards dominate the decision.
The -1 penalty is now too costly compared to the heavily discounted future benefit of a shortcut.
Result:
The optimal policy avoids all forbidden areas and chooses safer but longer paths.
In short: A larger γ makes the agent more willing to accept short-term losses for long-term gains.
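The trade-off above can be checked numerically. The sketch below is a minimal illustration, not the exact gridworld from the post: it assumes a "short" path that takes one -1 forbidden step and then reaches the target, a "long" path that takes three safe (reward 0) steps first, and that the agent keeps collecting +1 every step once it sits at the target. The path lengths and the stay-at-target convention are assumptions for illustration.

```python
def discounted_return(prefix_rewards, gamma, horizon=200):
    """Discounted return of a reward prefix followed by +1 per step
    at the target (truncated at `horizon` steps as a stand-in for infinity)."""
    g = sum(gamma**t * r for t, r in enumerate(prefix_rewards))
    # After the prefix, the agent stays at the target earning +1 each step.
    g += sum(gamma**t * 1.0 for t in range(len(prefix_rewards), horizon))
    return g

# Short path: one forbidden step (-1), then the target.
# Long path: three safe steps (0, 0, 0), then the target.
short_path = [-1.0]
long_path = [0.0, 0.0, 0.0]

for gamma in (0.9, 0.5):
    s = discounted_return(short_path, gamma)
    l = discounted_return(long_path, gamma)
    best = "short (through forbidden)" if s > l else "long (safe detour)"
    print(f"gamma={gamma}: short={s:.3f}, long={l:.3f} -> prefer {best}")
```

With γ = 0.9 the shortcut wins (roughly 8.0 vs 7.29), while with γ = 0.5 the penalty is no longer worth it (0.0 vs 0.25), matching the two cases above.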
submitted by /u/New-Yogurtcloset1818