Teaching Humans using Expert RL Policies
RL is powerful enough to train superhuman policies, especially in video games. But is there any research on how to leverage RL's policy/value networks to improve human training speed? In effect, how could we apply something like behavioral cloning in reverse, transferring an expert policy's behavior to a human?
Past research has shown that simply providing a human with optimal moves doesn't improve their pattern recognition or performance; it only increases their reliance on the feedback, making them worse.
Humans use some form of RL to learn motor skills and are far more sample-efficient than current algorithms. So, using guidance from expert policies, we could steer humans along near-optimal trajectories, reducing the time they waste on exploration.
With the help of value predictions, one could determine whether a given action was suboptimal, helping address the credit assignment problem. But what is the best way to signal that to a human (e.g., display a number on the screen, flash red/green colors, or perhaps electrocute them)?
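To make the idea concrete, here is a minimal sketch of how a trained critic's value estimates could be turned into red/green feedback for a human trainee. Everything here is an assumption for illustration: `q_values` stands in for a real trained Q-network, and the states/actions are mocked.

```python
# Hypothetical sketch: converting an expert critic's action-value
# estimates into simple red/green feedback for a human.
# q_values is a placeholder for a trained Q-network (e.g., from DQN).

def q_values(state):
    # Mocked output of a trained critic: one value per available action.
    return {"left": 0.2, "right": 0.9, "stay": 0.5}

def feedback(state, human_action, tolerance=0.1):
    """Signal whether the human's action was near-optimal.

    Advantage-style check: compare the chosen action's value to the
    best available value; within `tolerance`, show green, else red.
    """
    q = q_values(state)
    best = max(q.values())
    return "green" if best - q[human_action] <= tolerance else "red"

print(feedback(None, "right"))  # prints "green" (near-optimal action)
print(feedback(None, "left"))   # prints "red" (clearly suboptimal)
```

The `tolerance` knob matters for the over-reliance problem above: a looser threshold gives the human room to explore their own style while still flagging clear mistakes.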
submitted by /u/MaxedUPtrevor