Strategies for RL when the environment step involves costly simulation?
Hi Reddit,
Really new to RL here, but super curious and excited to learn from you guys.
I’m planning to work on a code-generation RL agent: The agent generates a program/configuration (Action), which is then compiled and run through a complex simulator (Environment) to calculate a performance metric (Reward).
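To make the setup concrete, here is a rough sketch of the loop I have in mind. Everything in it is a placeholder: `generate_program`, `simulate`, and the toy `unroll=` configuration are hypothetical stand-ins, not my real policy or simulator, and the real `simulate` call would block for minutes rather than return instantly.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Transition:
    program: str   # action: the generated program/configuration
    reward: float  # performance metric reported by the simulator


def generate_program(rng: random.Random) -> str:
    """Hypothetical policy: emits a toy configuration string."""
    return f"unroll={rng.choice([1, 2, 4, 8])}"


def simulate(program: str, delay_s: float = 0.0) -> float:
    """Hypothetical stand-in for compile + simulate (minutes in reality)."""
    time.sleep(delay_s)  # placeholder for the expensive part
    unroll = int(program.split("=")[1])
    return 1.0 / (1.0 + abs(unroll - 4))  # toy metric, peaks at unroll=4


def collect(num_episodes: int, seed: int = 0) -> list:
    """One simulator call per episode: this is the cost I want to reduce."""
    rng = random.Random(seed)
    transitions = []
    for _ in range(num_episodes):
        program = generate_program(rng)
        transitions.append(Transition(program, simulate(program)))
    return transitions


if __name__ == "__main__":
    for t in collect(3):
        print(t.program, t.reward)
```

Each episode is a single action followed by a single reward, so every sample the agent learns from costs one full simulator run.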
The Bottleneck: The simulation takes several minutes to run. I cannot assume instant feedback.
The Question: Aside from massive parallelization, what algorithmic tricks exist for this ‘expensive reward’ regime? I’m looking at methods like GRPO or model-based RL, but I’m unsure whether they would apply or scale when every reward costs a multi-minute simulation.
submitted by /u/QileHQ