Strategies for RL when the environment step involves costly simulation?

digitado ⋅ 10 January 2026

Hi Reddit,

Really new to RL here, but super curious and excited to learn from you guys.

I’m planning to work on a code-generation RL agent: The agent generates a program/configuration (Action), which is then compiled and run through a complex simulator (Environment) to calculate a performance metric (Reward).
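To make the setup concrete, here is a rough sketch of the loop I have in mind. Everything in it is a placeholder: `run_simulator`, `env_step`, and `StepResult` are hypothetical names, and the toy scoring rule only stands in for the real compile-and-simulate call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    program: str   # the action: generated program/configuration
    reward: float  # the performance metric returned by the simulator


def run_simulator(program: str) -> float:
    """Hypothetical stand-in for compile + simulate.

    The real call would take minutes; this toy metric just
    scores shorter programs higher so the sketch is runnable.
    """
    return 1.0 / (1.0 + len(program))


def env_step(program: str,
             simulate: Callable[[str], float] = run_simulator) -> StepResult:
    # One environment step: a full program in, a single scalar reward out.
    return StepResult(program=program, reward=simulate(program))


if __name__ == "__main__":
    result = env_step("abc")
    print(result.program, result.reward)
```

Each episode is effectively a single step, so the cost per reward sample is the cost of one full simulation.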

The Bottleneck: The simulation takes several minutes to run. I cannot assume instant feedback.

The Question: Aside from massive parallelization, what algorithmic tricks exist for this ‘expensive reward’ regime? I’m looking at methods like GRPO or model-based RL, but I’m unsure whether they would apply to my setup or scale to simulations this slow.
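For context, my understanding of the GRPO part is that it samples a group of candidate programs for the same prompt and scores each one relative to the group, so every sample in the group needs its own simulator run. A minimal sketch of that normalization, assuming one scalar reward per candidate (the function name and `eps` value are my own, and implementations may differ on details such as which standard deviation they use):

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled programs.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three candidates for one prompt = three simulator runs.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

With a group size of G, one policy update on one prompt costs G simulations, which is why I am unsure how well this fits my case.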

submitted by /u/QileHQ