Conservative Risk-Sensitive Reinforcement Learning for Reliable Decision-Making Under Uncertainty

This paper addresses complex decision-making scenarios characterized by high uncertainty and costly errors, and develops a risk-sensitive reinforcement learning method for reliable decision-making. It focuses on two reliability issues that arise under offline data conditions: tail instability in the return distribution and out-of-distribution actions. Methodologically, the decision process is modeled as a Markov decision process, and the return distribution, rather than the expected return alone, is taken as the learning object so that value information under adverse outcomes is retained. On this basis, a conditional value-at-risk (CVaR) metric is introduced to explicitly characterize and suppress tail risk, so that policy optimization no longer relies solely on expected returns. To mitigate estimation bias and over-extrapolation in offline learning, conservative constraints based on the behavioral distribution are further incorporated: by limiting the deviation between the learned policy and the behavior policy implicit in the data, expansion onto out-of-distribution actions is suppressed and policy updates become more controllable. The overall framework unifies risk measurement and conservative learning in a single optimization objective, yielding a policy learning mechanism that balances return and safety. Comparative experiments show that the method achieves superior overall performance in average return, tail-reward robustness, and safety-related indicators, validating the joint modeling of risk-sensitive objectives and conservative constraints and providing an auditable, tunable risk-control approach for highly reliable intelligent decision-making systems.
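As a point of reference, one plausible formalization of the unified objective described above is the following (the abstract does not state it explicitly; the tail level α, trade-off weight λ, and implicit behavior policy π̂_β are assumed notation, and the CVaR expression is the standard Rockafellar-Uryasev form for the lower tail):

```latex
\max_{\pi}\;\; \mathrm{CVaR}_{\alpha}\!\left[Z^{\pi}\right]
  \;-\; \lambda\, D\!\left(\pi \,\middle\|\, \hat{\pi}_{\beta}\right),
\qquad
\mathrm{CVaR}_{\alpha}(Z) \;=\;
  \max_{\nu \in \mathbb{R}} \left\{ \nu - \tfrac{1}{\alpha}\,
  \mathbb{E}\!\left[(\nu - Z)_{+}\right] \right\}
```

Here Z^π is the random return under policy π, CVaR_α averages its worst α-fraction of outcomes, D is a divergence between the learned policy and the behavior policy recovered from the offline data, and λ trades risk-sensitive return against conservatism.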
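A minimal sketch of how such an objective can be optimized in practice, assuming a quantile-based critic for the return distribution and a squared-error behavior regularizer standing in for the divergence constraint. All module names, dimensions, and hyperparameters below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Illustrative sizes and hyperparameters; the abstract specifies none of these.
STATE_DIM, ACTION_DIM, N_QUANTILES = 17, 6, 32
ALPHA = 0.1       # CVaR tail level: optimize the mean of the worst 10% of outcomes
LAMBDA_REG = 2.5  # weight of the conservative behavior-regularization term


class QuantileCritic(nn.Module):
    """Predicts N_QUANTILES quantiles of the return distribution Z(s, a)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, N_QUANTILES),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # (batch, N_QUANTILES)


def cvar_from_quantiles(quantiles, alpha):
    """Empirical CVaR_alpha: average of the lowest alpha-fraction of quantiles."""
    k = max(1, int(alpha * quantiles.shape[-1]))
    worst, _ = torch.topk(quantiles, k, dim=-1, largest=False)
    return worst.mean(dim=-1)


def actor_loss(actor, critic, states, behavior_actions):
    """Risk-sensitive objective with a conservative penalty: maximize the CVaR of
    the predicted return while staying close to actions seen in the offline data."""
    actions = actor(states)
    risk_term = cvar_from_quantiles(critic(states, actions), ALPHA).mean()
    # Squared-error surrogate for the divergence-to-behavior constraint.
    conservatism = ((actions - behavior_actions) ** 2).mean()
    return -risk_term + LAMBDA_REG * conservatism


# Usage on a random stand-in for an offline batch:
actor = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACTION_DIM), nn.Tanh())
critic = QuantileCritic()
states = torch.randn(64, STATE_DIM)
behavior_actions = torch.rand(64, ACTION_DIM) * 2 - 1
actor_loss(actor, critic, states, behavior_actions).backward()
```

In this sketch, lowering ALPHA makes the objective more pessimistic about tail outcomes, while raising LAMBDA_REG pulls the policy toward the dataset's action distribution; these two knobs correspond to the "auditable and adjustable" risk controls the abstract refers to.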
