Migrated from PPO to SAC for multi-asset RL allocation — here’s what actually changed and why
I’ve been running RL agents for portfolio allocation across equities for a while now — daily OHLCV, quarterly fundamentals, TTM metrics, and options surface data as observations. Wanted to share some practical notes on migrating from PPO to SAC since most of the PPO vs SAC discussion I see online is benchmarked on MuJoCo, not financial data.
Why PPO stopped being sufficient
PPO worked fine on clean single-frequency daily data. The issues showed up when I introduced mixed-frequency observations:
- Sample efficiency on finite data. This is the big one. On-policy training means every rollout gets used for a few gradient epochs and then discarded. In sim environments you can generate effectively unlimited experience; with historical market data, your training set is fixed. Rare regimes (the COVID vol spike, the 2022 rate shock, etc.) get seen once and thrown away, so the agent never develops robust behavior for tail events because it doesn't revisit them. See the config sketch after this list.
- Regime bias. PPO’s on-policy batches are dominated by whatever regime they happen to sample from. Over a full training run the policy converges toward behavior that works in the dominant regime. Global Sharpe looked fine. Regime-conditional Sharpe told a very different story — strong in trending, weak during transitions.
- Entropy collapse. PPO naturally reduces policy entropy over training. In a non-stationary environment, that means the agent commits to one strategy and adjusts slowly when conditions change. Bad if you need the agent to maintain behavioral diversity across regimes.
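To make the data-reuse point concrete, here is roughly what the PPO side looks like in Stable-Baselines3 terms. This is a minimal sketch, not my exact config, and the environment is a stand-in for a gym-style portfolio env:

```python
# Sketch only: each rollout of n_steps transitions is reused for n_epochs gradient
# passes and then discarded. On a fixed historical dataset, rare regimes show up
# once per pass and are never revisited.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # stand-in continuous-action env; imagine a gym-style portfolio env here

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,    # rollout length collected on-policy
    n_epochs=10,     # each rollout is reused only this many times, then thrown away
    batch_size=256,
    ent_coef=0.0,    # default: no entropy bonus, so policy entropy tends to decay over training
)
model.learn(total_timesteps=100_000)
```

Every `n_steps` window is collected fresh from the current policy, reused for `n_epochs` passes, and then gone, which is exactly the wrong shape for a fixed dataset where the interesting events are rare.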
What SAC changed
- An off-policy replay buffer means transitions from rare regimes get revisited thousands of times instead of being used once and discarded. For finite-data environments this is the single biggest difference (sketch after this list).
- Entropy maximization keeps the policy from collapsing to one regime-specific strategy. The agent maintains diversity without explicit regime conditioning.
- Smoother continuous-action behavior for position sizing, with less erratic allocation adjustments during volatile periods.
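The SAC side, in the same sketch form (again illustrative, not my exact setup):

```python
# Sketch only: a large replay buffer keeps rare-regime transitions available for
# reuse, and the automatically tuned entropy temperature discourages premature
# collapse onto one regime-specific strategy.
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")  # stand-in continuous-action env; imagine a gym-style portfolio env here

model = SAC(
    "MlpPolicy",
    env,
    buffer_size=1_000_000,   # old transitions, including rare regimes, stay available for replay
    learning_starts=10_000,
    batch_size=256,
    ent_coef="auto",         # temperature tuned toward a target entropy, keeping the policy stochastic
    train_freq=1,
    gradient_steps=1,
)
model.learn(total_timesteps=100_000)
```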
Directional results: regime-conditional Sharpe improved, particularly during transitional periods. Max drawdown was comparable globally but better-distributed — fewer deep drawdowns clustered in specific market states.
What SAC doesn’t solve
Being honest about the tradeoffs:
- Q-function overestimation with heavy-tailed reward distributions (financial data has plenty of these)
- Replay buffer staleness in non-stationary environments — transitions from 3 years ago might actively mislead the agent about current market structure (one possible mitigation is sketched after this list)
- Temperature tuning sensitivity to reward scale, which varies across market conditions
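On the staleness point specifically, one obvious direction is to down-weight old transitions at sampling time rather than evict them. A minimal sketch below; the function name and half-life value are illustrative, and I'm not claiming this solves the problem:

```python
# Sketch: sample replay indices with exponential recency weighting, so old-regime
# transitions are still revisited, just at a lower rate than recent ones.
import numpy as np

def recency_weighted_indices(buffer_len, batch_size, half_life=50_000, rng=None):
    """Sample indices into a replay buffer, biased toward recent transitions.

    Index buffer_len - 1 is the newest transition; sampling weight halves every
    `half_life` transitions into the past.
    """
    rng = rng or np.random.default_rng()
    age = np.arange(buffer_len)[::-1]        # 0 for the newest index, buffer_len - 1 for the oldest
    weights = 0.5 ** (age / half_life)
    weights /= weights.sum()
    return rng.choice(buffer_len, size=batch_size, p=weights)

# Example: with a 1M-transition buffer, most of a 256-sample batch comes from the
# most recent transitions, but older regimes still appear occasionally.
idx = recency_weighted_indices(buffer_len=1_000_000, batch_size=256)
```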
The thing I actually learned
The algorithm swap mattered less than rebuilding my evaluation to slice by regime. Once I could see performance conditioned on market state instead of just global aggregates, the decision was obvious. If you’re only looking at global Sharpe and max drawdown, you’re probably missing the most important signals.
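For reference, "slice by regime" doesn't need to be fancy. A minimal sketch of the kind of metric I mean, with illustrative column and regime label names:

```python
# Sketch: annualized Sharpe of daily strategy returns, grouped by a regime label,
# instead of one global number.
import numpy as np
import pandas as pd

def regime_conditional_sharpe(daily_returns: pd.Series, regime: pd.Series,
                              periods_per_year: int = 252) -> pd.Series:
    """Annualized Sharpe per regime label, aligned on the same date index."""
    df = pd.DataFrame({"ret": daily_returns, "regime": regime}).dropna()
    grouped = df.groupby("regime")["ret"]
    return grouped.mean() / grouped.std() * np.sqrt(periods_per_year)

# Hypothetical usage: `returns` indexed by date, `regime` holding labels like
# "trending", "mean_reverting", "transition" from whatever regime classifier you use.
# sharpe_by_regime = regime_conditional_sharpe(returns, regime)
```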
I wrote a longer version with architecture diagrams and config examples if anyone wants the detail: Medium
The platform I run this on is open source if anyone wants to look at the experiment/evaluation setup: GitHub
Curious if others have run into similar issues with on-policy methods on finite, non-stationary data — financial or otherwise. Has anyone experimented with hybrid approaches like off-policy replay with on-policy updates? And for those using SAC on real-world sequential decision problems: how are you handling replay buffer staleness when the environment dynamics shift over time?