Dual-Constrained Agentic PPO for Web Agents Under Multi-Cost Budgets and CVaR Failure Risk

Web agents must complete long-horizon browsing tasks while controlling heterogeneous operational costs (e.g., API calls, latency, and monetary fees) and avoiding catastrophic failures (e.g., irreversible clicks, account deletion, payment submission). We formulate web interaction as a constrained MDP with a multi-dimensional cumulative cost vector and a tail-risk objective on failure penalties. We propose DCAPPO, a dual-constrained policy optimization method that (i) enforces multi-cost budgets via primal–dual Lagrangian updates with per-cost adaptive multipliers, and (ii) minimizes $\mathrm{CVaR}_{\alpha}$ of episodic failure loss using quantile regression on trajectory returns. To stabilize training under sparse success rewards, DCAPPO integrates a self-imitation buffer and failure-aware advantage shaping that down-weights high-variance steps. We recommend evaluation on BrowserGym/WebArena-style environments with 1,200–1,800 tasks spanning 40–80 website templates, reporting (a) task success rate, (b) mean cost per success, (c) $\mathrm{CVaR}_{0.1}$ failure loss, and (d) constraint violation frequency. The ablations isolate the gains from CVaR control and from per-cost dual updates, targeting a consistent reduction in tail failures under fixed cost budgets.
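To make the two constraint mechanisms concrete, the following is a minimal sketch and not the DCAPPO implementation. It assumes projected dual ascent for the per-cost multipliers and a plain empirical estimator for CVaR, in place of the quantile-regression estimator the abstract mentions. All function names, the learning rate, and the convention that $\mathrm{CVaR}_{0.1}$ is the mean of the worst 10% of episodic failure losses are illustrative assumptions.

```python
import numpy as np


def update_multipliers(lmbda, mean_costs, budgets, lr=0.05):
    """Projected dual ascent with one multiplier per cost dimension.

    Hypothetical helper: a multiplier grows when the average episodic cost
    exceeds its budget, shrinks otherwise, and is clipped at zero.
    """
    lmbda = np.asarray(lmbda, dtype=float)
    violation = np.asarray(mean_costs, dtype=float) - np.asarray(budgets, dtype=float)
    return np.maximum(0.0, lmbda + lr * violation)


def lagrangian_reward(reward, costs, lmbda):
    """Task reward penalized by the multiplier-weighted cost vector."""
    return float(reward) - float(np.dot(lmbda, costs))


def empirical_cvar(losses, alpha=0.1):
    """Mean of the worst alpha-fraction of losses (assumed tail convention)."""
    ordered = np.sort(np.asarray(losses, dtype=float))[::-1]  # largest first
    k = max(1, int(np.ceil(alpha * len(ordered))))
    return float(ordered[:k].mean())


if __name__ == "__main__":
    # Illustrative numbers only: three costs (API calls, latency in s, fees).
    lmbda = update_multipliers(
        np.zeros(3), mean_costs=[12.0, 3.0, 0.5], budgets=[10.0, 4.0, 0.5], lr=0.1
    )
    print("multipliers:", lmbda)
    print("shaped reward:", lagrangian_reward(1.0, [12.0, 3.0, 0.5], lmbda))
    print("CVaR_0.1:", empirical_cvar([0, 0, 0, 0, 0, 0, 0, 0, 1, 10], alpha=0.1))
```

In this sketch only the first cost exceeds its budget, so only its multiplier becomes positive. The shaped reward is what a PPO-style learner would optimize in the primal step, while the CVaR estimate summarizes how severe the rare failures are rather than how frequent they are.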
