[R] Dense process rewards from LLM feedback for multi-agent credit assignment
We’ve been working on training multi-agent LLM systems end-to-end with RL. Two problems kept biting us:

- Credit assignment. When the pipeline fails, all agents share the same outcome reward. Agent 3 crashes because Agent 1 forgot to save a file? Both get penalized equally.
- Sparse rewards. Multi-agent rollouts are expensive: dozens of LLM generations, tool executions, minutes per episode. One scalar at the end is a lot of supervision to leave on the table.

Approach

We use an external LLM as a “coach” that scores each agent action as it happens. The coach sees the rollout so far, including each agent’s actions and their tool outputs.
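To make that concrete, here is a minimal sketch of a per-action coach call, assuming an OpenAI-style chat-completions client. The prompt wording, the `coach_score` helper, the `gpt-4o-mini` model name, and the [-1, 1] score range are illustrative assumptions, not the exact setup from our code:

```python
import json
from openai import OpenAI  # any chat-completions client would work; OpenAI is just an example

client = OpenAI()

COACH_PROMPT = """You are a coach scoring one agent's action inside a multi-agent pipeline.
Given the task, the trajectory so far, and the current action with its tool output,
return JSON {"score": <float in [-1, 1]>, "reason": "..."}.
Penalize the agent that actually caused a failure, not the downstream agents."""

def coach_score(task: str, history: list[str], agent_name: str,
                action: str, tool_output: str) -> float:
    """Ask the coach LLM for a dense per-action reward (hypothetical helper)."""
    messages = [
        {"role": "system", "content": COACH_PROMPT},
        {"role": "user", "content": json.dumps({
            "task": task,
            "history": history,          # prior agents' actions and tool outputs
            "agent": agent_name,
            "action": action,
            "tool_output": tool_output,
        })},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},
    )
    return float(json.loads(resp.choices[0].message.content)["score"])
```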
This gives dense per-action rewards without ground-truth labels. When something breaks, the coach traces through tool outputs to assign blame to the right agent. We train with REINFORCE++ (clipped advantages, no critic needed), so each action gets its own reward signal.
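The update can be sketched as a PPO-style clipped policy gradient over the coach’s per-action scores, with batch-normalized rewards standing in for a critic. This is a simplified illustration in PyTorch (the function name, the normalization choice, and the omitted KL penalty are my assumptions, not the exact REINFORCE++ implementation):

```python
import torch

def reinforce_pp_loss(logprobs_new: torch.Tensor,
                      logprobs_old: torch.Tensor,
                      rewards: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss with critic-free advantages.

    All tensors have shape (num_actions,), one entry per agent action.
    `rewards` are the coach's dense per-action scores.
    """
    # Critic-free advantage: normalize the dense rewards across the batch.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Standard PPO clipping applied to the normalized advantages.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()


# Toy usage: three agent actions in one rollout, each with its own coach reward.
lp_new = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
lp_old = torch.tensor([-1.1, -0.9, -2.1])
r = torch.tensor([0.7, -0.5, 0.2])  # coach scores per action
loss = reinforce_pp_loss(lp_new, lp_old, r)
loss.backward()
```

The point of the per-action granularity is that the agent who caused the failure gets the negative advantage, while agents that acted correctly keep their positive signal.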
Results

Math (3 agents: solver → coder → verifier):

Data Science (3 agents: data engineer → modeler → analyst):
Links
Curious what others think about using LLM judgments as reward signals. The coach is obviously not perfect, but it beats outcome-only rewards for multi-agent setups.