[R] Dense process rewards from LLM feedback for multi-agent credit assignment



We’ve been working on training multi-agent LLM systems end-to-end with RL. Two problems kept biting us:

Credit assignment. When the pipeline fails, all agents share the same outcome reward. Agent 3 crashes because Agent 1 forgot to save a file? Both get penalized equally.

Sparse rewards. Multi-agent rollouts are expensive: dozens of LLM generations, tool executions, minutes per episode. Getting only one scalar at the very end leaves a lot of supervision on the table.

Approach

We use an external LLM as a “coach” that scores each agent action as it happens. The coach sees:

  • Agent role and instructions
  • Input context
  • Agent’s output
  • Tool feedback (stdout, stderr, errors)

This gives dense per-action rewards without ground truth labels. When something breaks, the coach traces through tool outputs to assign blame correctly.
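To make this concrete, here is a minimal sketch of what one coach call could look like. The prompt wording, the `score_action` helper, the OpenAI-style client, and the coach model name are all our assumptions for illustration, not the exact setup from the post:

```python
import re
from openai import OpenAI  # assumed judge backend; any chat-completion API would do

client = OpenAI()

COACH_PROMPT = """You are grading one step of a multi-agent pipeline.
Agent role and instructions:
{role}

Input context given to the agent:
{context}

Agent output:
{output}

Tool feedback (stdout, stderr, errors):
{tool_feedback}

Rate how much this specific action contributed to solving the task, from
-1.0 (actively harmful) to 1.0 (clearly helpful). If a later failure is
traceable to this step, penalize this step. Reply with a single number."""


def score_action(role: str, context: str, output: str, tool_feedback: str) -> float:
    """Ask the coach LLM for a scalar process reward for one agent action."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed coach model
        messages=[{"role": "user", "content": COACH_PROMPT.format(
            role=role, context=context, output=output, tool_feedback=tool_feedback)}],
        temperature=0.0,
    )
    match = re.search(r"-?\d+(?:\.\d+)?", resp.choices[0].message.content)
    return max(-1.0, min(1.0, float(match.group()))) if match else 0.0
```

In the full pipeline, each agent step would get such a score right after its tool feedback comes back, which is what produces the dense per-action rewards used during training.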

We train with REINFORCE++ (clipped advantages, no critic needed), so each action gets its own reward signal.
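For the update itself, here is a rough critic-free sketch under our own assumptions (batch-normalized returns used as advantages plus PPO-style ratio clipping); the real REINFORCE++ recipe also includes pieces like a KL penalty that we leave out here:

```python
import torch


def reinforce_pp_loss(logprobs, old_logprobs, rewards, gamma=1.0, clip_eps=0.2):
    """Critic-free policy-gradient loss over a batch of per-action rewards.

    logprobs:     (B, T) log-probs of sampled actions under the current policy
    old_logprobs: (B, T) log-probs under the rollout (behavior) policy
    rewards:      (B, T) dense per-action rewards from the coach
    """
    # Return-to-go per action: with dense rewards every step has its own
    # learning signal instead of one terminal scalar.
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(rewards.size(1))):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running

    # No value network: advantages are just batch-normalized returns.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    # PPO-style clipped surrogate keeps the off-policy update stable.
    ratio = torch.exp(logprobs - old_logprobs)
    loss = -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return loss.mean()
```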

Results

Math (3 agents: solver → coder → verifier):

  • AIME: +5 to +17.5pp
  • AMC: +7.8 to +17.2pp

Data Science (3 agents: data engineer → modeler → analyst):

  • Success rate: +16.7pp
  • Accuracy: +23%
  • F1 (classification): +38%
  • RMSE (regression): -41%

Links

Curious what others think about using LLM judgments as reward signals. The coach is obviously not perfect, but it beats outcome-only rewards for multi-agent setups.

submitted by /u/TapOnly5061
