[R] Dense process rewards from LLM feedback for multi-agent credit assignment
We’ve been working on training multi-agent LLM systems end-to-end with RL. Two problems kept biting us:

- Credit assignment. When the pipeline fails, all agents share the same outcome reward. Agent 3 crashes because Agent 1 forgot to save a file? Both get penalized equally.
- Sparse rewards. Multi-agent rollouts are expensive: dozens of LLM generations, tool executions, minutes per episode. One scalar at the end is a lot of supervision to leave on the table.

Approach

We use an external LLM as a “coach” that scores each agent action as it happens. The coach sees the rollout so far, including each agent’s actions and their tool outputs.
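To make that concrete, here is a minimal sketch of a per-action coach call, assuming an OpenAI-style chat-completions client. The prompt wording, the `coach_score` helper, the `gpt-4o-mini` model name, and the [-1, 1] score range are illustrative assumptions, not the exact setup from our code:

```python
import json
from openai import OpenAI  # any chat-completions client would work; OpenAI is just an example

client = OpenAI()

COACH_PROMPT = """You are a coach scoring one agent's action inside a multi-agent pipeline.
Given the task, the trajectory so far, and the current action with its tool output,
return JSON {"score": <float in [-1, 1]>, "reason": "..."}.
Penalize the agent that actually caused a failure, not the downstream agents."""

def coach_score(task: str, history: list[str], agent_name: str,
                action: str, tool_output: str) -> float:
    """Ask the coach LLM for a dense per-action reward (hypothetical helper)."""
    messages = [
        {"role": "system", "content": COACH_PROMPT},
        {"role": "user", "content": json.dumps({
            "task": task,
            "history": history,          # prior agents' actions and tool outputs
            "agent": agent_name,
            "action": action,
            "tool_output": tool_output,
        })},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},
    )
    return float(json.loads(resp.choices[0].message.content)["score"])
```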
This gives dense per-action rewards without ground-truth labels. When something breaks, the coach traces through tool outputs to assign blame to the right agent. We train with REINFORCE++ (clipped advantages, no critic needed), so each action gets its own reward signal.
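The update can be sketched as a PPO-style clipped policy gradient over the coach’s per-action scores, with batch-normalized rewards standing in for a critic. This is a simplified illustration in PyTorch (the function name, the normalization choice, and the omitted KL penalty are my assumptions, not the exact REINFORCE++ implementation):

```python
import torch

def reinforce_pp_loss(logprobs_new: torch.Tensor,
                      logprobs_old: torch.Tensor,
                      rewards: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss with critic-free advantages.

    All tensors have shape (num_actions,), one entry per agent action.
    `rewards` are the coach's dense per-action scores.
    """
    # Critic-free advantage: normalize the dense rewards across the batch.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Standard PPO clipping applied to the normalized advantages.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()


# Toy usage: three agent actions in one rollout, each with its own coach reward.
lp_new = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
lp_old = torch.tensor([-1.1, -0.9, -2.1])
r = torch.tensor([0.7, -0.5, 0.2])  # coach scores per action
loss = reinforce_pp_loss(lp_new, lp_old, r)
loss.backward()
```

The point of the per-action granularity is that the agent who caused the failure gets the negative advantage, while agents that acted correctly keep their positive signal.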
Results

Math (3 agents: solver → coder → verifier):

Data Science (3 agents: data engineer → modeler → analyst):
Links
Curious what others think about using LLM judgments as reward signals. The coach is obviously not perfect, but it beats outcome-only rewards for multi-agent setups.