Deadlock and suboptimal coordination in CTDE Soft Actor-Critic with continuous training
I’m working on a cooperative MARL problem where agents need to complete their individual but interdependent tasks to reach a combined goal.
Methodology: CTDE soft actor-critic learning
I have defined a global reward plus a potential-based shaping reward, both computed from the global state; their sum is fed into the critic network. Furthermore, I use one actor network that receives the TD error for every single agent. I train continuously, step by step, rather than in episodes and without resetting the environment. The global reward function is evaluated at every step, and that is also how the objectives are defined.
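For concreteness, the core of the update looks roughly like this (a simplified PyTorch sketch, not my exact code; `phi`, `target_critic`, and the tensor shapes are placeholders for my actual setup):

```python
import torch

def shaped_reward(r_global, s, s_next, phi, gamma=0.99):
    # Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s),
    # which preserves the optimal policy under the discounted criterion.
    return r_global + gamma * phi(s_next) - phi(s)

def critic_target(r_global, s, s_next, next_joint_action, next_log_prob,
                  target_critic, phi, alpha=0.2, gamma=0.99):
    # CTDE: the centralized critic sees the global state and the joint action.
    r = shaped_reward(r_global, s, s_next, phi, gamma)
    with torch.no_grad():
        q_next = target_critic(s_next, next_joint_action)
        # SAC soft target: subtract entropy temperature alpha times the log-prob.
        return r + gamma * (q_next - alpha * next_log_prob)
```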
Outcome: Emergence of a deadlock, although the majority of the time it works fine.
During inference, the agents are able to execute their individual tasks, and thus the group tasks, most of the time. However, some agents occasionally refuse to do particular tasks that are evidently available (up for grabs!) even though nothing is stopping them. Since these tasks are interdependent, this stalls all my other agents: the group objective cannot be completed.
On top of that, some agents that have nothing to do prefer to run around and do their own thing, since the stalled agents are stopping them from starting their next task. I can only describe it as some form of deadlock. The global reward remains relatively constant, albeit ‘jittery’.
Potential causes: reward hacking, credit assignment, or something in the training setup?
I’m left to think there could be several causes.
- An obvious one is that the definition of the reward function is not satisfactory. The policy of one particular agent prefers to do an alternative task, and since there is interdependency, the policies of all the other agents are trained to become more random as a consequence, in order to increase the chance of finding a reward. Could it be that the stalled agents are thus led to reward-hack alternative sparse rewards?
- Since I use a single global reward for all agents, is it likely that “lazy” agents are being reinforced by the successes of others? Would transitioning to a QMIX-style value decomposition, or decomposing my potential-based reward (PBR) into agent-specific components (see the sketch after this list), significantly mitigate these deadlocks?
- Or could it be the method of training? I train continuously, so if a deadlock arises during training, the situation will of course not improve significantly over time. One option is to train in episodes and reset the environment to start fresh and circumvent the deadlock, but then I am not sure how “fair” the steps-versus-reward evaluation would be.
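To make the decomposition idea in the second bullet concrete, this is roughly what I have in mind (a sketch, not something I have implemented; the per-agent potentials `phi_fns[i]` are hypothetical functions over each agent's own subtask progress):

```python
def per_agent_shaped_rewards(r_global, s, s_next, phi_fns, gamma=0.99):
    # Decompose the potential-based reward into per-agent components.
    # Each agent keeps the shared global reward (the cooperation signal)
    # but gets its own shaping term tied to its own subtask progress,
    # so a stalled agent cannot free-ride on teammates' potential gains.
    return [r_global + gamma * phi_i(s_next) - phi_i(s) for phi_i in phi_fns]
```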
Remaining questions:
If an agent’s optimal action only yields a reward when other agents also do the right thing, the agent might learn to avoid that “good” action because it usually results in nothing (or a penalty) when others fail. How would you recommend auditing the critic to see if it’s properly valuing these interdependent actions?
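For reference, the kind of audit I could imagine running is a counterfactual probe of the centralized critic at a recorded deadlock state (a sketch; `good_action` and `coop_joint_action` are hypothetical, hand-specified inputs):

```python
import torch

def audit_interdependent_action(critic, s, joint_action, agent_idx,
                                good_action, coop_joint_action):
    with torch.no_grad():
        q_current = critic(s, joint_action)
        # (1) Swap in the "good" action while teammates act as they currently do.
        a_solo = joint_action.clone()
        a_solo[agent_idx] = good_action
        q_solo = critic(s, a_solo)
        # (2) The good action together with cooperative teammate actions.
        q_coop = critic(s, coop_joint_action)
    # If q_coop >> q_current but q_solo <= q_current, the critic values the
    # coordinated outcome, yet the agent rationally avoids the good action
    # because teammates rarely cooperate -- a coordination failure.
    return q_current.item(), q_solo.item(), q_coop.item()
```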
Beyond reward shaping, what diagnostics would you use to determine if the deadlock is a failure of representation (agents don’t see the task) or coordination (agents see it but don’t value it)?
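The most direct representation test I can think of is a perturbation probe on the actor (a sketch; it assumes the actor returns a `torch.distributions.Distribution`, and `perturb_task_features` is a hypothetical helper that jitters the observation features encoding the available task):

```python
import torch

def representation_sensitivity(actor, obs, perturb_task_features, n=32):
    base_dist = actor(obs)
    kls = []
    for _ in range(n):
        pert_dist = actor(perturb_task_features(obs))
        kls.append(torch.distributions.kl_divergence(pert_dist, base_dist).mean())
    # Near-zero sensitivity suggests a representation failure (the agent does
    # not "see" the task); high sensitivity, combined with the critic audit
    # above showing low Q for the good action, points to a coordination /
    # valuation failure instead.
    return torch.stack(kls).mean().item()
```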