TD3 models trained with identical scripts produce very different behaviors

I’m a graduate research assistant working on autonomous vehicle research using TD3 in MetaDrive. I was given an existing training script by my supervisor. When training finishes, the script saves a .zip model file (Stable-Baselines3 format).

My supervisor has a trained model .zip, and I trained my own model using what appears to be the exact same script: same reward function, wrapper, hyperparameters, architecture, and total timesteps.
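In case it helps, the script is essentially the standard SB3 TD3 loop. Here's a minimal sketch of what it does — the env config, net size, and hyperparameters below are placeholders, not our exact values:

```python
from metadrive.envs import MetaDriveEnv
from stable_baselines3 import TD3

# Placeholder config -- the real script wraps the env and shapes the reward.
env = MetaDriveEnv(dict(use_render=False))
model = TD3(
    "MlpPolicy",
    env,
    learning_rate=1e-3,                       # illustrative hyperparameters
    policy_kwargs=dict(net_arch=[256, 256]),  # illustrative architecture
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("td3_metadrive")  # produces td3_metadrive.zip
```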

Now here’s the issue: when I load the supervisor’s .zip into the evaluation script, it performs well. When I load my .zip (trained using the same script) into the same evaluation script, the behavior is very different.

To investigate, I compared both .zip files (sketch of the comparison code after the list):

  • The internal architecture matches (same actor/critic structure).
  • The keys inside policy.pth are identical.
  • But the learned weights differ significantly.
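Roughly how I did the comparison — an SB3 .zip is a regular zip archive with policy.pth inside (the filenames here are hypothetical):

```python
import io
import zipfile

import torch

def load_policy_state(path):
    # SB3 save files are zip archives; policy.pth holds the policy state_dict.
    with zipfile.ZipFile(path) as zf:
        with zf.open("policy.pth") as f:
            return torch.load(io.BytesIO(f.read()), map_location="cpu")

a = load_policy_state("supervisor_model.zip")
b = load_policy_state("my_model.zip")

assert a.keys() == b.keys()  # same keys -> same actor/critic structure
for k in a:
    diff = (a[k] - b[k]).abs().max().item()
    print(f"{k}: max abs weight difference = {diff:.4f}")
```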

I also tested both models on the same observation and printed the predicted actions. The supervisor’s model outputs small, smooth steering and throttle values, while mine often saturates steering or throttle near ±1. So the policies are clearly behaving differently.
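The action check looked roughly like this (a sketch, using a zeroed observation as a stand-in for the probe observation I actually used):

```python
import numpy as np
from stable_baselines3 import TD3

sup = TD3.load("supervisor_model.zip")   # hypothetical filenames
mine = TD3.load("my_model.zip")

# Same fixed observation fed to both policies.
obs = np.zeros(sup.observation_space.shape, dtype=np.float32)
a_sup, _ = sup.predict(obs, deterministic=True)
a_mine, _ = mine.predict(obs, deterministic=True)
print("supervisor:", a_sup)   # small, smooth steering/throttle
print("mine:      ", a_mine)  # often saturated near ±1
```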

The only differences I’ve identified so far are minor version mismatches (SB3 2.7.0 vs 2.7.1, Python 3.9 vs 3.10, slightly different Gymnasium versions), and I did not fix a random seed during training.
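If I were to pin the randomness on the next run, I believe it would look something like this (a sketch with a stand-in env; I understand GPU nondeterminism can still leak in):

```python
import gymnasium as gym
from stable_baselines3 import TD3
from stable_baselines3.common.utils import set_random_seed

SEED = 42
set_random_seed(SEED)                     # seeds Python's random, NumPy, and torch

env = gym.make("Pendulum-v1")             # stand-in env; ours is the MetaDrive wrapper
env.reset(seed=SEED)                      # seed the environment itself
model = TD3("MlpPolicy", env, seed=SEED)  # seeds SB3's RNG and the action space
model.learn(total_timesteps=10_000)
```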

In continuous control with TD3, is it normal for two models trained separately (but with the same script) to end up behaving this differently just because of randomness?

Or does this usually mean something is not exactly the same in the setup?

If differences like this are not expected, where should I look?

submitted by /u/spyninj