Implementation details of PPO: only from the paper and the literature available at the time of publication?
Hi!
I’ve tried to implement PPO for Mujoco based only on the paper and resources available at the time of publication, without looking at any existing implementations of the algorithm.
I have now compared my implementation against the relevant details listed in The 37 Implementation Details of Proximal Policy Optimization, and it turns out I missed most of them; see below.
My question is: Were these details documented somewhere, or were they simply implicit knowledge in the community at the time? Without looking at existing implementations, how would one go about figuring out these details?
Many thanks!
13 core implementation details
| Implementation detail | My implementation | Comment |
|---|---|---|
| 1. Vectorized architecture | N/A | According to the paper, the Mujoco benchmark does not use multiple environments in parallel. I have not yet encountered environments with episodes longer than the number of steps collected in each rollout. |
| 2. a) Orthogonal Initialization of Weights and Constant Initialization of Biases | ❌ | I did not find this in the paper or any linked resources (details 2 to 4 are sketched in the first code block after this table). |
| 2. b) Policy output layer weights are initialized with a scale of 0.01 | ❌ | Mentioned in Nuts and Bolts of Deep RL Experimentation around minute 30. |
| 3. The Adam Optimizer’s Epsilon Parameter | ❌ | I don’t know the history of the Adam parameters well enough to suspect that anything other than the PyTorch default parameters was used. |
| 4. Adam Learning Rate Annealing <br> In MuJoCo, the learning rate linearly decays from 3e-4 to 0. | ❌ | I don’t believe this is mentioned in the paper. Tables 3–5 give the impression that a constant learning rate was used for Mujoco. |
| 5. Generalized Advantage Estimation | ✅ | This seems to be mentioned in the paper. I used 0 for the value function of the next observation after an environment was truncated or terminated (see the GAE sketch after this table). |
| 6. Mini-batch Updates | ✅ | I draw mini-batches by sampling all time steps across all episodes without replacement. |
| 7. Normalization of Advantages | ❌ | I did not find this in the paper or any linked resources. |
| 8. Clipped surrogate objective | ✅ | This is the key novelty and is described in the paper (details 7 to 11 are sketched in the update-step code block after this table). |
| 9. Value Function Loss Clipping | ❌ | I did not find this in the paper or any linked resources. |
| 10. Overall Loss and Entropy Bonus | N/A | Mentioned in the paper, but the Mujoco benchmark did not yet use it. |
| 11. Global Gradient Clipping | ❌ | I did not find this in the paper or any linked resources. |
| 12. Debug variables | N/A | This is not directly relevant for the algorithm to work. |
| 13. Shared and separate MLP networks for policy and value functions | ✅ | It is mentioned that the Mujoco benchmark uses separate networks. |
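
For reference, here is a minimal sketch (in PyTorch, chosen purely for illustration) of details 2a, 2b, 3 and 4 as I now understand them from the 37-details post: orthogonal weight initialization with zero biases, a 0.01 gain on the policy output layer, Adam with eps=1e-5, and a linear learning-rate decay from 3e-4 to 0. Layer sizes, dimensions and the number of updates are illustrative, not necessarily what I actually used.

```python
import torch
import torch.nn as nn

def layer_init(layer, gain=2 ** 0.5, bias_const=0.0):
    # Detail 2a: orthogonal weights, constant (zero) biases.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, bias_const)
    return layer

obs_dim, act_dim = 11, 3  # illustrative dimensions
policy = nn.Sequential(
    layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
    # Detail 2b: the policy output layer is initialized with a gain of 0.01.
    layer_init(nn.Linear(64, act_dim), gain=0.01),
)

# Detail 3: Adam with eps=1e-5 instead of the PyTorch default of 1e-8.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4, eps=1e-5)

# Detail 4: linearly anneal the learning rate from 3e-4 to 0 over all updates.
num_updates = 1000  # illustrative
for update in range(num_updates):
    for group in optimizer.param_groups:
        group["lr"] = (1.0 - update / num_updates) * 3e-4
    ...  # collect a rollout and run the PPO epochs here
```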
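
Detail 5 (GAE) I did implement from the GAE paper; for completeness, here is a sketch of the backward recursion, where `values` has one extra entry for the observation after the last step and the bootstrap value is zeroed at episode ends, as in my comment above:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones: length-T arrays; values: length T+1 (includes the bootstrap value)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_nonterminal = 1.0 - dones[t]  # zeroes the bootstrap value at episode ends
        delta = rewards[t] + gamma * values[t + 1] * next_nonterminal - values[t]
        last_gae = delta + gamma * lam * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]
    return advantages, returns
```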
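
Details 7 to 11 fit into a single mini-batch update step. The sketch below is my reading of the 37-details post, not of the paper; `agent` is a hypothetical nn.Module returning the new log-probabilities, entropies and value predictions for a mini-batch, and the coefficients (clip 0.2, value coefficient 0.5, gradient norm 0.5, entropy coefficient 0 for Mujoco, as in detail 10 above) are illustrative values close to those in the post.

```python
import torch

def ppo_minibatch_update(agent, optimizer, mb_obs, mb_actions, mb_old_logprob,
                         mb_old_value, mb_adv, mb_returns,
                         clip_coef=0.2, vf_coef=0.5, ent_coef=0.0, max_grad_norm=0.5):
    new_logprob, entropy, new_value = agent(mb_obs, mb_actions)

    # Detail 7: normalize advantages per mini-batch.
    mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)

    # Detail 8: clipped surrogate objective (maximized, so the negative is minimized).
    ratio = (new_logprob - mb_old_logprob).exp()
    pg_loss = torch.max(-mb_adv * ratio,
                        -mb_adv * ratio.clamp(1 - clip_coef, 1 + clip_coef)).mean()

    # Detail 9: value loss clipped around the old value predictions.
    v_clipped = mb_old_value + (new_value - mb_old_value).clamp(-clip_coef, clip_coef)
    v_loss = 0.5 * torch.max((new_value - mb_returns) ** 2,
                             (v_clipped - mb_returns) ** 2).mean()

    # Detail 10: overall loss with entropy bonus (coefficient 0 for Mujoco).
    loss = pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()

    optimizer.zero_grad()
    loss.backward()
    # Detail 11: global gradient norm clipping.
    torch.nn.utils.clip_grad_norm_(agent.parameters(), max_grad_norm)
    optimizer.step()
```
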
9 details for continuous action domains (e.g. Mujoco)
| Implementation detail | My implementation | Comment |
|---|---|---|
| 1. Continuous actions via normal distributions <br> 2. State-independent log standard deviation <br> 3. Independent action components <br> 4. Separate MLP networks for policy and value functions | ✅ | These are described in the PPO paper or in references such as Benchmarking Deep Reinforcement Learning for Continuous Control and Trust Region Policy Optimization (a sketch of the policy head follows the table). |
| 5. Handling of action clipping to valid range and storage | N/A | This is not mentioned in the PPO paper. I used a “truncated” normal distribution, which only samples within the valid interval according to the (appropriately renormalized) density of a normal distribution. I haven’t tried a clipped normal distribution, because having zero gradients whenever values are clipped did not seem natural to me. |
| 6. Normalization of Observation <br> 7. Observation Clipping | ❌ | Mentioned in Nuts and Bolts of Deep RL Experimentation around minute 20 (see the observation-normalization sketch after this table). |
| 8. Reward Scaling <br> 9. Reward Clipping | ❌ | A comment on this is also made in Nuts and Bolts of Deep RL Experimentation around minute 20, but I did not understand what exactly was meant (see the reward-scaling sketch after this table). |
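
For completeness, a minimal sketch of details 1 to 3, the diagonal Gaussian policy head with a state-independent log standard deviation, using PyTorch’s `Normal` distribution (layer sizes are illustrative, and this is not necessarily how I structured my own code):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # Detail 2: state-independent log standard deviation, one per action dimension.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        dist = Normal(mean, self.log_std.exp())
        action = dist.sample()
        # Detail 3: independent components, so log-probs sum over the action dimension.
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1)
```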
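
Details 6 and 7 correspond, as far as I can tell, to a running mean/variance normalizer applied to each observation, with the normalized observation clipped to roughly ±10 (the clip value reported in the 37-details post; the baselines VecNormalize wrapper does essentially this). A standalone sketch:

```python
import numpy as np

class RunningObsNormalizer:
    """Keeps running mean/variance of observations and normalizes + clips them."""
    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps
        self.clip = clip

    def __call__(self, obs):
        # Welford-style update of the running statistics with a single observation.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count
        # Detail 6: normalize; detail 7: clip to [-clip, clip].
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8), -self.clip, self.clip)
```

If I redo this, I would also have to store these statistics so that the same normalization can be reused at evaluation time.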
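
Detail 8 is, according to the 37-details post, not a centering of the rewards: the reward is divided by the running standard deviation of a discounted sum of rewards, and detail 9 then clips the scaled reward to a fixed range such as ±10. A sketch of that idea, assuming gamma = 0.99:

```python
import numpy as np

class RewardScaler:
    """Scales rewards by the running std of a discounted return estimate, then clips."""
    def __init__(self, gamma=0.99, clip=10.0):
        self.gamma = gamma
        self.clip = clip
        self.ret = 0.0  # running discounted sum of raw rewards
        self.count, self.mean, self.var = 1e-8, 0.0, 1.0

    def __call__(self, reward, done):
        # Update the discounted return and its running mean/variance.
        self.ret = self.gamma * self.ret + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        if done:
            self.ret = 0.0
        # Detail 8: divide by the std of the discounted return; detail 9: clip.
        return float(np.clip(reward / np.sqrt(self.var + 1e-8), -self.clip, self.clip))
```
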
submitted by /u/adrische