hop hop hop
Maybe you’ve seen my previous post about solving CartPole-v1 with just bitwise ops. I tried to scale that approach to harder environments, but it didn’t get me very far. However, I was inspired by a totally unrelated article – Eigenvalues as models. While the author talks about matrices of size 3×3 and larger, I went the other way: I restricted the weight matrix to be diagonal. This means the eigenvalues are simply the vector elements themselves. To get the maximum or minimum eigenvalue we literally just take the max or min value of the vector. Simple.
Now we can define a function EIGEN(x) that outputs these eigenvalues:
EIGEN(x) = A + xB
Where x is any scalar input, and A and B are diagonal matrices – our parameters.
If you read the “Eigenvalues as models” article you know that we can take max of the eigenvalues to define a convex function and min to define a concave one:
convex(x) = max(EIGEN(x))
concave(x) = min(EIGEN(x))
Since a concave function is just a convex one with flipped sign, we can define a DC function – a difference of two convex functions – and it turns out DC functions can approximate a lot of functions. In our formulation the difference becomes a sum:
DC(x) = convex(x) + concave(x)
This gives us a scalar back, and as long as there are more than 2 eigenvalues (3, 4, …) this function is non-linear – and given enough eigenvalues we have quite a powerful approximator! (With only 2 eigenvalues the function collapses to just the sum of those 2 eigenvalues, i.e. linear.)
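A quick numerical sketch of these definitions (the function name `dc` and the toy parameter values are mine, purely for illustration): with diagonal matrices stored as plain vectors, EIGEN(x) is just A + x*B, and DC(x) adds its max and min. You can also check the 2-eigenvalue collapse directly:

```python
import numpy as np

def dc(x, A, B):
    # Diagonal weight matrices stored as vectors: the eigenvalues of
    # A + x*B are simply the entries of the vector A + x*B.
    eigen = A + x * B
    return eigen.max() + eigen.min()

# With 3+ eigenvalues, DC(x) is a non-linear (piecewise-linear) function of x:
A3 = np.array([0.5, -1.0, 0.2])
B3 = np.array([1.0, -0.5, 2.0])
print([round(float(dc(x, A3, B3)), 2) for x in (-1.0, 0.0, 1.0)])  # → [-2.3, -0.5, 0.7]

# With exactly 2 eigenvalues, max + min = sum, so DC collapses to a
# linear function: DC(x) = (a1 + a2) + x * (b1 + b2).
A2, B2 = np.array([0.5, -1.0]), np.array([1.0, -0.5])
for x in (-1.0, 0.0, 1.0, 2.5):
    assert np.isclose(dc(x, A2, B2), A2.sum() + x * B2.sum())
```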
We can easily extend it to high-dimensional inputs:
EIGEN(x1, x2, x3) = A + x1*B1 + x2*B2 + x3*B3
However, since EIGEN(x) is linear in x, the resulting DC(x) is composed of flat planes – not really great for “smooth” functions. So I made a small modification: I allowed the linear projection to “bend” itself by adding a quadratic term:
LINEAR(x1,x2,x3) = x1*B1 + x2*B2 + x3*B3
EIGEN(x1,x2,x3) = A + LINEAR(x1,x2,x3) + K * LINEAR(x1,x2,x3)^2
The K here is a vector of coefficients that defines how much each eigenvalue “bends”. This hybrid can model both sharp decision boundaries and smooth regions. For example, the picture below shows a perfect fit I trained using 4 eigenvalues, showcasing a sharp decision in the middle and smooth wells on the left and right sides:
Double Well Potential with sharp decision boundary
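The multi-input, quadratic version can be sketched the same way (shapes and the random parameters here are just for illustration – the real values come from training):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_eigen = 3, 4

A = rng.normal(size=n_eigen)               # per-eigenvalue bias
B = rng.normal(size=(n_inputs, n_eigen))   # one diagonal B_i per input, stacked as columns
K = rng.normal(scale=0.1, size=n_eigen)    # how much each eigenvalue "bends"

def dc(x):
    linear = x @ B                          # LINEAR(x1..xn), shape (n_eigen,)
    eigen = A + linear + K * linear**2      # quadratic bend per eigenvalue
    return eigen.max() + eigen.min()        # convex part + concave part

x = np.array([0.3, -1.2, 0.7])
print(dc(x))  # a single scalar output per "neuron"
```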
The only problem is that the min and max ops have issues with gradients: the gradient flows only to the winner. This can be solved by using softmax in the backward pass (softmax is the derivative of logsumexp, which is a smooth approximation of max) – the straight-through estimator (STE) trick. This works pretty well, and we keep the efficient min/max ops in the forward pass (inference).
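A minimal sketch of this trick in plain numpy (no autograd framework here – I’m writing the backward quantities by hand): the forward pass returns the hard max, while the gradient with respect to the eigenvalues uses softmax weights instead of a one-hot winner:

```python
import numpy as np

def ste_max(eigen, temp=1.0):
    """Forward: hard max. Backward: softmax weights (the derivative of
    logsumexp, a smooth approximation of max)."""
    value = eigen.max()
    z = (eigen - eigen.max()) / temp     # shift for numerical stability
    grad = np.exp(z) / np.exp(z).sum()   # softmax: soft "winner" weights
    return value, grad

eigen = np.array([0.2, 1.5, 1.4, -0.3])
value, grad = ste_max(eigen)
print(value)           # → 1.5  (exact max in the forward pass)
print(grad.round(2))   # → [0.12 0.43 0.39 0.07]  (gradient spread over near-winners)
# As temp → 0 the softmax weights approach the hard one-hot argmax.
```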
Now, my loose interpretation of the DC(x) function we’ve defined is that it represents a single neuron – a special one that has multiple connections to a single input x.
So for the BipedalWalker-v3 problem I wanted to do the simplest thing possible. Since we now have a “quite powerful” neuron, I just assigned 4 separate neurons, each controlling one joint independently. I trained them directly with PPO, and somehow they learned to synchronize without any physical link between them.
There are no connections between the neurons. The left leg has no idea the right leg exists. The entire model is just 4 decentralized and stateless “Eigen / DC” neurons, each doing its own thing.
I used 6 eigenvalues for each neuron and distilled the policy down to 69 lines of Python code, which you can just copy-paste and run if you have gymnasium and numpy installed. The entire logic for “hopping”/“walking” is literally here:
import numpy as np
import gymnasium as gym

A = np.array([
    0.167, 0.146, 0., -0.063, -0.110, 0.029, -0.114, 0.081,
    -0.101, -0.072, 0.094, -0.066, 0.238, -0.027, 0.019, -0.131,
    -0.018, 0.088, 0.046, 0.106, 0.062, 0.086, -0.134, 0.039,
])
B_GENERATOR = np.concatenate([np.linspace(-1.272, 1.491, 30), [0.0]])
B_IDX = np.array([
    0x51D9E52FCC93970, 0x8B16E9C669B3A7E, 0x8B14B3FB78A725D, 0xAC3D1745F8BDB3A,
    0x9464F640CAF7989, 0x4F8EB62D4762DB2, 0x5A91E21DD052D6B, 0x4286A081D293E30,
    0x6318E5797E7352C, 0x73E0C92DECF39EF, 0x6B54C4B0C882D48, 0x8ADFE73E2A5C9AE,
    0x3A4C5491684AFCF, 0x8794C67A2D8B20C, 0x649AC52A2B539A9, 0x725EE779CA9314D,
    0x7BD5E5321E7FBCA, 0x5BDEE431B0F4D6B, 0x4AD918359164A13, 0x62FCC6FBCC5A4EE,
    0x4C97E433CE6226C, 0x4B9AB6910CF316F, 0xF79CC6A48A5AD4B, 0x3C0A848A1EF428A,
    0x629CD421DE7C5D6, 0x6B9F5727DE5794B, 0x5C24677A1E8FBD3, 0x779EA879CCF212B,
    0xF79DE73FCF5F9FE, 0xF323E8BDEE5B3CC, 0x639D27FA486B18B, 0x5B3DE73FDE5F96A,
    0x53E2F726707BBC9, 0x93E2C4298D4392F, 0xF7BC863A6C73969, 0x5A96E8219E6318E,
    0x4AD4FF2D7E74DDE, 0x6264D625E85C210, 0x5B98A7A614F7970, 0x7A60A6B59E5B14D,
    0xF39C8F797E637CE, 0x731CB4799EF79C7, 0xF2A3E5B3CE8397E, 0x63D4E8A9928B96C,
    0x839CB82D6C743CC, 0x7795EF29F1F2DAC, 0x67A4C43A6FF3DDE, 0x7560D8C1CA741CF,
], dtype=np.int64)
K = np.array([
    -0.037, 0.018, 0.027, -0.006, 0.021, 0.041, 0.017, -0.011,
    0., 0.011, 0., 0.020, -0.025, -0.023, 0.015, 0.008,
    -0.012, 0., -0.096, 0., 0., 0.014, -0.039, 0.,
])

def policy(state):
    # Unpack the 5-bit indices (12 per int64) into the 24x24 B matrix
    shifts = np.arange(0, 60, 5, dtype=np.int64)
    indices = (B_IDX[:, None] >> shifts) & 0x1F
    idx = indices.flatten().reshape(24, 24)
    B = B_GENERATOR[idx]
    LINEAR = state @ B
    EIGEN = A + LINEAR + (K * (LINEAR**2))
    EIGEN = EIGEN.reshape(4, 6)
    DC = np.max(EIGEN, axis=1) + np.min(EIGEN, axis=1)
    return np.clip(DC, -1, 1)

def run():
    env = gym.make("BipedalWalker-v3", render_mode=None)
    scores = []
    print("Running 10 episodes...")
    for i in range(10):
        obs, _ = env.reset()
        ep_rew = 0
        while True:
            action = policy(obs)
            obs, r, term, trunc, _ = env.step(action)
            ep_rew += r
            if term or trunc:
                break
        scores.append(ep_rew)
        print(f"Ep {i+1}: {ep_rew:.2f}")
    print("-" * 20)
    print(f"Avg: {np.mean(scores):.2f}")
    print(f"Min: {np.min(scores):.2f} Max: {np.max(scores):.2f}")
    env.close()

if __name__ == "__main__":
    run()
This should get you an average score of about 310, which is considered “solved” for this environment.
While it’s no longer just “bitwise ops” like in the CartPole-v1 case, I think it shares the same spirit.
=== EDIT ===
I just realized you can set all the K coefficients to ZERO and it does not hurt the performance. So the “quadratic term” and the “smooth” part were not necessary after all (for this problem), and it is even fewer lines of code 🙂
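For reference, with K zeroed the policy body reduces to the sketch below (written with A and B passed in as arguments; B is assumed already decoded into a 24×24 matrix as in the full script, and the zero parameters here are dummies just to check shapes):

```python
import numpy as np

def policy(state, A, B):
    # With K = 0 the quadratic bend disappears and each eigenvalue
    # is a plain affine function of the observation.
    eigen = (A + state @ B).reshape(4, 6)        # 4 joints x 6 eigenvalues
    dc = eigen.max(axis=1) + eigen.min(axis=1)   # convex + concave per joint
    return np.clip(dc, -1, 1)

# Shape check with dummy parameters (the trained A and B come from the full script):
state = np.zeros(24)
A = np.zeros(24)
B = np.zeros((24, 24))
print(policy(state, A, B).shape)  # → (4,)
```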