PPO w/ RNN for Silkroad Online

I’ve spent nearly the past two years learning RL, and my Silkroad Online project has finally reached a state where the agent can learn a decent policy for the given reward function.

I plan to continue my work and my ultimate goal is to control an entire party of characters (8) for game modes like capture the flag, battle arena, and fortress war.

https://www.youtube.com/watch?v=a29y4Rbvt6U

In the video, the agents are PvPing against each other in 1v1 fights. The RL algorithm currently used is Proximal Policy Optimization (PPO). The neural networks have an RNN component (a GRU) for memory. One RL agent always fights against one “no-op” agent, which does nothing. The RL agent takes whichever actions the neural network thinks are best. Note that although the agent has access to a mana potion, I have it disabled, so it is forced to choose its actions with limited mana.
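
The PPO-with-GRU setup described above can be sketched as a small recurrent actor-critic in PyTorch. This is only an illustrative sketch under my own assumptions; the post doesn’t describe the actual architecture, and all names, sizes, and layer choices here are hypothetical:

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """Hypothetical GRU-based actor-critic of the kind PPO can train.

    All dimensions and layer choices are assumptions for illustration,
    not details from the project.
    """

    def __init__(self, obs_dim, num_actions, hidden_dim=128):
        super().__init__()
        # Encode raw observations before the recurrent core.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # GRU carries memory across timesteps within an episode.
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)             # state value

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0: (1, batch, hidden_dim) or None.
        x = self.encoder(obs_seq)
        x, hn = self.gru(x, h0)
        logits = self.policy_head(x)                 # (batch, time, num_actions)
        values = self.value_head(x).squeeze(-1)      # (batch, time)
        return logits, values, hn
```

During rollouts the hidden state `hn` is carried forward step by step; during the PPO update, stored observation sequences are replayed through the GRU so the gradients see the same memory the agent had at acting time.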

The reward function has two components:

  1. A small negative value proportional to the time elapsed. This incentivizes the agent to end the episode (by killing the opponent) as quickly as possible.

  2. A very small negative value every time the agent chooses to send a packet over the network. This incentivizes the agent to minimize network traffic when all else is equal.

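The two-term reward above can be sketched as a simple per-step function. The function name and the penalty constants below are my own placeholder assumptions; the post only says the values are "small" and "very small":

```python
# Hypothetical penalty scales -- assumed for illustration, not from the post.
TIME_PENALTY_PER_SECOND = 0.01
PACKET_PENALTY = 0.001

def step_reward(dt_seconds, sent_packet):
    """Per-step reward: penalize elapsed time, plus a tiny extra
    penalty whenever the agent chose to send a network packet."""
    r = -TIME_PENALTY_PER_SECOND * dt_seconds
    if sent_packet:
        r -= PACKET_PENALTY
    return r
```

Because both terms are negative, the return is maximized by ending the episode quickly while sending no more packets than necessary, which matches the behavior described below.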
After just a few hours of training, the agent converges on a set of strategies that bring the episode duration down to around 15 seconds. I don’t think there is a way to kill the opponent any faster than this, apart from some RNG luck.

My software is controlling 512 characters concurrently with plenty of headroom for more.

submitted by /u/SandSnip3r
