Sparse Mixture of Experts for Game AI: An Accidental Architecture
I built a sparse MoE to train ML bots for Color Switch before I knew what one was. LSTM networks trained via PPO would overfit to obstacle subsets and fail to generalize. Routing inputs through clustered ensembles fixed it.

The Problem

Color Switch is a mobile game where players navigate obstacles by matching colors. I trained bots in a reinforcement learning setting via PPO. An individual network would learn to pass ~30% of obstacles, then fail on the rest. Newly trained networks learned different subsets, and no single network generalized.

The Architecture
Each obstacle had metadata: colors, collider counts, rotation speeds, size. These were encoded as min-max scaled feature vectors, and K-means clustering naturally grouped visually and mechanically similar obstacles together.
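A minimal sketch of the clustering step, assuming scikit-learn; the feature values and cluster count below are made up for illustration, not the real Color Switch metadata:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Hypothetical obstacle metadata rows: [num_colors, collider_count, rotation_speed, size].
# The real feature set and values differ; this just shows the shape of the pipeline.
obstacle_features = np.array([
    [4, 1, 0.5, 1.0],
    [4, 3, 1.2, 1.5],
    [2, 2, 0.0, 0.8],
    [3, 6, 2.0, 2.2],
    # ... one row per obstacle type
])

# Min-max scale so no single feature dominates the Euclidean distances.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(obstacle_features)

# Cluster mechanically similar obstacles; the cluster count here is illustrative.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Precomputed router: obstacle index -> cluster ID, cached once at build time.
cluster_lookup = {i: int(c) for i, c in enumerate(kmeans.labels_)}
```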
Each cluster got its own ensemble (multiple LSTMs), trained independently of the others.
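A rough sketch of what one cluster's ensemble could look like, assuming PyTorch; the class names, network sizes, and uniform weighting are placeholders rather than my exact setup:

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """One ensemble member: an LSTM over the observation sequence plus an action head."""
    def __init__(self, obs_dim, hidden_dim, action_dim):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_seq):
        out, _ = self.lstm(obs_seq)      # (batch, time, hidden)
        return self.head(out[:, -1])     # action logits from the last timestep

class ClusterEnsemble(nn.Module):
    """Several independently trained LSTMs serving one obstacle cluster."""
    def __init__(self, members, weights=None):
        super().__init__()
        self.members = nn.ModuleList(members)
        self.weights = weights or [1.0 / len(members)] * len(members)

    def forward(self, obs_seq):
        # Weighted average of member outputs; the result is the ensemble's action logits.
        outs = [w * m(obs_seq) for w, m in zip(self.weights, self.members)]
        return torch.stack(outs).sum(dim=0)
```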
At inference:

1. Identify the approaching obstacle via a spatial hash (O(1) lookup)
2. Look up the obstacle's cluster ID
3. Route observations to the corresponding ensemble
4. Weighted average of the outputs → action

The router was a cached lookup table. No learned routing, just precomputed K-means assignments.

What Worked

Generalization: A bot trained on Classic mode played 5 different modes without retraining. No previous architecture achieved this.

Modular retraining: New obstacle in a cluster? Retrain one ensemble. Underperforming network? Retrain just that network. Ensembles trained in parallel.

Emergent disentanglement: I now think of this as disentangling the manifold at a coarse level before the networks learned finer representations. Obstacles with similar dynamics got processed together, so a network didn't have to learn "this is a circle thing" and "how to pass circle things" simultaneously.

What Didn't Work

Random speed changes: Obstacles that changed speed mid-interaction broke the bots. The architecture helped but didn't solve this.

Superhuman performance: Never achieved. The ceiling was "good human player."

Connection to Transformer MoEs

I didn't know this was even called a sparse MoE until the GPT-4 leak. It's the same pattern: input arrives → router selects expert(s) → outputs combined. DeepSeek's MoE paper describes "centroids" as expert identifiers with cosine similarity routing. Mine used Euclidean distance to K-means centroids. Same idea, less sophisticated.

Takeaways

- Routing to specialized sub-networks based on input similarity works without transformers
- K-means on feature vectors produces surprisingly good routing clusters
- Modular architectures enable incremental retraining
- Generalization improved when I stopped training one network to handle everything

Happy to answer implementation questions.
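To make the routing path concrete, here's a minimal sketch of the inference loop described above. `route_and_act`, `spatial_hash`, `get_observation`, and `ensembles` are hypothetical stand-ins for the game-side code, and in the real system the centroid lookup was read from the precomputed table rather than recomputed per frame:

```python
import numpy as np

def route_and_act(player_pos, spatial_hash, scaler, kmeans, ensembles, get_observation):
    """One inference step: obstacle -> cluster -> ensemble -> action."""
    # 1. Find the approaching obstacle via a spatial hash of the level (O(1) lookup).
    obstacle = spatial_hash.query(player_pos)

    # 2. Cluster ID = nearest K-means centroid by Euclidean distance on scaled features.
    #    (Equivalent to kmeans.predict; at runtime this came from the cached lookup table.)
    feats = scaler.transform(np.asarray(obstacle.features).reshape(1, -1))
    cluster_id = int(np.argmin(np.linalg.norm(kmeans.cluster_centers_ - feats, axis=1)))

    # 3. Route the current observation sequence to that cluster's ensemble.
    obs_seq = get_observation(obstacle)            # shape: (1, time, obs_dim)
    action_logits = ensembles[cluster_id](obs_seq)

    # 4. The ensemble's weighted-average output becomes the bot's action.
    return action_logits.argmax(dim=-1)
```

The trade-off in this design is that the router never learns anything: routing quality depends entirely on how well the K-means clusters line up with the game mechanics, which is also what makes per-cluster retraining cheap.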