I’m training an AI to drive Indianapolis 500 in DOSBox using reinforcement learning

Hey everyone,

I’ve been working on a reinforcement learning project for the old DOS game **Indianapolis 500**, running through DOSBox. The goal is to train an AI driver that can learn to leave the pit area, stay on track, complete laps, recover from mistakes, and eventually race faster than my own human driving.

Video here:

Indianapolis 500 Game – AI training

The setup uses a mix of:

- **Pixel input** from the DOSBox window
- **Keyboard control** for throttle, brake, left, right, etc.
- **Game-memory telemetry** read directly from DOSBox memory
- **Behavior cloning** from my own recorded driving
- **Recurrent PPO**
- A custom **Transformer + LSTM PPO policy**
- A live reward dashboard so I can see what the agent is being rewarded or punished for

The telemetry currently includes things like:

```text
speed
position/progress around the track
lap completion
wrong direction detection
wall contact / crash detection
damage / hard crash signals
```

Lap detection is not done with OCR. Instead, the program watches a memory value that represents track position. When that value wraps from a high value back to a low value and then passes a confirmation threshold near the start/finish area, the program counts a completed lap. That made lap rewards much more reliable than trying to infer lap completion from pixels.
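
The wrap-then-confirm idea can be sketched roughly like this. It is a minimal illustration, not the actual project code: it assumes the track-position value has already been read from memory and normalized to the range 0 to 1, and the class name `LapDetector` and all three thresholds are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LapDetector:
    """Counts laps from a track-position value that wraps once per lap.

    Assumption: position is normalized to 0..1. All thresholds are hypothetical.
    """
    wrap_high: float = 0.90   # position counts as "high" at or above this
    wrap_low: float = 0.10    # position counts as "low" at or below this
    confirm_at: float = 0.02  # must reach this point after the wrap
    laps: int = 0
    _prev: Optional[float] = None
    _pending: bool = False

    def update(self, pos: float) -> bool:
        """Feed one position sample; return True on the step a lap is confirmed."""
        # Step 1: a jump from the high end to the low end marks a possible lap.
        if (self._prev is not None
                and self._prev >= self.wrap_high
                and pos <= self.wrap_low):
            self._pending = True

        completed = False
        if self._pending:
            if pos >= self.wrap_high:
                # The car rolled back across the line, so cancel the candidate.
                self._pending = False
            elif pos >= self.confirm_at:
                # Step 2: the car moved past the confirmation point.
                self._pending = False
                self.laps += 1
                completed = True

        self._prev = pos
        return completed


detector = LapDetector()
for sample in [0.50, 0.95, 0.99, 0.01, 0.03, 0.50]:
    if detector.update(sample):
        print("lap completed, total:", detector.laps)
```

The confirmation step is what keeps a car that jitters back and forth over the start/finish line from farming lap rewards.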

The reward system currently gives positive reward for:

```text
speed
forward progress
staying on track
finishing laps
finishing laps quickly
```

And penalties for:

```text
going off track
wall contact
wrong direction
heavy crashes
sitting under 10 mph for too long
```
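
To make the reward shape concrete, here is a rough sketch of how those terms could be combined per step. Every weight, the `Telemetry` fields, the `target_lap_s` value, and the `slow_limit_steps` cutoff are placeholders invented for illustration, not the values used in the project. Returning the terms by name, rather than only a sum, is what makes a per-term dashboard straightforward.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Telemetry:
    """One step of telemetry. Field names are hypothetical."""
    speed_mph: float
    progress_delta: float   # forward progress this step, as a fraction of a lap
    on_track: bool
    wall_contact: bool
    wrong_direction: bool
    heavy_crash: bool
    lap_completed: bool
    lap_time_s: float       # only meaningful when lap_completed is True
    slow_steps: int         # consecutive steps spent under 10 mph


def reward_terms(t: Telemetry,
                 slow_limit_steps: int = 150,
                 target_lap_s: float = 60.0) -> Dict[str, float]:
    """Return each reward component by name. All weights are placeholders."""
    terms = {
        "speed": 0.001 * t.speed_mph,
        "progress": 100.0 * max(t.progress_delta, 0.0),
        "on_track": 0.01 if t.on_track else -0.05,
        "wall_contact": -0.5 if t.wall_contact else 0.0,
        "wrong_direction": -0.2 if t.wrong_direction else 0.0,
        "heavy_crash": -5.0 if t.heavy_crash else 0.0,
        "stalled": -0.1 if t.slow_steps > slow_limit_steps else 0.0,
        "lap": 0.0,
        "lap_speed": 0.0,
    }
    if t.lap_completed:
        terms["lap"] = 10.0
        # Extra bonus that grows as the lap time drops below the target.
        terms["lap_speed"] = 10.0 * max(0.0, target_lap_s - t.lap_time_s) / target_lap_s
    return terms


step = Telemetry(speed_mph=200.0, progress_delta=0.001, on_track=True,
                 wall_contact=False, wrong_direction=False, heavy_crash=False,
                 lap_completed=False, lap_time_s=0.0, slow_steps=0)
terms = reward_terms(step)
print(terms, "total:", sum(terms.values()))
```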

I also recorded around 17 human-driven laps and trained a behavior cloning model from that. It helped the agent learn the basic shape of the track, but it also showed an interesting problem: if I overweight rare actions like steering right, the model starts turning right too much and crashes. So now I’m moving more toward PPO fine-tuning, where the agent can improve from telemetry rewards instead of just copying my driving.
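
One common way to soften the overweighting problem is to temper and cap the class weights instead of using full inverse frequency. The sketch below shows that idea only; it is not what the project does, and the function name, the action labels, the `power` exponent, and the `max_weight` cap are all assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def capped_class_weights(actions: Iterable[Hashable],
                         power: float = 0.5,
                         max_weight: float = 3.0) -> Dict[Hashable, float]:
    """Tempered inverse-frequency weights, relative to the most common action.

    power=1.0 gives plain inverse frequency; smaller values flatten the weights.
    max_weight caps how much any rare action can be boosted.
    """
    counts = Counter(actions)
    most_common = max(counts.values())
    weights = {}
    for action, n in counts.items():
        raw = (most_common / n) ** power
        weights[action] = min(raw, max_weight)
    return weights


# Made-up label counts, roughly what an oval with mostly left turns might give.
recorded = (["throttle"] * 600
            + ["throttle+left"] * 384
            + ["throttle+right"] * 16)
print(capped_class_weights(recorded))
```

With these made-up counts, plain inverse frequency would weight right-steering 37.5 times more than plain throttle, while the capped version limits it to 3. If the cloning model is trained in PyTorch, weights like these can be passed to `torch.nn.CrossEntropyLoss(weight=...)`.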

The next step is training the Transformer+LSTM PPO agent for longer, with episode resets on heavy crashes and long dormancy, so it learns that “crash and sit still” is a dead end.
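
The reset rule could look something like the sketch below. It assumes the environment loop already has a speed reading and a heavy-crash flag from telemetry; the class name `ResetMonitor` and the `max_slow_steps` limit are hypothetical, while the 10 mph cutoff is the one mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResetMonitor:
    """Decides when an episode should be reset."""
    slow_mph: float = 10.0       # "under 10 mph" counts as dormant
    max_slow_steps: int = 300    # hypothetical limit on consecutive slow steps
    _slow_steps: int = 0

    def update(self, speed_mph: float, heavy_crash: bool) -> Optional[str]:
        """Return a reset reason ("heavy_crash" or "dormant"), or None to continue."""
        if heavy_crash:
            self._slow_steps = 0
            return "heavy_crash"

        # Count consecutive slow steps; any fast step clears the counter.
        if speed_mph < self.slow_mph:
            self._slow_steps += 1
        else:
            self._slow_steps = 0

        if self._slow_steps >= self.max_slow_steps:
            self._slow_steps = 0
            return "dormant"
        return None


monitor = ResetMonitor(max_slow_steps=3)
for speed, crashed in [(150.0, False), (5.0, False), (5.0, False), (5.0, False)]:
    reason = monitor.update(speed, crashed)
    if reason is not None:
        print("reset episode:", reason)
```

In a Gymnasium-style environment, one reasonable mapping is to report a heavy crash as `terminated` and dormancy as `truncated`, so the value bootstrap treats the two cases differently.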

It’s still very experimental, but it’s been really fun seeing an old racing sim become a reinforcement learning environment. Any feedback on reward design, recurrent PPO setup, or better ways to combine behavior cloning with PPO would be very welcome.

submitted by /u/Few-Night-4811