Proposal for self-improving LLM reasoning

I've come up with an adversarial RL design that could potentially push LLMs to superhuman-level reasoning in a variety of domains.
The setup would involve three actors.

First is the problem generator. It's tasked with generating a problem and its solution, let's say for coding.

Second is the validator agent. This agent is frozen; all it does is take the problem produced by the generator and ask some important questions like, “Is the problem syntactically correct?” “How clear are the instructions?”

We then check the problem (in this case, code) to see if it runs properly and the reference solution actually passes. If it doesn't pass, we “re-roll”. Then we grade the problem by how “well-written” it is according to these factors.
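A minimal sketch of that validate-and-re-roll loop. The names `generate_problem` and `run_tests` are hypothetical placeholders for the generator call and a sandboxed test runner, not a real API:

```python
# Hypothetical validate-and-re-roll loop. generate_problem and run_tests
# are placeholders: the former samples a {prompt, solution, tests} dict
# from the generator, the latter executes the solution against the tests.

def make_valid_problem(generate_problem, run_tests, max_rolls=10):
    """Keep sampling until the reference solution actually passes its own tests."""
    for _ in range(max_rolls):
        problem = generate_problem()  # e.g. {"prompt": ..., "solution": ..., "tests": ...}
        if run_tests(problem["solution"], problem["tests"]):
            return problem            # solution runs and passes: keep this problem
    return None                       # give up after max_rolls re-rolls
```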

Third is the solver agent, the main agent whose reasoning capabilities we are trying to improve. The solver receives the problem from the generator and is run to generate at least 100 solutions at a decent temperature to provide variance.

Then we grade each solution by our metric; for coding we will use accuracy, execution time, memory usage, and lines of code (the simpler the better).

Each grade is then normalized against the pool average, and the normalized grades are combined as a weighted average, with the weights determining the importance of each reward. This gives us a final value telling us how good a solution is relative to all the other solutions in the pool.
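The pool-relative grading could be sketched like this. The metric names and weights are illustrative assumptions; lower-is-better metrics (execution time, memory, line count) get negative weights so that smaller raw values score higher:

```python
# Sketch of pool-relative grading: z-score each metric against the batch
# of sampled solutions, then combine with per-metric weights. Metric names
# and weight values are illustrative, not part of the original proposal.
from statistics import mean, pstdev

def relative_scores(solutions, weights):
    """solutions: list of dicts mapping metric name -> raw value.
    weights: dict mapping metric name -> weight (negative = lower is better).
    Returns one scalar per solution, centered on the pool average."""
    stats = {m: (mean([s[m] for s in solutions]),
                 pstdev([s[m] for s in solutions]) or 1.0)  # avoid div-by-zero
             for m in weights}
    return [sum(w * (sol[m] - stats[m][0]) / stats[m][1]
                for m, w in weights.items())
            for sol in solutions]
```

By construction the scores sum to (roughly) zero across the pool, so "good" and "bad" are always relative to the current batch rather than an absolute scale.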

Then we run a reinforcement learning step over the solver's weights, rewarding good solutions and penalizing bad ones.
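One way to realize "reward good, penalize bad" is a REINFORCE-style loss, where each sampled solution's log-likelihood is weighted by its pool-relative score. This toy function only shows the loss arithmetic and assumes per-solution log-probs are already collected from the solver; a real training step would of course use an autodiff framework:

```python
# Toy REINFORCE-style objective: above-average solutions (positive advantage)
# are reinforced, below-average ones penalized. logprobs would come from the
# solver model; here they are plain floats for illustration.

def policy_gradient_loss(logprobs, advantages):
    """Negative advantage-weighted log-likelihood, averaged over the pool."""
    assert len(logprobs) == len(advantages)
    return -sum(a * lp for lp, a in zip(logprobs, advantages)) / len(logprobs)
```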

For the problem generator we also run a reinforcement learning step, but its grade is determined by two factors: how “well-written” the problem is, and how close the solver got to a 50% pass rate. So instead of solely trying to generate the hardest problem possible, we want problems with roughly a 50% clear rate, which is just hard enough. The reason is to prevent unsolvable or malformed problems from being rewarded while still providing enough selective pressure.
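The generator's reward could be sketched as a quality term plus a difficulty term that peaks at a 50% solver pass rate. The 50/50 weighting and the linear falloff are my assumptions, not specified above:

```python
# Sketch of the generator's two-factor reward. The weights and the linear
# difficulty falloff are illustrative choices, not fixed by the proposal.

def generator_reward(quality, pass_rate, w_quality=0.5, w_difficulty=0.5):
    """quality: validator's well-written score in [0, 1].
    pass_rate: fraction of the solver's sampled solutions that passed.
    Difficulty term is 1.0 at a 50% pass rate and falls off linearly to 0.0
    at 0% (too hard / malformed) or 100% (too easy)."""
    difficulty = 1.0 - 2.0 * abs(pass_rate - 0.5)
    return w_quality * quality + w_difficulty * difficulty
```

Problems the solver always fails and problems it always passes both score poorly on the difficulty term, which is what keeps the generator near the edge of the solver's ability.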

The expected result is to push the AI to continuously solve harder problems, thus improving its reasoning capabilities. The problem generator must learn to generate harder and more novel problems; otherwise the solver will quickly learn the current problem distribution and pass more than 50% of the time.

Optional: a grounding step, done by simply remixing popular problems in the domain. This prevents significant drift and ensures diversification.

This idea can also be extended to more domains. I was thinking math would work, and for verbal reasoning and cleverness we could use riddles.

submitted by /u/Classic_Sheep
