RLHF Pipeline v2 (v3.0.0): Inference + Test-Time Compute Update (MCTS, A*, Hidden Deliberation)
Hey guys, I'm back with the update I mentioned last night. The current internal experimental stack of the RLHF pipeline is now public in a form I'm comfortable posting. This version 2 update (tagged as v3.0.0) begins the shift toward the "final/real" evolution of the stack. The release was planned for after the qwen3-pinion release, since that model has been a major validator for this test-time compute overhaul. The update focuses on the inference-optimization side, introducing hardened MCTS, A* search, hidden-deliberation serve patterns, and a broader upscaling of inference-time capabilities. Unlike the neural router and memory system, this repo can be integrated directly into your personal systems, or, with a little coding (an adapter for your model, YAML config editing, etc.), run straight from the repo. It is still not "clone and play," but it is closer to being runnable in the codebase. I am framing this update through public literature and implementation maturity rather than branding it around any one closed-source system.
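To give a sense of what that "little coding" can look like, a model adapter is usually just a thin wrapper exposing a uniform generate interface over whatever backend you run. This is a sketch under my own assumptions; the class name, method names, and signature are illustrative, not the repo's actual adapter API:

```python
class ModelAdapter:
    """Hypothetical adapter: wraps any generation backend (llama.cpp,
    transformers, an HTTP endpoint, ...) behind one uniform interface.
    Names here are illustrative, not the repo's real contract."""

    def __init__(self, backend):
        # backend: any callable mapping a prompt string to a completion string
        self.backend = backend

    def generate(self, prompt: str, n: int = 1) -> list[str]:
        # Sample n candidate completions for downstream search/scoring
        return [self.backend(prompt) for _ in range(n)]


# Usage with a trivial stand-in backend:
echo = ModelAdapter(lambda p: p.upper())
print(echo.generate("hi", n=2))  # ['HI', 'HI']
```

The point is only that the pipeline sees one stable surface regardless of which model sits behind it; check the repo's own adapter and YAML config for the real field names.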
These updates follow a trail of publicly released work and innovations, starting with OpenAI's "Let's Verify Step by Step" (Lightman et al., with Ilya Sutskever among the authors). The file rlhf.py handles the main runtime/training stack, while modules like inference_optimizations.py, inference_protocols.py, telemetry.py, and benchmark_harness.py extend it with process supervision, verifier-guided scoring, search, and test-time compute.
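As a toy illustration of the verifier-guided idea from "Let's Verify Step by Step": generate several candidate reasoning chains, score every intermediate step with a process verifier, and keep the highest-ranked chain. Here the verifier is a hand-written arithmetic checker standing in for a learned process reward model, and all function names are mine, not the repo's:

```python
def process_score(chain):
    # Toy process verifier: +1 for each arithmetically consistent step,
    # -1 otherwise. A real PRM is a learned model that scores each
    # reasoning step, not a hand-coded rule like this.
    running, score = 0, 0.0
    for addend, claimed_total in chain:
        running += addend
        score += 1.0 if claimed_total == running else -1.0
    return score


def best_of_n(chains):
    # Verifier-guided best-of-N: spend extra test-time compute sampling
    # several chains, then keep the one the process verifier ranks highest.
    return max(chains, key=process_score)


# Each step is (number added, claimed running total):
good = [(2, 2), (3, 5), (4, 9)]   # every intermediate total is correct
bad  = [(2, 2), (3, 6), (4, 9)]   # second step wrongly claims 2 + 3 = 6
print(best_of_n([bad, good]) is good)  # True
```

The repo's MCTS and A* paths generalize this same loop: the verifier's step scores steer which partial chains get expanded instead of only ranking finished ones.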
Exclusive control over post-training infrastructure has allowed a few organizations to artificially monopolize AI capabilities. They claim innovation while simply gating access to reinforcement learning, reward modeling, verifier-guided search, and test-time compute techniques. This repository removes that artificial barrier: it is released under GPLv3 so the stack can be studied, modified, reproduced, and extended in the open. By open-sourcing an all-in-one RLHF runtime plus its surrounding inference, search, telemetry, and merge/export surfaces, I hope to put reproduction of high-end post-training capability directly into the hands of the open-source community and reduce reliance on closed-source alignment and reasoning stacks. Some pay anywhere from $2 to hundreds of dollars for this level of model personalization and optimization; you now have all the tools you need. I personally trained qwen3-pinion (the model used to demonstrate parts of the pipeline) on a laptop with an AMD Ryzen 5 5625U. At $3.99 per hour you can rent an H100 and not only bypass compute cost, but also keep total and complete control over every aspect.
Quick Clone Link:
Full-RLHF-Pipeline Repo: https://github.com/calisweetleaf/Reinforcement-Learning-Full-Pipeline
Drop 1, Neural Router + Memory system:
https://github.com/calisweetleaf/SOTA-Runtime-Core
Drop 3, Moonshine:
https://github.com/calisweetleaf/distill-the-flow
Additional Context:
The qwen3-pinion release can be found on Hugging Face and Ollama. HF hosts the full weights of pinion (qwen3-1.7b, full SFT on Magpie-Align/Magpie-Pro-300K-Filtered, with the LoRA then merged into the base weights). Multiple quantized variants in GGUF format exist on Hugging Face as well as Ollama, ranging across F16, Q8_0, Q5_K_M, and Q4_K_M.
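For anyone unfamiliar with the merge step mentioned above: folding a LoRA into base weights is a one-time weight update, W' = W + (alpha / r) * (B @ A), which is what tooling such as peft's merge_and_unload computes per layer. A dependency-free toy sketch (the shapes and numbers are illustrative, not pinion's actual dimensions):

```python
def matmul(X, Y):
    # Plain-Python matrix product
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def merge_lora(W, A, B, alpha, r):
    # LoRA merge: W' = W + (alpha / r) * (B @ A)
    # W: d_out x d_in base weight, B: d_out x r, A: r x d_in
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 base weight
B = [[1.0], [2.0]]             # d_out x r, with rank r = 1
A = [[3.0, 4.0]]               # r x d_in
print(merge_lora(W, A, B, alpha=1.0, r=1))  # [[4.0, 4.0], [6.0, 9.0]]
```

After the merge the adapter matrices can be discarded, which is why the published pinion weights load like an ordinary qwen3-1.7b checkpoint and quantize to GGUF without any LoRA-aware tooling.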
I welcome comments, questions, feedback, and general discussion, and am more than happy to answer anything you're curious about. This repo is GPLv3: you can do whatever you please with it within the terms of the GPL, including forking, pull requests, collaboration, and integration into your own open-source systems. Thank you for your engagement, and I hope this release adds value to the open-source community!
submitted by /u/daeron-blackFyr