Looking for guidance on my first DPO experiment; I have tracing infrastructure that could make dataset building interesting

Hey everyone,

I’m fascinated by RL for LLMs. I have some SFT experience but none with RL, and I’d like to start experimenting with DPO.

Some context: Over time I’ve built a framework for building LLM agents that I use internally at the company where I work. It started as a side project but has evolved quite a bit; I recently added a tracer and an MCP server for Claude on top of it.

What does this mean in practice? Claude (or any LLM) can access every intermediate step of agents and multi-agent systems built with the framework, including reasoning traces, tool calls, and intermediate outputs. I figured this could be a solid foundation for building preference datasets for RL, since you get full observability into what the model did and why.
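To make that concrete, here's roughly the shape I have in mind for turning a trace into a prompt for pair construction. This is a sketch only: the trace schema and the `flatten_trace` helper are illustrative, not the framework's actual API.

```python
# Illustrative only: flatten a tool-call trace (question, reasoning steps,
# tool calls) into a single prompt string for preference-pair building.
# The trace schema here is hypothetical, not the framework's real format.

def flatten_trace(trace: dict) -> str:
    """Render an agent trace as one prompt string, one line per step."""
    lines = [f"Question: {trace['question']}"]
    for step in trace.get("steps", []):
        if step["type"] == "reasoning":
            lines.append(f"Reasoning: {step['text']}")
        elif step["type"] == "tool_call":
            lines.append(f"Tool {step['tool']}({step['args']}) -> {step['result']}")
    return "\n".join(lines)

example_trace = {
    "question": "How many users signed up in 2023?",
    "steps": [
        {"type": "reasoning", "text": "I need to query the users table."},
        {"type": "tool_call", "tool": "run_sql",
         "args": "SELECT COUNT(*) FROM users WHERE year = 2023",
         "result": "[(412,)]"},
    ],
}

print(flatten_trace(example_trace))
```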

My plan: Start with a simple DPO experiment on a small model (~8B parameters; I have a single RTX 4090) using a task with objective ground truth, so I can clearly measure before/after performance.
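For reference, the per-pair DPO objective is -log σ(β·((log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l)))). A tiny numeric sketch (toy log-probs, not real model outputs) that shows the role of β and the sanity-check value at initialization:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (policy margin - ref margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, so margin = 0 and
# the loss is -log(0.5) = log 2 for every pair.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2) ≈ 0.693
```

If the policy's margin on chosen-vs-rejected grows relative to the reference, the loss drops below log 2; if it shrinks, the loss rises, which is why people watch the reward margin during training.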

I’d appreciate any advice on:

– Dataset choice: What’s a good ground-truth benchmark to start with, where results are objectively verifiable? (I was thinking something like text-to-SQL with execution accuracy)

– Preference pair construction: Any tips on how to prompt an LLM judge to build high-quality chosen/rejected pairs from traces?

– Hyperparameters: Which ones are critical to get right for DPO training? What should I watch out for?

– Training metrics: What should I monitor to know if training is going well (or going off the rails)?

– Anything else you wish someone had told you before your first DPO run
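On the text-to-SQL idea, here's a minimal sketch of execution-accuracy pair construction: run each candidate generation against the database and label the one matching the gold query's result set as chosen. The toy in-memory SQLite schema and the `build_pair` helper are illustrative, not from any benchmark.

```python
# Sketch: build chosen/rejected pairs for text-to-SQL by execution accuracy.
# A candidate whose result set matches the gold query is "chosen"; a
# mismatching (or invalid) one is "rejected". Toy schema and data.
import sqlite3

def exec_rows(conn, sql):
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # invalid SQL never matches the gold result

def build_pair(conn, prompt, gold_sql, candidates):
    gold = exec_rows(conn, gold_sql)
    correct = [c for c in candidates if exec_rows(conn, c) == gold]
    wrong = [c for c in candidates if exec_rows(conn, c) != gold]
    if correct and wrong:
        return {"prompt": prompt, "chosen": correct[0], "rejected": wrong[0]}
    return None  # need at least one of each to form a pair

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, signup_year INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 2022), (2, 2023), (3, 2023)])

pair = build_pair(
    conn,
    prompt="How many users signed up in 2023?",
    gold_sql="SELECT COUNT(*) FROM users WHERE signup_year = 2023",
    candidates=[
        "SELECT COUNT(*) FROM users WHERE signup_year = 2023",
        "SELECT COUNT(*) FROM users",  # wrong: missing the year filter
    ],
)
print(pair)
```

One design note: comparing sorted result sets rather than SQL strings means semantically equivalent queries still count as correct, which is the whole point of execution accuracy.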

If anyone has experience with this and wants to experiment together, feel free to DM me. The framework is here: https://github.com/GiulioSurya/Obelix — the tracer and MCP server aren’t public yet but the core agent endpoints are.

Really excited about this, any help is appreciated!

submitted by /u/Juno9419