DPO pair: human-in-the-loop correction
I’ve been thinking about an approach for fine-tuning/RL on limited data and I’m not sure it’s the right one; curious if anyone has done something similar.
I need a model that generates document templates from structured input plus a natural-language comment. The only data I have are existing compiled templates, no input/output pairs.
The idea is to bootstrap by reverse engineering: feed each template to a strong LLM, have it extract the parameters that could have generated it, and use those as synthetic training inputs. Then fine-tune on those synthetic input/template pairs.
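A minimal sketch of that bootstrapping step, under my own assumptions (the prompt wording, the field names, and the idea that the LLM returns JSON are all mine, not from the post); the actual LLM call is left out so this only shows the data plumbing:

```python
import json

# Hypothetical prompt asking a strong LLM to reverse-engineer one template.
REVERSE_PROMPT = (
    "Here is a compiled document template:\n\n{template}\n\n"
    "Infer the structured input (as JSON) and the short natural-language "
    "comment that could have produced it."
)

def build_reverse_prompt(template: str) -> str:
    """Format the reverse-engineering prompt for one compiled template."""
    return REVERSE_PROMPT.format(template=template)

def make_sft_example(template: str, extracted: dict) -> dict:
    """Pair the LLM-extracted synthetic input with the real template
    as one supervised fine-tuning example (synthetic input -> real target).

    `extracted` is assumed to be the parsed LLM output, e.g.
    {"structured_input": {...}, "comment": "..."}.
    """
    return {
        "input": json.dumps(extracted["structured_input"], sort_keys=True),
        "comment": extracted["comment"],
        "target": template,
    }
```

The key property is that the *target* side (the template) is real data; only the input side is synthetic, so fine-tuning still learns to produce genuine templates.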
But the part I find most interesting is what happens after deployment. Instead of trying to build a perfect dataset upfront, you capture user feedback in production: a good/bad rating plus a short explanation when something’s off. You then use that feedback text to generate corrected versions, build DPO pairs from them, and retrain iteratively (the rejected response is the one generated by the fine-tuned model; the chosen response is reconstructed by a larger LLM using the user’s feedback as guidance).
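The pair-building step could be sketched like this; the log schema (`rating`, `correction`, etc.) is my assumption, with `correction` standing in for the larger LLM's feedback-guided rewrite. The output matches the `{prompt, chosen, rejected}` record format that DPO trainers such as TRL's `DPOTrainer` typically expect:

```python
def build_dpo_pairs(feedback_log: list[dict]) -> list[dict]:
    """Turn production feedback into DPO preference pairs.

    Each log entry is assumed to look like:
      {"prompt": ..., "model_output": ..., "rating": "good" | "bad",
       "correction": ...}  # correction = larger-LLM rewrite guided by feedback

    Only 'bad' entries that have a correction become pairs: the deployed
    model's output is the rejected side, the corrected version is chosen.
    """
    pairs = []
    for entry in feedback_log:
        if entry["rating"] == "bad" and entry.get("correction"):
            pairs.append({
                "prompt": entry["prompt"],
                "chosen": entry["correction"],
                "rejected": entry["model_output"],
            })
    return pairs
```

Entries rated "good" are dropped here; one could also mine them for extra SFT data, but they carry no preference signal on their own.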
Essentially: treat the first deployed version as a data collection tool, not a finished product.
The tradeoff I see is that you’re heavily dependent on the quality of early user feedback, and if the initial model is too far off, the feedback loop starts from a bad baseline.
Has anyone gone this route? Does the iterative DPO approach actually hold up in practice or does it collapse after a few rounds?
submitted by /u/Juno9419