DPO pair: human-in-the-loop correction
I’ve been thinking about an approach for fine-tuning/RL on limited data, and I’m not sure it’s the right one. Curious if anyone has done something similar. I need a model that generates document templates from structured input plus a natural-language comment. The only data I have are existing compiled templates, with no input/output pairs. The idea is to bootstrap by reverse engineering: feed each template to a strong LLM, extract the parameters that could have generated it, use […]
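
To make the idea concrete, here is a rough Python sketch of how the bootstrapping step and the human-in-the-loop DPO pair from the title might fit together. Everything in it is an assumption for illustration: `extract` stands in for the strong-LLM call (stubbed here by the hypothetical `fake_extract`), `build_prompt` is a made-up prompt layout, and the `prompt`/`chosen`/`rejected` field names follow a common convention for preference data rather than anything stated in the post. Treating the human-corrected output as "chosen" and the model draft as "rejected" is my reading of the title, not a confirmed design.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TrainingExample:
    params: dict    # structured input recovered from the template (assumed shape)
    comment: str    # natural-language comment recovered from the template
    template: str   # the existing compiled template, used as the target output


def build_prompt(params: dict, comment: str) -> str:
    # Hypothetical prompt layout; sorted keys keep the prompt deterministic.
    fields = "\n".join(f"{key}: {params[key]}" for key in sorted(params))
    return f"{fields}\ncomment: {comment}"


def bootstrap_example(
    template: str,
    extract: Callable[[str], Tuple[dict, str]],
) -> TrainingExample:
    # `extract` is a placeholder for the strong-LLM reverse-engineering call.
    params, comment = extract(template)
    return TrainingExample(params=params, comment=comment, template=template)


def make_dpo_pair(
    example: TrainingExample,
    model_draft: str,
    human_correction: str,
) -> Optional[dict]:
    # No preference signal if the reviewer left the draft unchanged.
    if human_correction.strip() == model_draft.strip():
        return None
    return {
        "prompt": build_prompt(example.params, example.comment),
        "chosen": human_correction,   # reviewer-approved version
        "rejected": model_draft,      # what the fine-tuned model produced
    }


def fake_extract(template: str) -> Tuple[dict, str]:
    # Stub standing in for the LLM; returns fixed, made-up values.
    return {"title": "Invoice", "currency": "EUR"}, "keep it short"


example = bootstrap_example("<h1>Invoice</h1>", fake_extract)
pair = make_dpo_pair(example, "<h1>Invoce</h1>", "<h1>Invoice</h1>")
```

The bootstrapped examples would feed a first supervised pass, and the pairs would only accumulate once a reviewer starts correcting model drafts, so the two stages can be built and tested separately.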