Has Anyone Seen DPO Hurt Classification Performance on Preference Training Data?
A Vision-Language Model (VLM) was fine-tuned with supervised fine-tuning (SFT) for a 10-class classification task. The resulting model achieved an F1 score of approximately 75% on the evaluation set and was subsequently deployed.
To further improve performance, preference data covering roughly 400 images was collected from production for a specific task. For each image:
* The SFT model’s prediction was compared against a human-reviewed outcome.
* Preference pairs were constructed using the model prediction as the rejected response and the human-corrected outcome as the preferred response.
DPO (Direct Preference Optimization) was then applied starting from the SFT checkpoint.
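To make the pair construction concrete, here is a minimal sketch of the step described above. The `Review` record, the `build_pairs` helper, and the label strings are hypothetical names for illustration, not the actual pipeline. The handling of images where the model was already correct is not stated above; the sketch assumes they are dropped, since a pair whose chosen and rejected responses are identical has a zero margin and contributes no gradient under DPO.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One production image with both labels (hypothetical schema)."""
    image_id: str
    model_label: str  # SFT model's prediction
    human_label: str  # human-reviewed outcome


def build_pairs(reviews):
    """Return (pairs, n_skipped), skipping rows where model and human agree."""
    pairs, skipped = [], 0
    for r in reviews:
        if r.model_label == r.human_label:
            # Assumption: identical chosen/rejected carries no preference signal.
            skipped += 1
            continue
        pairs.append({
            "image_id": r.image_id,
            "chosen": r.human_label,    # preferred response
            "rejected": r.model_label,  # rejected response
        })
    return pairs, skipped


# Toy data, invented for illustration only.
reviews = [
    Review("img_001", "cat", "dog"),
    Review("img_002", "car", "car"),
    Review("img_003", "tree", "bush"),
]
pairs, skipped = build_pairs(reviews)
print(len(pairs), "pairs built,", skipped, "agreeing rows skipped")
```

Counting the skipped rows is worth doing in its own right: if most of the ~400 images were already predicted correctly, the effective number of informative pairs is much smaller than 400.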
Unexpected Result
After DPO training, the updated model was evaluated on the same 400 images used to generate the preference dataset.
Surprisingly, the F1 score decreased compared to the original SFT model, despite the preference data being derived from those exact examples.
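For reference, the comparison can be sketched as below. The averaging mode is not specified above; this assumes macro-averaged F1 over the classes that appear in either the labels or the predictions, and the label lists are toy values rather than real results. `macro_f1` is a hypothetical helper written in plain Python so the example has no dependencies.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2*TP / (2*TP + FP + FN); define it as 0 when undefined.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["a", "a", "b", "c"]
sft_pred = ["a", "b", "b", "c"]
dpo_pred = ["a", "a", "a", "c"]

print("SFT macro F1:", round(macro_f1(y_true, sft_pred), 4))
print("DPO macro F1:", round(macro_f1(y_true, dpo_pred), 4))
```

In the toy data the second model fixes one example but stops predicting class `b` entirely, which lowers macro F1 even though accuracy is unchanged. Checking per-class scores on the real 400 images would show whether something similar happened here.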
Questions
1. Has anyone observed DPO degrading classification metrics such as F1, even on the data used to construct the preference dataset?
2. Could this be due to a mismatch between the DPO objective and the underlying classification objective?
3. Is a preference dataset of only ~400 images likely too small or too noisy for effective DPO training?
4. Are there recommended best practices for applying DPO to multi-class classification tasks, particularly with VLMs?
5. Would alternative approaches be more appropriate in this scenario, such as:
* Additional SFT on corrected labels
* Mixing SFT and preference data during training
* ORPO
* KTO
* Reward modeling followed by optimization
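As background for the objective-mismatch question, here is a small sketch of the standard per-pair DPO loss. The log-probabilities are made-up numbers and `beta=0.1` is an assumed value, not the setting used in this run. It illustrates that the loss depends only on the margin between the chosen and rejected log-ratios, so the absolute log-probability of the chosen label can fall without the loss changing.

```python
import math


def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probs under policy and reference."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Reference (SFT) log-probs for one pair: chosen = -1.0, rejected = -1.5.
# Case A: policy keeps the chosen label likely and pushes the rejected one down.
loss_a = dpo_pair_loss(-1.0, -2.0, ref_chosen=-1.0, ref_rejected=-1.5)
# Case B: policy lowers BOTH labels by 3 nats more; the margin is unchanged.
loss_b = dpo_pair_loss(-4.0, -5.0, ref_chosen=-1.0, ref_rejected=-1.5)

print(loss_a, loss_b)
```

Both cases give the same loss, yet in case B the human-corrected label is far less likely than it was under the SFT model. With 10 classes, the probability mass removed from both labels has to go somewhere, possibly to a third class, which is one way classification metrics could drop while the DPO loss still improves.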
Additional Context
* Task: 10-class image classification using a VLM
* Baseline SFT performance: ~75% F1
* Preference dataset size: ~400 images
* DPO initialized from the SFT checkpoint
* Evaluation performed on the same images used to construct the preference pairs
Any insights, debugging suggestions, references, or similar experiences with DPO for classification-oriented VLM tasks would be greatly appreciated.
submitted by /u/JustZookeepergame382