[P] Three-Phase Self-Inclusive Evaluation Protocol for Synthetic Data Generation in a Fine-Tuned 4B Model (Experiment 3/100)
I’m documenting an ongoing series of reproducible experiments (this is #3 of 100) exploring evaluation methodologies for small fine-tuned models on targeted synthetic data generation tasks.
The experiment implements a three-phase blind evaluation protocol:
- Generation Phase — Multiple models (one fine-tuned 4B model plus several frontier models) receive the same proprietary prompt and produce responses.
- Analysis Phase — Each participating model ranks all generated outputs, including its own (hence “self-inclusive”), on coherence, creativity, logical density, and human-likeness, assigning normalized percentage scores.
- Aggregation Phase — Per-judge scores are compiled and summarized into an overall ranking (a minimal sketch of this step follows the list).
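To make the protocol concrete, here is a minimal sketch of the Aggregation Phase, assuming each judge emits normalized percentage scores over all candidates. The judge names, score values, and the `aggregate` helper are hypothetical illustrations, not data or code from the linked repo:

```python
# Minimal sketch of the Aggregation Phase: average each candidate's
# normalized percentage score across all judges and rank descending.
# All names and numbers below are hypothetical placeholders.
from statistics import mean

# judge -> {candidate: normalized percentage score}
scores = {
    "xthos-4b":   {"xthos-4b": 30, "frontier-a": 40, "frontier-b": 30},
    "frontier-a": {"xthos-4b": 20, "frontier-a": 45, "frontier-b": 35},
    "frontier-b": {"xthos-4b": 25, "frontier-a": 40, "frontier-b": 35},
}

def aggregate(scores):
    """Average each candidate's score across judges; return ranked list."""
    candidates = next(iter(scores.values())).keys()
    avg = {c: mean(judge[c] for judge in scores.values()) for c in candidates}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(aggregate(scores), start=1):
    print(f"{rank}. {model}: {score:.1f}%")
```

A plain mean over judges is the simplest possible compilation; the actual repo may aggregate differently.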
The setup is fully open-source (MIT license) with raw generations, individual analyses, and final aggregation available here:
https://github.com/Roforum/Xthos-v2-the-sovereign-architect-Model-Evaluation-Experiment
The goal is not to claim superiority but to investigate potential biases in LLM-as-judge setups, trade-offs in niche fine-tuning, and the reproducibility of subjective evaluations. The protocol is lightweight and explicitly designed for community replication; local inference via Ollama is supported, as sketched below.
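For replicators, here is a minimal sketch of the Generation Phase against a local Ollama server via its standard `/api/generate` endpoint. The model names and prompt text are placeholders (the actual proprietary prompt is not reproduced here):

```python
# Minimal sketch of the Generation Phase using a local Ollama server.
# Assumes Ollama is running on the default port and the listed models
# are already pulled; model names and PROMPT are placeholders.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["xthos-v2-4b", "llama3.1:70b"]  # hypothetical participant list
PROMPT = "<proprietary synthetic-data prompt goes here>"

generations = {}
for model in MODELS:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    generations[model] = resp.json()["response"]

# Persist raw generations so the Analysis Phase (and replicators) can reuse them.
with open("generations.json", "w") as f:
    json.dump(generations, f, indent=2)
```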
I’d value feedback on:
- Methodological strengths/weaknesses (e.g., proprietary prompt limitations, self-ranking biases)
- Suggestions for more rigorous aggregation or statistical analysis (one possible direction is sketched after this list)
- Ideas for extending the protocol in future iterations
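On the statistics point, one hedged example of what a more rigorous analysis could look like is measuring inter-judge agreement with Kendall's tau over pairs of judges' score vectors. The scores below are hypothetical placeholders; in practice they would be loaded from the per-judge analyses in the repo:

```python
# Sketch: quantify inter-judge agreement with Kendall's tau for each
# pair of judges, over the same ordered list of candidates.
# Scores are hypothetical placeholders.
from itertools import combinations
from scipy.stats import kendalltau

# judge -> scores for the same ordered list of candidates
judge_scores = {
    "xthos-4b":   [30, 40, 30],
    "frontier-a": [20, 45, 35],
    "frontier-b": [25, 40, 35],
}

for (j1, s1), (j2, s2) in combinations(judge_scores.items(), 2):
    tau, p = kendalltau(s1, s2)
    print(f"{j1} vs {j2}: tau={tau:.2f}, p={p:.2f}")
```

Low agreement between judges would itself be evidence for the self-ranking biases the protocol is designed to probe.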
Looking forward to your thoughts on similar evaluation approaches or experiences with small-model fine-tuning trade-offs.
Thanks!