[D] Correct way to compare models
Hello.
I would like to hear your opinions on how evaluations are being done nowadays.
Previously, I worked in a domain with 2 or 3 well-established datasets. New architectures or improvements over existing models were consistently trained and evaluated on these datasets, which made it relatively straightforward to assess whether a paper provided a meaningful contribution.
I am shifting to a different topic, where the trend is to use large-scale models that can zero-shot/few-shot across many tasks. Now it has become increasingly difficult to tell whether a paper offers a genuine improvement or simply more aggressive scaling and data usage to push the metrics higher.
For example, I have seen papers (at A* conferences) that propose a method to improve a baseline, finetune it on additional data, and then compare against the original baseline without that finetuning.
In other cases, papers train on the same data, but when I look into the configuration files, they simply use bigger backbones.
There are also works that heavily follow the LLM/VLM trend and omit comparisons with traditional specialist models, even when those are highly relevant to the task.
Recently, I submitted a paper. We proposed a new training scheme and carefully selected baselines with comparable architectures and parameter counts to isolate and correctly assess our contribution. However, the reviewers requested comparisons with models that have 10-100x more parameters, more training data, and different input conditions.
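To make concrete what "comparable" meant in our setup, here is a minimal sketch of how one might screen candidate baselines by parameter count and training-data budget before running any metric comparison, so that differences can be attributed to the training scheme rather than to scale. The model names, numbers, tolerance, and helper function are purely illustrative assumptions, not our actual code.

```python
# Hypothetical sketch: keep only baselines whose size and data budget stay
# within a tolerance of ours. All names and numbers are illustrative.

OURS = {"params_m": 86, "train_images_m": 1.2}   # assumed reference model
TOLERANCE = 1.5                                   # allow up to 1.5x difference

candidates = {
    "baseline_a": {"params_m": 90,  "train_images_m": 1.2},
    "baseline_b": {"params_m": 900, "train_images_m": 1.2},   # 10x bigger backbone
    "baseline_c": {"params_m": 88,  "train_images_m": 40.0},  # far more data
}

def comparable(candidate, reference, tol):
    """True if the candidate is within `tol`x of the reference on both axes."""
    for key in ("params_m", "train_images_m"):
        ratio = candidate[key] / reference[key]
        if not (1 / tol <= ratio <= tol):
            return False
    return True

fair_baselines = [name for name, spec in candidates.items()
                  if comparable(spec, OURS, TOLERANCE)]
print(fair_baselines)  # -> ['baseline_a']
```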
Sure, we perform better in some cases (unsurprisingly, since these are our benchmarks and tasks) and we are also faster (obviously), but what conclusion can I, or the reviewers, draw from such comparisons?
What do you think about this? As a reader or a reviewer, how can you pinpoint where the true contribution lies among a forest of different conditions? Are we becoming too satisfied with higher benchmark numbers?
submitted by /u/ntaquan