[D] Correct way to compare models
Hello. I would like to hear your opinions on how evaluations are done nowadays. Previously, I worked in a domain with 2 or 3 well-established datasets. New architectures and improvements over existing models were consistently trained and evaluated on these datasets, which made it relatively straightforward to assess whether a paper made a meaningful contribution. I am now shifting to a different topic, where the trend is to use large-scale models that are applied zero-shot or few-shot across many tasks. But […]