Hi everyone!
A little over a month ago, I started working on the Wizwand project and launched the first version here, after PWC was sunset by HF.
Today we finished a big update for v2. After seeing some data issues in the old version, I focused on improving two areas:
- Dataset inconsistency (the “apples-to-apples” problem):
- If one method’s evaluation uses the val split and another uses the test split, is that apples-to-apples? If one uses ImageNet-1K at 512×512, should it live on the same leaderboard as the standard 224×224 setup?
- In v1, describing a dataset with a fixed data structure was vague (there are so many variants and different ways to use a dataset), and a missing attribute or descriptor could lead to unfair comparisons.
- In v2, instead of relying only on data structures to describe datasets, we started using an LLM, because natural language is much more accurate for describing a dataset and comparing it with others. It turned out this significantly reduced nonsensical dataset comparisons and groupings (see the sketch after this list).
- Task granularity (the “what even counts as the same task?” problem):
- In v1, we saw issues around how to organize and group tasks, such as “Image Classification” vs “Medical Image Classification” vs “Zero-shot Image Classification”: can they be compared, and what is the parent/subtask relationship?
- In v2, we kept a simpler concept of domain/task labels (as categories) but removed the brittle parent/child taxonomy, aiming for a more precise benchmark definition.
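To make the dataset-comparison idea more concrete, here is a minimal sketch (not Wizwand’s actual implementation) of what an LLM-assisted comparability check could look like. The `EvalSetup` fields, the `llm` callable, and the prompt wording are all assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalSetup:
    # Hypothetical descriptor; the real attributes may differ.
    dataset: str   # e.g. "ImageNet-1K"
    split: str     # e.g. "val" or "test"
    notes: str     # free-form details: resolution, preprocessing, zero-shot, ...

def describe(setup: EvalSetup) -> str:
    """Turn a structured record into a natural-language description."""
    return (f"Dataset: {setup.dataset}; split: {setup.split}; "
            f"details: {setup.notes or 'none given'}")

def comparable(a: EvalSetup, b: EvalSetup, llm: Callable[[str], str]) -> bool:
    """Ask an LLM whether two evaluation setups belong on the same leaderboard.

    `llm` is any function that takes a prompt string and returns the model's
    text reply (an API client, a local model, etc.).
    """
    prompt = (
        "Two papers report results under the following evaluation setups.\n"
        f"A: {describe(a)}\n"
        f"B: {describe(b)}\n"
        "Are these an apples-to-apples comparison (same dataset, same split, "
        "same resolution and protocol)? Answer with exactly YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

# Example: same dataset, but val vs. test and different resolutions,
# so the check should come back negative.
a = EvalSetup("ImageNet-1K", "val", "224x224, supervised fine-tuning")
b = EvalSetup("ImageNet-1K", "test", "512x512, zero-shot")
# comparable(a, b, llm=my_llm_client)  # -> expected False
```

The point of the natural-language description is that details like resolution or zero-shot protocol, which are easy to miss in a rigid schema, stay visible to the comparison step.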
I’d love to invite you to try it out and share feedback: do you find it helpful, and what’s missing for you?
– You can try it out at wizwand.com
– If you are interested, I also wrote a blog post with more details about the new version.
[Screenshot: wizwand.com home page]
[Screenshot: wizwand.com benchmark page (example)]