SkillDiff: Quantifying Fine-Grained Skill Differences from Paired Demonstration Videos

Assessing skill levels from videos of human activities is critical for applications in sports coaching, surgical training, and workplace safety. Existing approaches typically assign a single global skill score to a video, failing to localize where and how skilled performers differ from novices. We propose SkillDiff, a framework that quantifies fine-grained skill differences between paired demonstration videos at the temporal segment level. Our method first aligns expert and novice videos temporally through a learned alignment module, then computes per-segment skill-difference embeddings that capture deviations in execution quality, timing efficiency, and motion patterns. SkillDiff introduces: (1) a Temporal Alignment Backbone that establishes dense frame correspondences between demonstrations at varying skill levels, (2) a Differential Skill Encoder that transforms alignment residuals into interpretable skill-difference features, and (3) a Segment-Level Scoring Head that produces localized quality assessments. Experiments on the BEST, Fis-V, and AQA-7 benchmarks show that SkillDiff achieves state-of-the-art correlation with expert annotations (Spearman's ρ = 0.93 on BEST), while providing the temporally localized feedback that existing global scoring methods cannot.
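To make the align-encode-score pipeline concrete, below is a minimal PyTorch sketch of the three named components. The abstract does not specify the architecture, so everything here is an illustrative assumption: the soft-attention alignment, the MLP encoder, the mean-pooled 16-frame segments, and the use of pre-extracted frame features are all hypothetical stand-ins, not the paper's actual design.

```python
import torch
import torch.nn as nn


class TemporalAlignmentBackbone(nn.Module):
    """Hypothetical alignment module: soft frame correspondences via attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, expert: torch.Tensor, novice: torch.Tensor) -> torch.Tensor:
        # expert: (T_e, D), novice: (T_n, D) frame features
        sim = self.proj(expert) @ novice.T       # (T_e, T_n) similarity matrix
        attn = sim.softmax(dim=-1)               # soft alignment over novice frames
        aligned_novice = attn @ novice           # novice warped onto expert timeline
        return expert - aligned_novice           # per-frame alignment residuals


class DifferentialSkillEncoder(nn.Module):
    """Hypothetical encoder: maps residuals to skill-difference embeddings."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        return self.mlp(residual)                # (T_e, hidden)


class SegmentScoringHead(nn.Module):
    """Hypothetical head: pools fixed-length segments and scores each one."""

    def __init__(self, hidden: int = 256, segment_len: int = 16):
        super().__init__()
        self.segment_len = segment_len
        self.score = nn.Linear(hidden, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        t, h = emb.shape
        n_seg = t // self.segment_len
        segments = emb[: n_seg * self.segment_len].view(n_seg, self.segment_len, h)
        pooled = segments.mean(dim=1)            # (n_seg, hidden) segment embeddings
        return self.score(pooled).squeeze(-1)    # (n_seg,) localized scores


# Toy usage with random features standing in for a frozen video encoder's output.
if __name__ == "__main__":
    D = 128
    expert_feats = torch.randn(64, D)   # 64 expert frames
    novice_feats = torch.randn(80, D)   # 80 novice frames (lengths may differ)

    residual = TemporalAlignmentBackbone(D)(expert_feats, novice_feats)
    emb = DifferentialSkillEncoder(D)(residual)
    scores = SegmentScoringHead()(emb)
    print(scores.shape)                 # torch.Size([4]): one score per 16-frame segment
```

The key design point the sketch captures is that skill differences are computed from alignment residuals rather than from either video alone, which is what allows the scores to be localized per segment instead of collapsing to a single global rating.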
