SkillDiff: Quantifying Fine-Grained Skill Differences from Paired Demonstration Videos

Assessing skill levels from videos of human activities is critical for applications in sports coaching, surgical training, and workplace safety. Existing approaches typically assign a single global skill score to a video, failing to localize where and how skilled performers differ from novices. We propose SkillDiff, a framework that quantifies fine-grained skill differences between paired demonstration videos at the temporal segment level. Our method first aligns expert and novice videos temporally through a learned alignment module, then computes per-segment skill-difference embeddings that capture deviations in execution quality, timing efficiency, and motion patterns. SkillDiff introduces: (1) a Temporal Alignment Backbone that establishes dense frame correspondences between demonstrations at varying skill levels, (2) a Differential Skill Encoder that transforms alignment residuals into interpretable skill-difference features, and (3) a Segment-Level Scoring Head that produces localized quality assessments. Experiments on the BEST, Fis-V, and AQA-7 benchmarks show that SkillDiff achieves state-of-the-art correlation with expert annotations (Spearman's ρ = 0.93 on BEST), while providing the temporally localized feedback that existing global scoring methods cannot.
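To make the align-encode-score pipeline concrete, below is a minimal PyTorch sketch of the three named components. The abstract does not specify the architecture, so everything here is an illustrative assumption: the soft-attention alignment, the MLP encoder, the mean-pooled 16-frame segments, and the use of pre-extracted frame features are all hypothetical stand-ins, not the paper's actual design.

```python
import torch
import torch.nn as nn


class TemporalAlignmentBackbone(nn.Module):
    """Hypothetical alignment module: soft frame correspondences via attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, expert: torch.Tensor, novice: torch.Tensor) -> torch.Tensor:
        # expert: (T_e, D), novice: (T_n, D) frame features
        sim = self.proj(expert) @ novice.T       # (T_e, T_n) similarity matrix
        attn = sim.softmax(dim=-1)               # soft alignment over novice frames
        aligned_novice = attn @ novice           # novice warped onto expert timeline
        return expert - aligned_novice           # per-frame alignment residuals


class DifferentialSkillEncoder(nn.Module):
    """Hypothetical encoder: maps residuals to skill-difference embeddings."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        return self.mlp(residual)                # (T_e, hidden)


class SegmentScoringHead(nn.Module):
    """Hypothetical head: pools fixed-length segments and scores each one."""

    def __init__(self, hidden: int = 256, segment_len: int = 16):
        super().__init__()
        self.segment_len = segment_len
        self.score = nn.Linear(hidden, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        t, h = emb.shape
        n_seg = t // self.segment_len
        segments = emb[: n_seg * self.segment_len].view(n_seg, self.segment_len, h)
        pooled = segments.mean(dim=1)            # (n_seg, hidden) segment embeddings
        return self.score(pooled).squeeze(-1)    # (n_seg,) localized scores


# Toy usage with random features standing in for a frozen video encoder's output.
if __name__ == "__main__":
    D = 128
    expert_feats = torch.randn(64, D)   # 64 expert frames
    novice_feats = torch.randn(80, D)   # 80 novice frames (lengths may differ)

    residual = TemporalAlignmentBackbone(D)(expert_feats, novice_feats)
    emb = DifferentialSkillEncoder(D)(residual)
    scores = SegmentScoringHead()(emb)
    print(scores.shape)                 # torch.Size([4]): one score per 16-frame segment
```

The key design point the sketch captures is that skill differences are computed from alignment residuals rather than from either video alone, which is what allows the scores to be localized per segment instead of collapsing to a single global rating.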
