Confidence-Aware Gated Multimodal Fusion for Robust Temporal Action Localization in Occluded Environments
In industrial environments, robust Temporal Action Localization (TAL) is essential; however, frequent occlusions often compromise the reliability of skeletal data, leading to negative transfer in multimodal fusion. To address this challenge, we propose a Gated Skeleton Refinement Module (Gated SRM) that explicitly incorporates OpenPose confidence scores into the network architecture. By applying these scores as a logarithmic bias within a self-attention mechanism, our method achieves soft suppression, dynamically attenuating the attention weights assigned to unreliable joints before adaptively fusing the refined skeletal features with RGB representations through a learnable gating network. Extensive experiments on the heavily occluded IKEA ASM dataset demonstrate that our approach prevents the catastrophic accuracy degradation typical of naive fusion strategies, improving mean Average Precision (mAP) to 21.77% and outperforming the RGB-only baseline, while sustaining practical real-time inference speeds of approximately 16 frames per second (FPS). By prioritizing confidence-based data selection over data restoration, this sensor-metadata-driven architecture offers a robust and principled solution for real-world action recognition under occlusion.
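The two mechanisms named in the abstract, a logarithmic confidence bias inside self-attention and a learnable gate over the skeletal and RGB streams, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes; the function names, dimensions, and gate parameterization are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_biased_attention(q, k, v, conf, eps=1e-6):
    """Self-attention over joints with a log-confidence key bias.

    q, k, v: (J, d) per-joint features; conf: (J,) pose-estimator
    confidence in [0, 1]. Adding log(conf) to the pre-softmax scores
    multiplies each key's attention weight by conf, softly suppressing
    unreliable joints (conf -> 0 drives the weight toward 0).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (J, J)
    scores = scores + np.log(conf + eps)[None, :]   # bias columns (keys)
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

def gated_fusion(skel_feat, rgb_feat, w, b):
    """Adaptive fusion: g = sigmoid(w . [skel; rgb] + b),
    fused = g * skel + (1 - g) * rgb."""
    z = np.concatenate([skel_feat, rgb_feat])
    g = 1.0 / (1.0 + np.exp(-(w @ z + b)))
    return g * skel_feat + (1.0 - g) * rgb_feat

# Toy example: the third joint is fully occluded (confidence ~ 0).
rng = np.random.default_rng(0)
q = rng.standard_normal((3, 4)); k = rng.standard_normal((3, 4))
v = rng.standard_normal((3, 4))
conf = np.array([0.9, 0.8, 0.0])
refined, attn = confidence_biased_attention(q, k, v, conf)

fused = gated_fusion(refined[0], rng.standard_normal(4),
                     w=np.zeros(8), b=0.0)  # zero weights -> g = 0.5
```

In this toy run the attention column for the zero-confidence joint is driven to (near) zero, so its coordinates never contaminate the refined features, which is the "selection over restoration" behavior the abstract argues for.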