Understanding Uncertainty Sampling via Equivalent Loss
arXiv:2307.02719v4 Announce Type: replace-cross
Abstract: Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely heuristic: There is no consensus on the proper definition of “uncertainty” for a specific task under a specific loss, nor a theoretical guarantee that prescribes a standard protocol to implement the algorithm. In this work, we systematically examine uncertainty sampling algorithms in the binary classification problem via a notion of equivalent loss which depends on the used uncertainty measure and the original loss function, and establish that an uncertainty sampling algorithm is optimizing against such an equivalent loss. The perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, we give a sample complexity result for the equivalent loss, and later translate it into a binary loss guarantee via the surrogate link function. We prove the asymptotic superiority of the uncertainty sampling against the passive learning via this approach under mild conditions. We also discuss some potential extensions, including pool-based setting and potential generalization to the multi-class classification as well as the regression problems.