Selective State-Space Models in Medical Image Processing
In medical image analysis, jointly modeling local and global features in high-resolution data presents a significant challenge. While the widely used Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies between distant pixels, the quadratic computational cost (O(N²)) of Vision Transformer (ViT) architectures creates bottlenecks in clinical applications. This study investigates the integration of Mamba models, which were developed to overcome these limitations and offer linear complexity, into medical image analysis, surveying recent studies in the literature. This architecture, fundamentally grounded in continuous-time control theory, adapts dynamically to its input and benefits from a hardware-aware design. Through their selective mechanism, Mamba models effectively retain anatomical structures and lesions in memory while filtering out irrelevant noise. Moreover, bidirectional scanning (Vision Mamba) and cross-scan (VMamba) methods are used to prevent the loss of spatial information and to overcome the one-dimensional sequence processing imposed by the models' language-modeling origins. The reviewed literature can be categorized under three main headings: hybrid models, efficient and lightweight designs, and spatial representation studies. Comprehensive analysis of the literature indicates that Mamba models deliver significantly higher inference speed and memory efficiency than traditional CNN and ViT approaches, owing to their hardware-aware design and linear computational complexity. In conclusion, the Mamba architecture has the potential to become a next-generation standard that delivers high performance while maintaining global contextual integrity across diverse medical fields such as radiology, ophthalmology, and dermatology.
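To make the selectivity and linear-complexity claims concrete, the core of a selective state-space model can be sketched as a discretized linear recurrence whose parameters (Δ, B, C) are computed from the input itself, so the state can retain salient tokens and forget irrelevant ones. This is a minimal illustrative sketch, not the optimized hardware-aware Mamba implementation; all names, shapes, and the plain Python loop (O(L) in sequence length) are assumptions for exposition.

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Illustrative selective state-space scan (not the official Mamba kernel).

    x       : (L, D) input sequence of L tokens with D channels.
    A       : (D, S) negative state-decay parameters (S = state size).
    W_delta : (D, D), W_B : (D, S), W_C : (D, S) projection matrices.

    Delta, B, and C depend on the current input token ("selective"),
    so the recurrence can keep or discard information per token.
    Cost is one recurrence step per token, i.e. linear in L.
    """
    L, D = x.shape
    S = A.shape[1]
    h = np.zeros((D, S))                            # hidden state
    y = np.zeros((L, D))
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))    # softplus step size, (D,)
        B = x[t] @ W_B                              # input projection, (S,)
        C = x[t] @ W_C                              # output projection, (S,)
        A_bar = np.exp(delta[:, None] * A)          # discretized decay, (D, S)
        h = A_bar * h + (delta[:, None] * x[t][:, None]) * B[None, :]
        y[t] = h @ C                                # readout, (D,)
    return y
```

In practice Mamba computes this recurrence with a parallelized, hardware-aware scan rather than a Python loop, which is what yields the reported speed and memory advantages over attention's O(N²) cost.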