Sensor Fusion and Perception for Autonomous Driving: A Critical Review of Modalities, AI Models, Algorithms, and Industry Configurations
Autonomous driving systems rely on a sophisticated pipeline of artificial intelligence models to perceive, predict, and plan in dynamic environments. This review presents a systematic analysis of the machine learning and deep learning models underpinning vehicle autonomy, ranging from classical convolutional neural networks (CNNs) for object detection and semantic segmentation to recurrent and Transformer-based architectures for trajectory prediction and motion planning. We critically examine the autonomous vehicle sensor stack, including cameras, LiDAR, radar, ultrasonics, and GNSS/IMU as data acquisition systems, and highlight modality-specific AI challenges such as monocular depth estimation, 3D point cloud processing, and radar Doppler interpretation.
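To make the cross-modal geometry underlying such challenges concrete, the following minimal sketch projects LiDAR points into a camera image, a common early-fusion building block for camera-LiDAR perception. The function name and the toy calibration values are illustrative assumptions, not drawn from any particular system; a pinhole camera model and a known LiDAR-to-camera extrinsic transform are assumed.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project 3D LiDAR points into the camera image plane.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) rigid-body extrinsic transform, LiDAR -> camera.
    K:            (3, 3) pinhole camera intrinsic matrix.
    Returns (M, 2) pixel coordinates and (M,) depths for points that lie
    in front of the camera.
    """
    # Homogeneous coordinates: (N, 4).
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])

    # Transform into the camera frame; keep points with positive depth.
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]

    # Pinhole projection: u = fx * X/Z + cx, v = fy * Y/Z + cy.
    uv_h = (K @ pts_cam.T).T
    uv = uv_h[:, :2] / uv_h[:, 2:3]
    return uv, pts_cam[:, 2]

# Toy usage: identity extrinsic (points already in camera-like axes)
# and a plausible but made-up intrinsic matrix.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
points = np.array([[0.5, 0.2, 10.0],
                   [-1.0, 0.1, 20.0]])
uv, depth = project_lidar_to_image(points, np.eye(4), K)
print(uv, depth)
```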
We review the evolution of perception and decision-making pipelines, contrasting modular architectures with end-to-end learning paradigms that map raw sensor data directly to control commands, and discuss their trade-offs in interpretability, safety assurance, and robustness to rare edge cases. We further survey specialized hardware accelerators and heterogeneous automotive SoCs designed to meet stringent real-time and power constraints, and compare industrial strategies, contrasting multi-modal sensor fusion with vision-centric approaches based on large-scale imitation learning. Finally, we identify open challenges related to robustness under adverse conditions, domain shift, causal ambiguity, and the need for interpretable and certifiable AI in safety-critical autonomous driving systems.
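As a deliberately simplified illustration of the vision-centric, imitation-learning paradigm, the following PyTorch sketch maps raw camera frames directly to low-level control commands via behavioral cloning. The class name, architecture, and hyperparameters are hypothetical assumptions for exposition only; production systems add temporal context, multi-camera inputs, and auxiliary perception losses.

```python
import torch
import torch.nn as nn

class EndToEndDrivingPolicy(nn.Module):
    """Minimal end-to-end policy: raw camera frame -> control commands."""
    def __init__(self):
        super().__init__()
        # Convolutional encoder over a single RGB frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
        )
        # Regression head producing [steering, throttle].
        self.head = nn.Sequential(
            nn.Linear(48 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames))

# One imitation-learning step: regress expert controls from logged frames.
policy = EndToEndDrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
frames = torch.randn(8, 3, 224, 224)      # batch of camera frames (dummy data)
expert_controls = torch.randn(8, 2)       # logged [steering, throttle] targets
loss = nn.functional.mse_loss(policy(frames), expert_controls)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```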