Vision–Language Multimodal Learning for UAV-Based Remote Sensing and Geospatial Artificial Intelligence: Tasks, Datasets, Benchmarks, and Foundation Models
Earth observation systems increasingly rely on multimodal data collected from satellites, aircraft, and unmanned aerial vehicles (UAVs). UAV platforms provide flexible, high-resolution, and real-time sensing capabilities that are essential for modern remote sensing, environmental monitoring, urban analysis, disaster response, and intelligent geospatial systems. However, the growing volume and diversity of aerial and geospatial data create a substantial gap between large-scale visual observations and human-level semantic understanding.
Recent advances in vision–language multimodal learning and multimodal large language models offer new opportunities to bridge this gap by enabling cross-modal alignment between imagery and natural language. These techniques allow UAV and remote sensing imagery to be queried, interpreted, and analyzed through language-guided perception, multimodal reasoning, and interactive decision support.
This paper presents a comprehensive survey of vision–language multimodal learning for UAV-based remote sensing and geospatial artificial intelligence (GeoAI). We introduce a structured taxonomy of multimodal task formulations, including image–text retrieval, visual grounding, captioning, visual question answering, and multimodal reasoning for aerial and geospatial imagery. We systematically review representative datasets, benchmarks, model architectures, and evaluation protocols, and summarize recent progress in geospatial foundation models and multimodal large language models for Earth observation and UAV applications.
We further discuss key challenges in multimodal UAV sensing, including domain shift across sensors, spatial reasoning in aerial imagery, temporal modeling in Earth observation, multimodal alignment quality, and trustworthy deployment in real-world geospatial systems. This survey aims to organize the rapidly growing literature on multimodal learning for UAV and remote sensing applications and to highlight future research directions toward scalable, reliable, and reasoning-capable Earth observation and drone-based sensing systems.