AeroPinWorld: Revisiting Stride-2 Downsampling for Zero-Shot Transferable Open-Vocabulary UAV Detection

Open-vocabulary object detectors enable prompt-driven recognition, yet their zero-shot transfer to unmanned aerial vehicle (UAV) imagery remains fragile under domain shift, where tiny, cluttered targets depend on weak fine-grained cues. We propose AeroPinWorld, a pinwheel-augmented YOLO-World v2 that revisits stride-2 downsampling as a key transfer bottleneck: aggressive resolution reduction can induce aliasing-driven detail loss and sampling-phase sensitivity, which disproportionately harm small-object representations and degrade cross-dataset generalization in aerial scenes. To address this, AeroPinWorld introduces pinwheel-shaped convolution (PConv) as a phase-aware reduction operator. PConv probes complementary offsets via asymmetric padding and directional kernels before feature fusion, strengthening local structure aggregation at downsampling junctions. Importantly, we do not replace all downsampling operations; instead, we selectively substitute PConv at critical pyramid transitions, namely the first two backbone reductions (P1/2 and P2/4) and the two bottom-up stride-2 reductions in the head, while keeping later backbone stages unchanged to preserve efficiency. We evaluate under a strict zero-shot cross-dataset protocol: training on COCO2017 for 24 epochs from official pretrained weights and testing directly on two UAV benchmarks, VisDrone2019-DET and UAVDT, without any target-domain fine-tuning, using an offline prompt vocabulary at inference. Experiments demonstrate consistent improvements over the baseline, including +2.3 mAP and +0.9 AP_S (small-object AP) on VisDrone and further gains on UAVDT, while maintaining a competitive efficiency profile.
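The phase-aware reduction described above can be sketched as follows. This is an illustrative PyTorch sketch, not the paper's released implementation: the four-branch layout, the 3x3 directional kernels, the specific asymmetric padding offsets, and the `PConvDown` name are assumptions, chosen only to mirror the abstract's description of probing complementary sampling phases via asymmetric padding before a fusion step.

```python
import torch
import torch.nn as nn


class PConvDown(nn.Module):
    """Hypothetical pinwheel-shaped stride-2 reduction (illustrative only).

    Four directional branches see the same input under different
    asymmetric zero-paddings, so each stride-2 grid lands on a
    complementary sampling phase; a 1x1 convolution fuses them.
    """

    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        c_branch = c_out // 4
        # (left, right, top, bottom) paddings shifting the receptive
        # field toward each quadrant of the "pinwheel" (assumed offsets).
        pads = [
            (k - 1, 0, k - 1, 0),  # top-left
            (0, k - 1, k - 1, 0),  # top-right
            (k - 1, 0, 0, k - 1),  # bottom-left
            (0, k - 1, 0, k - 1),  # bottom-right
        ]
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.ZeroPad2d(p),
                nn.Conv2d(c_in, c_branch, kernel_size=k, stride=2),
            )
            for p in pads
        )
        # Fuse the phase-complementary branch outputs.
        self.fuse = nn.Conv2d(4 * c_branch, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


if __name__ == "__main__":
    down = PConvDown(c_in=3, c_out=32)
    y = down(torch.randn(1, 3, 64, 64))
    print(y.shape)  # halved spatial resolution: (1, 32, 32, 32)
```

In a setup like this, the module is a drop-in replacement for a plain stride-2 convolution at selected pyramid transitions, which matches the paper's stated strategy of substituting only the early backbone and head reductions.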
