Embedding-Enhanced U-Net with a Patch-Based Approach and a Novel Dataset for Ground-Level Building Damage Segmentation
Ground-level building damage assessment captures critical structural details that remain invisible in satellite imagery, yet this perspective remains severely underexplored in current research. We address this gap by introducing the first war damage dataset for ground-level building segmentation, comprising high-resolution side-view images of war-affected Ukrainian buildings with pixel-wise annotations across six semantic classes: Other, Building, Roof, Damage, Damaged Roof, and Broken Window. To preserve original image resolution while enabling efficient deep learning, we employ a patch-based strategy that divides each image into fixed-size regions, generating thousands of training samples from the original images. We propose an embedding-enhanced U-Net framework that enriches each patch with global ConvNeXt-Large embeddings and positional encodings to provide scene-level context and spatial awareness. We systematically evaluate six encoder architectures (ResNet-50, SwinV2-Large, ConvNeXt-Large, YOLO11x-seg, DINOv2, and SegFormer-b5) across 48 configurations, testing both simplified three-class and complex six-class segmentation tasks with and without embedding integration and Felzenszwalb superpixel post-processing. Results demonstrate substantial performance gains from embedding integration: ResNet-50 achieved a +7.81 pp IoU improvement, reaching 0.7743 IoU and 0.8982 F1-score for three-class segmentation, while DINOv2 attained the best six-class performance with 0.4711 IoU and 0.7462 F1-score, a +4.65 pp IoU gain.
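To make the patch-based strategy concrete, the following is a minimal sketch of how a high-resolution image can be split into fixed-size patches while recording each patch's grid position for a positional encoding. The patch size of 256 and the reflect-padding choice are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def extract_patches(image, patch_size=256):
    """Split an (H, W, C) image into non-overlapping fixed-size patches.

    Borders are reflect-padded so every patch is full-size. Returns the
    stacked patches and the (row, col) grid position of each patch, which
    could later feed a positional encoding alongside a global embedding.
    Note: patch_size=256 is an assumed value for illustration.
    """
    h, w, _ = image.shape
    pad_h = (-h) % patch_size  # rows needed to reach a multiple of patch_size
    pad_w = (-w) % patch_size  # columns needed likewise
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")
    patches, positions = [], []
    for i in range(0, padded.shape[0], patch_size):
        for j in range(0, padded.shape[1], patch_size):
            patches.append(padded[i:i + patch_size, j:j + patch_size])
            positions.append((i // patch_size, j // patch_size))
    return np.stack(patches), positions
```

In the full framework, each patch would additionally be paired with a scene-level embedding of the whole image (ConvNeXt-Large in the paper) and an encoding of its grid position before being fed to the U-Net.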