March 2026

Multi-label Instance-level Generalised Visual Grounding in Agriculture

digitado ⋅ March 10, 2026

arXiv:2603.06699v1 Announce Type: new Abstract: Understanding field imagery such as detecting plants and distinguishing individual crop and weed instances is a central challenge in precision agriculture. Despite progress in vision-language tasks like captioning and visual question answering, Visual Grounding (VG), localising language-referred objects, remains unexplored in agriculture. A key reason is the lack of suitable benchmark datasets for evaluating grounding models in field conditions, where many plants look highly similar, appear at multiple scales, and the referred target […]



Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

digitado ⋅ March 10, 2026

arXiv:2603.06698v1 Announce Type: new Abstract: Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling a 500M parameter global Vision Transformer (CLIP ViT-B/32) into strictly capacity-constrained, local-receptive-field CNNs (0.5M to 8.0M parameters) on the CIFAR-10 dataset. By employing strictly centered Singular Value Decomposition (SVD) and Variance-based Shannon Entropy Effective Rank, we isolate true structural variance from mean-vector artifacts. Our empirical results demonstrate […]
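The abstract above names two measurement tools, centered SVD and an entropy-based effective rank, without the truncated text giving their exact definitions. The following is a minimal sketch of one common formulation, not the authors' implementation: it assumes the spectrum is normalised by squared singular values (variance share) and that the effective rank is the exponential of the Shannon entropy of that distribution. The function name `effective_rank` and the `eps` parameter are hypothetical.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (samples x dims) feature matrix.

    Assumption: variance-weighted normalisation (squared singular values);
    the paper's exact definition is not given in the truncated abstract.
    """
    x = np.asarray(features, dtype=np.float64)
    # Centre each dimension so a large mean vector does not show up as
    # a dominant singular direction.
    x = x - x.mean(axis=0, keepdims=True)
    # Singular values of the centred matrix; their squares are
    # proportional to the variance along each principal direction.
    s = np.linalg.svd(x, compute_uv=False)
    var = s ** 2
    p = var / (var.sum() + eps)
    # Drop exact zeros so that 0 * log(0) is treated as 0.
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

Under these assumptions, features that vary along a single direction give a value near 1 regardless of their mean, while isotropic features in d dimensions give a value near d, so a drop in this number after distillation would be one way to quantify dimensional collapse.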



Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

digitado ⋅ March 10, 2026

arXiv:2603.06697v1 Announce Type: new Abstract: Vision–language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to […]



HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training

digitado ⋅ March 10, 2026

arXiv:2603.06696v1 Announce Type: new Abstract: Purpose: Combining multi-site diffusion MRI (dMRI) data is hindered by inter-scanner variability, which confounds subsequent analysis. Previous harmonization methods require large, matched or traveling human subjects from multiple sites, which are impractical to acquire in many situations. This study aims to develop a deep learning-based dMRI harmonization framework that eliminates the reliance on multi-site in-vivo traveling human data for training. Methods: HARP employs a voxel-wise 1D neural network trained on an easily transportable […]



Soft Equivariance Regularization for Invariant Self-Supervised Learning

digitado ⋅ March 10, 2026

arXiv:2603.06693v1 Announce Type: new Abstract: Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance […]



One-Shot Badminton Shuttle Detection for Mobile Robots

digitado ⋅ March 10, 2026

arXiv:2603.06691v1 Announce Type: new Abstract: This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline that enables efficient labeling from stationary camera footage. We propose a metric […]



Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind

digitado ⋅ March 10, 2026

arXiv:2603.06690v1 Announce Type: new Abstract: Geospatial Foundation Models (GFMs) typically lack native support for Hyperspectral Imaging (HSI) due to the complexity and sheer size of high-dimensional spectral data. This study investigates the adaptability of TerraMind, a multimodal GFM, to address HSI downstream tasks without HSI-specific pretraining. Therefore, we implement and compare two channel adaptation strategies: Naive Band Selection and physics-aware Spectral Response Function (SRF) grouping. Overall, our results indicate a general superiority of deep learning models with native […]



High-Resolution Image Reconstruction with Unsupervised Learning and Noisy Data Applied to Ion-Beam Dynamics for Particle Accelerators

digitado ⋅ March 10, 2026

arXiv:2603.06689v1 Announce Type: new Abstract: Image reconstruction in the presence of severe degradation remains a challenging inverse problem, particularly in beam diagnostics for high-energy physics accelerators. As modern facilities demand precise detection of beam halo structures to control losses, traditional analysis tools have reached their performance limits. This work reviews existing image-processing techniques for data cleaning, contour extraction, and emittance reconstruction, and introduces a novel approach based on convolutional filtering and neural networks with optimized early-stopping strategies in […]



Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

digitado ⋅ March 10, 2026

arXiv:2603.06688v1 Announce Type: new Abstract: We present “Narrative Weaver”, a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences – a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained […]



TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

digitado ⋅ March 10, 2026

arXiv:2603.06687v1 Announce Type: new Abstract: Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal […]
