March 2026

LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models

digitado ⋅ March 31, 2026

arXiv:2603.26771v1 Announce Type: new Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the […]
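
To make the deferral effect described in the abstract concrete, here is a minimal sketch of confidence-based unmasking on a toy sequence. It is not the paper's code: the function names, the toy probabilities, and the simplification of holding the predictions fixed across steps are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_unmask_order(token_probs, per_step=1):
    """Return positions in the order a confidence-based scheduler reveals them.

    token_probs is a list of per-position probability distributions. They are
    hypothetical model outputs and are held fixed here for simplicity; a real
    masked diffusion model would re-predict after every unmasking step.
    """
    masked = set(range(len(token_probs)))
    order = []
    while masked:
        # Confidence = probability of the most likely token at each position.
        # Ties are broken by position so the result is deterministic.
        ranked = sorted(masked, key=lambda i: (-max(token_probs[i]), i))
        chosen = ranked[:per_step]
        order.extend(chosen)
        masked.difference_update(chosen)
    return order

# Toy sequence: positions 0 and 2 are near-certain content tokens,
# position 1 is an uncertain connective with three plausible candidates.
toy = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.90, 0.07, 0.03],
]
order = confidence_unmask_order(toy)
print(order)
```

In this toy case the high-entropy position is revealed last, after the tokens around it have already been committed, which is the behavior the abstract attributes to the standard strategy.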

technocracy

Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels

digitado ⋅ March 31, 2026

arXiv:2603.26770v1 Announce Type: new Abstract: Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade […]

technocracy

Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption

digitado ⋅ March 31, 2026

arXiv:2603.26769v1 Announce Type: new Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as […]

technocracy

Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models

digitado ⋅ March 31, 2026

arXiv:2603.26768v1 Announce Type: new Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback […]

technocracy

A training-free framework for high-fidelity appearance transfer via diffusion transformers

digitado ⋅ March 31, 2026

arXiv:2603.26767v1 Announce Type: new Abstract: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the […]

technocracy

JND-Guided Neural Watermarking with Spatial Transformer Decoding for Screen-Capture Robustness

digitado ⋅ March 31, 2026

arXiv:2603.26766v1 Announce Type: new Abstract: Screen-shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen-shooting channel introduces severe and entangled degradations including Moiré patterns, color-gamut shifts, perspective warping, and sensor noise. In this paper, we present an end-to-end deep learning framework that jointly […]

technocracy

Bitboard version of Tetris AI

digitado ⋅ March 31, 2026

arXiv:2603.26765v1 Announce Type: new Abstract: The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game […]
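
The excerpt above is cut off before it describes the engine, so the following is only a generic sketch of what a bitboard Tetris representation can look like, not the paper's design. The board layout (one Python integer, row 0 at the bottom, 10 bits per row) and the names `cell_bit`, `collides`, and `clear_lines` are assumptions chosen for illustration.

```python
WIDTH, HEIGHT = 10, 20          # assumed standard playfield size
ROW_MASK = (1 << WIDTH) - 1     # ten set bits = one completely filled row

def cell_bit(row, col):
    """Bit for the cell at (row, col); row 0 is the bottom row."""
    return 1 << (row * WIDTH + col)

def collides(board, piece):
    """A piece overlaps the stack if the two bit sets share any bit."""
    return (board & piece) != 0

def clear_lines(board):
    """Remove every full row and let the rows above drop down.

    Returns the new board and the number of rows cleared.
    """
    cleared = 0
    row = 0
    while row < HEIGHT:
        shift = row * WIDTH
        if (board >> shift) & ROW_MASK == ROW_MASK:
            below = board & ((1 << shift) - 1)   # rows under the full row
            above = board >> (shift + WIDTH)     # rows over the full row
            board = below | (above << shift)     # splice the full row out
            cleared += 1
            # Stay on the same index: a new row has just dropped into it.
        else:
            row += 1
    return board, cleared

# A full bottom row with a single block resting on it at column 3.
board = ROW_MASK | cell_bit(1, 3)
new_board, lines = clear_lines(board)
print(bin(new_board), lines)
```

Collision checks and line clears reduce to a handful of integer operations in this layout, which is the usual motivation for bitboards in game engines.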

technocracy

Low Dose CT for Stroke Diagnosis: A Dual Pipeline Deep Learning Framework for Portable Neuroimaging

digitado ⋅ March 31, 2026

arXiv:2603.26764v1 Announce Type: new Abstract: Portable CT scanners enable early stroke detection in prehospital and low-resource settings but require reduced radiation doses, introducing noise that degrades diagnostic reliability. We present a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans for AI-assisted triage in mobile clinical environments. Controlled Poisson noise is applied to high-dose CT images to simulate realistic LDCT conditions. We compare two pipelines: (1) direct classification of noisy LDCT images and (2) […]

technocracy

A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks

digitado ⋅ March 31, 2026

arXiv:2603.26763v1 Announce Type: new Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native […]

technocracy

Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification

digitado ⋅ March 31, 2026

arXiv:2603.26761v1 Announce Type: new Abstract: Early and precise identification of plant diseases, especially in potato crops, is important to ensure the health of the crops and the maximum yield. Potato leaf diseases, such as Early Blight and Late Blight, pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional methods of detection are not only time-consuming, but are also subject to human error, which is why automated and efficient methods are […]
