February 2026

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

digitado ⋅ February 20, 2026

arXiv:2602.16763v1 Announce Type: new Abstract: Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning […]

Attending to Routers Aids Indoor Wireless Localization

digitado ⋅ February 20, 2026

arXiv:2602.16762v1 Announce Type: new Abstract: Modern machine learning-based wireless localization using Wi-Fi signals continues to face significant challenges in achieving groundbreaking performance across diverse environments. A major limitation is that most existing algorithms do not appropriately weight the information from different routers during aggregation, resulting in suboptimal convergence and reduced accuracy. Motivated by traditional weighted triangulation methods, this paper introduces the concept of attention to routers, ensuring that each router’s contribution is weighted differently when aggregating information from […]
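As a rough illustration of weighting each router differently during aggregation, the sketch below applies a softmax over per-router relevance scores and uses the result to pool per-router feature vectors. The function names, the feature vectors, and the hand-picked scores are all hypothetical; the abstract does not specify how the paper computes its attention scores, so this is only generic softmax-weighted pooling.

```python
import math


def attention_weights(scores):
    """Softmax over per-router relevance scores (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(router_features, scores):
    """Weighted sum of per-router feature vectors using the softmax weights."""
    weights = attention_weights(scores)
    dim = len(router_features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, router_features))
        for i in range(dim)
    ]


# Hypothetical inputs: three routers with 2-D features and made-up scores.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]
pooled = aggregate(features, scores)
```

With equal scores the pooling reduces to a plain average, which is the unweighted baseline the abstract argues against.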

Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

digitado ⋅ February 20, 2026

arXiv:2602.16760v1 Announce Type: new Abstract: We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application […]

Smooth trajectory generation and hybrid B-splines-Quaternions based tool path interpolation for a 3T1R parallel kinematic milling robot

digitado ⋅ February 20, 2026

arXiv:2602.16758v1 Announce Type: new Abstract: This paper presents a smooth trajectory generation method for a four-degree-of-freedom parallel kinematic milling robot. The proposed approach integrates B-spline and Quaternion interpolation techniques to manage decoupled position and orientation data points. The synchronization of orientation and arc-length-parameterized position data is achieved through the fitting of smooth piece-wise Bezier curves, which describe the non-linear relationship between path length and tool orientation, solved via sequential quadratic programming. By leveraging the convex hull properties of […]
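For readers unfamiliar with quaternion interpolation, the sketch below shows spherical linear interpolation (slerp), one common way to interpolate tool orientations. The abstract does not say which quaternion scheme the paper uses, so slerp is an assumption here, as are the function name and the (w, x, y, z) component order.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions given as (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one so we follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)  # guard acos against rounding slightly above 1
    theta = math.acos(dot)
    if theta < 1e-9:
        # Nearly identical orientations: fall back to normalized linear interpolation.
        mixed = [(1.0 - t) * a + t * b for a, b in zip(q0, q1)]
        norm = math.sqrt(sum(c * c for c in mixed))
        return tuple(c / norm for c in mixed)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


# Halfway between the identity and a 90-degree rotation about z
# is a 45-degree rotation about z.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
halfway = slerp(identity, quarter_turn_z, 0.5)
```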

NESSiE: The Necessary Safety Benchmark — Identifying Errors that should not Exist

digitado ⋅ February 20, 2026

arXiv:2602.16756v1 Announce Type: new Abstract: We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such, is not sufficient for guaranteeing safety in general — but we argue that passing this test is necessary for any deployment. […]

The Vulnerability of LLM Rankers to Prompt Injection Attacks

digitado ⋅ February 20, 2026

arXiv:2602.16752v1 Announce Type: new Abstract: Large Language Models (LLMs) have emerged as powerful re-rankers. Recent research has, however, shown that simple prompt injections embedded within a candidate document (i.e., jailbreak prompt attacks) can significantly alter an LLM’s ranking decisions. While this poses serious security risks to LLM-based ranking pipelines, the extent to which this vulnerability persists across diverse LLM families, architectures, and settings remains largely under-explored. In this paper, we present a comprehensive empirical study of jailbreak prompt […]

A Construction-Phase Digital Twin Framework for Quality Assurance and Decision Support in Civil Infrastructure Projects

digitado ⋅ February 20, 2026

arXiv:2602.16748v1 Announce Type: new Abstract: Quality assurance (QA) during construction often relies on inspection records and laboratory test results that become available days or weeks after work is completed. On large highway and bridge projects, this delay limits early intervention and increases the risk of rework, schedule impacts, and fragmented documentation. This study presents a construction-phase digital twin framework designed to support element-level QA and readiness-based decision making during active construction. The framework links inspection records, material production […]

LiveClin: A Live Clinical Benchmark without Leakage

digitado ⋅ February 20, 2026

arXiv:2602.16747v1 Announce Type: new Abstract: The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios […]

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

digitado ⋅ February 20, 2026

arXiv:2602.16746v1 Announce Type: new Abstract: Grokking — the delayed transition from memorization to generalization in small algorithmic tasks — remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects — the non-commutativity of successive gradient steps — and […]

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

digitado ⋅ February 20, 2026

arXiv:2602.16745v1 Announce Type: new Abstract: Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to […]
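To make "agreement with the infinite-budget majority vote" concrete, the sketch below computes that agreement probability in a deliberately simplified setting: two possible answers, independent samples, and an odd budget so there are no ties. Under those assumptions the infinite-budget majority is simply the more likely answer, and the rate is a binomial tail. The function names and the simplification are illustrative only; this is not the PETS allocation method, which the truncated abstract does not describe.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency_rate(p_mode, n):
    """P(majority of n i.i.d. samples equals the infinite-budget majority).

    Assumes two possible answers, where p_mode > 0.5 is the probability of
    the more likely one, and an odd budget n so the vote cannot tie.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.5 < p_mode <= 1.0:
        raise ValueError("p_mode must be in (0.5, 1.0]")
    need = n // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n, k) * p_mode**k * (1.0 - p_mode) ** (n - k)
        for k in range(need, n + 1)
    )


# Hypothetical question whose more likely answer is sampled 60% of the time.
rates = {n: self_consistency_rate(0.6, n) for n in (1, 3, 5)}
```

In this toy setting the rate rises with the budget, and it rises more slowly the closer `p_mode` is to 0.5, which is the intuition behind spending more trajectories on questions where samples disagree.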
