January 2026

Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

digitado ⋅ January 29, 2026

arXiv:2601.19918v1 Announce Type: new Abstract: Hallucinations in Large Language Models (LLMs), i.e., the tendency to generate plausible but non-factual content, pose a significant challenge for their reliable deployment in high-stakes environments. However, existing hallucination detection methods generally operate under unrealistic assumptions, i.e., either requiring expensive intensive sampling strategies for consistency checks or white-box LLM states, which are unavailable or inefficient in common API-based scenarios. To this end, we propose a novel efficient zero-shot metric called Lowest Span Confidence […]

PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

digitado ⋅ January 29, 2026

arXiv:2601.19917v1 Announce Type: new Abstract: Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning […]

PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

digitado ⋅ January 29, 2026

arXiv:2601.19916v1 Announce Type: new Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection […]

Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

digitado ⋅ January 29, 2026

arXiv:2601.19915v1 Announce Type: new Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry–Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, among […]

Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

digitado ⋅ January 29, 2026

arXiv:2601.19914v1 Announce Type: new Abstract: Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for […]

From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

digitado ⋅ January 29, 2026

arXiv:2601.19913v1 Announce Type: new Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase […]

Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study

digitado ⋅ January 29, 2026

arXiv:2601.19912v1 Announce Type: new Abstract: Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis […]

GPU-Augmented OLAP Execution Engine: GPU Offloading

digitado ⋅ January 29, 2026

arXiv:2601.19911v1 Announce Type: new Abstract: Modern OLAP systems have mitigated I/O bottlenecks via storage-compute separation and columnar layouts, but CPU costs in the execution layer (especially Top-K selection and join probe) are emerging as new bottlenecks at scale. This paper proposes a hybrid architecture that augments existing vectorized execution by selectively offloading only high-impact primitives to the GPU. To reduce data movement, we use key-only transfer (keys and pointers) with late materialization. We further introduce a Risky Gate […]
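
The key-only transfer with late materialization mentioned in this abstract can be illustrated with a minimal CPU-only sketch. It is not the paper's engine: the function name `top_k_late_materialization`, the row layout, and the use of `heapq` in place of a GPU kernel are all assumptions made for illustration. The point is only that Top-K selection runs on narrow (key, pointer) pairs, and full rows are fetched afterwards for the winners alone.

```python
import heapq


def top_k_late_materialization(rows, key_column, k):
    """Return the k rows with the largest key, touching wide rows only at the end.

    Hypothetical helper: a CPU stand-in for an offloaded Top-K primitive.
    """
    # Step 1: project only (key, row index) pairs. In an offloading design,
    # this narrow payload is what would be shipped to the accelerator.
    keys = [(row[key_column], idx) for idx, row in enumerate(rows)]

    # Step 2: select the top-k using the narrow pairs alone.
    winners = heapq.nlargest(k, keys)

    # Step 3: late materialization, fetching full rows for the k winners only.
    return [rows[idx] for _, idx in winners]


if __name__ == "__main__":
    table = [
        {"id": 1, "amount": 10, "note": "wide payload a"},
        {"id": 2, "amount": 75, "note": "wide payload b"},
        {"id": 3, "amount": 42, "note": "wide payload c"},
        {"id": 4, "amount": 99, "note": "wide payload d"},
    ]
    for row in top_k_late_materialization(table, "amount", 2):
        print(row["id"], row["amount"])
```

In this sketch the wide `note` column is never read during selection, which is the data-movement saving that late materialization is meant to provide.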

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading

digitado ⋅ January 29, 2026

arXiv:2601.19910v1 Announce Type: new Abstract: KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives $\kappa_{\text{crit}}$, the critical cached-to-prefill token ratio at which execution becomes memory-bound, and show that typical workloads exceed this threshold by orders of magnitude. Empirical characterization reveals that 99% of latency is spent on transfers and that serving offloaded requests results in GPUs consuming only 28% of their rated […]

A Cache-Aware Hybrid Sieve Combining Segmentation and Bit-Packing for Fast Prime Generation

digitado ⋅ January 29, 2026

arXiv:2601.19909v1 Announce Type: new Abstract: Prime generation is a fundamental task in cryptography, number theory, and randomized algorithms. While the classical Sieve of Eratosthenes is simple and efficient in theory, its practical performance on modern central processing units is often limited by memory access inefficiencies. This paper introduces a cache-aware hybrid sieve that integrates segmentation, bit-packing, and cache-line-aligned block processing to optimize memory bandwidth and level one and level two cache locality. The proposed approach reduces memory usage […]
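
Two of the ingredients named in this abstract, segmentation and bit-packing, can be sketched in a few lines. This is a textbook-style illustration under stated assumptions, not the paper's implementation: the function names, the odd-only layout, and the `segment_odds` parameter are hypothetical, and cache-line alignment is not modeled because Python offers no control over it. Each segment uses one bit per odd candidate, so the working set stays a fixed size regardless of `n`.

```python
import math


def simple_sieve(limit):
    """Return all primes <= limit with a plain Sieve of Eratosthenes."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            # Clear every multiple of p starting at p*p.
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, f in enumerate(flags) if f]


def segmented_primes(n, segment_odds=1 << 15):
    """Primes <= n, sieving odd numbers in fixed-size, bit-packed segments."""
    if n < 2:
        return []
    primes = [2]
    # Odd base primes up to sqrt(n) are enough to cross out every composite.
    base = [p for p in simple_sieve(math.isqrt(n)) if p > 2]
    low = 3  # first odd candidate of the current segment
    while low <= n:
        high = min(low + 2 * segment_odds - 2, n)  # last candidate, inclusive
        count = (high - low) // 2 + 1  # odd numbers in [low, high]
        bits = bytearray((count + 7) // 8)  # bit i set => low + 2*i is composite
        for p in base:
            if p * p > high:
                break
            # First odd multiple of p that is >= max(p*p, low).
            start = max(p * p, (low + p - 1) // p * p)
            if start % 2 == 0:
                start += p
            for m in range(start, high + 1, 2 * p):
                i = (m - low) // 2
                bits[i >> 3] |= 1 << (i & 7)
        for i in range(count):
            if not bits[i >> 3] & (1 << (i & 7)):
                primes.append(low + 2 * i)
        low = high + 2 if high % 2 else high + 1
    return primes


if __name__ == "__main__":
    print(segmented_primes(50))
```

With the default `segment_odds` of 32768, each segment's bit array occupies 4 KiB, which is the kind of size one would tune against a level one cache; the actual block sizes used in the paper are not given in the excerpt above.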
