January 2026

Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

digitado ⋅ January 29, 2026

arXiv:2601.19928v1 Announce Type: new Abstract: Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these […]

technocracy

Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

digitado ⋅ January 29, 2026

arXiv:2601.19927v1 Announce Type: new Abstract: Large Language Model (LLM)-based question answering (QA) systems play a critical role in modern AI, demonstrating strong performance across various tasks. However, LLM-generated responses often suffer from hallucinations: unfaithful statements lacking reliable references. Retrieval-Augmented Generation (RAG) frameworks enhance LLM responses by incorporating external references but also introduce new forms of hallucination due to complex interactions between the retriever and generator. To address these challenges, researchers have explored attribution-based techniques that ensure responses are […]

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

digitado ⋅ January 29, 2026

arXiv:2601.19926v1 Announce Type: new Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). […]

Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

digitado ⋅ January 29, 2026

arXiv:2601.19925v1 Announce Type: new Abstract: Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs’ potential in assisting scientific review, this study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5’s consistency and reliability in evaluating abstracts compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across […]

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

digitado ⋅ January 29, 2026

arXiv:2601.19924v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across […]

Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

digitado ⋅ January 29, 2026

arXiv:2601.19923v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised […]

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

digitado ⋅ January 29, 2026

arXiv:2601.19922v1 Announce Type: new Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair […]

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

digitado ⋅ January 29, 2026

arXiv:2601.19921v1 Announce Type: new Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of […]

PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator

digitado ⋅ January 29, 2026

arXiv:2601.19920v1 Announce Type: new Abstract: Binary Neural Networks (BNNs), where weights and activations are constrained to binary values (+1, -1), are a highly efficient alternative to traditional neural networks. Unfortunately, typical BNNs, while binarizing linear layers (matrix-vector multiplication), still implement other network layers (batch normalization, softmax, output layer, and sometimes the input layer of a convolutional neural network) in full precision. This limits the area and energy benefits and requires architectural support for full precision operations. We propose […]

FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

digitado ⋅ January 29, 2026

arXiv:2601.19919v1 Announce Type: new Abstract: Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on the student model effectively learning the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the dependence on the teacher model to improve […]
