March 2026

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

digitado ⋅ March 10, 2026

arXiv:2603.06610v1 Announce Type: new Abstract: Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In […]


technocracy

Valid Feature-Level Inference for Tabular Foundation Models via the Conditional Randomization Test

digitado ⋅ March 10, 2026

arXiv:2603.06609v1 Announce Type: new Abstract: Modern machine learning models are highly expressive but notoriously difficult to analyze statistically. In particular, while black-box predictors can achieve strong empirical performance, they rarely provide valid hypothesis tests or p-values for assessing whether individual features contain information about a target variable. This article presents a practical approach to feature-level hypothesis testing that combines the Conditional Randomization Test (CRT) with TabPFN, a probabilistic foundation model for tabular data. The resulting procedure yields finite-sample […]
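The general shape of a conditional randomization test can be sketched as follows. This is a minimal illustration, not the procedure from the paper: it does not use TabPFN, and the names `crt_p_value`, `gaussian_sampler`, and `abs_corr` are hypothetical. It assumes the conditional distribution of the tested feature given the others can be sampled exactly; here the features are taken to be independent standard normals, so the conditional equals the marginal.

```python
import numpy as np

def gaussian_sampler(X, j, rng):
    # Assumption: features are independent N(0, 1), so the conditional
    # distribution of column j given the other columns is just N(0, 1).
    return rng.standard_normal(X.shape[0])

def abs_corr(X, y, j):
    # Simple stand-in test statistic: absolute correlation of feature j with y.
    # Any statistic where larger means stronger evidence could be used instead.
    return abs(np.corrcoef(X[:, j], y)[0, 1])

def crt_p_value(X, y, j, sample_xj, statistic, n_draws=200, seed=0):
    """Conditional randomization test p-value for feature j."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(X, y, j)
    exceed = 0
    for _ in range(n_draws):
        X_null = X.copy()
        # Replace column j with a fresh draw from its conditional distribution.
        X_null[:, j] = sample_xj(X, j, rng)
        if statistic(X_null, y, j) >= t_obs:
            exceed += 1
    # The +1 terms count the observed statistic among the null draws.
    return (1 + exceed) / (1 + n_draws)
```

With data where `y` depends only on the first column, `crt_p_value(X, y, 0, gaussian_sampler, abs_corr)` should come out near the smallest attainable value, `1 / (n_draws + 1)`, while the p-value for an irrelevant column carries no such guarantee of being small.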


technocracy

Scaling Strategy, Not Compute: A Stand-Alone, Open-Source StarCraft II Benchmark for Accessible Reinforcement Learning Research

digitado ⋅ March 10, 2026

arXiv:2603.06608v1 Announce Type: new Abstract: The research community lacks a middle ground between StarCraft II's full game and its mini-games. The full game's sprawling state-action space renders reward signals sparse and noisy, but in mini-games simple agents saturate performance. This complexity gap hinders steady curriculum design and prevents researchers from experimenting with modern Reinforcement Learning algorithms in RTS environments under realistic compute budgets. To fill this gap, we present the Two-Bridge Map Suite, the first entry in an open-source […]


technocracy

Multi-Agent DRL for V2X Resource Allocation: Disentangling Challenges and Benchmarking Solutions

digitado ⋅ March 10, 2026

arXiv:2603.06607v1 Announce Type: new Abstract: Multi-agent deep reinforcement learning (DRL) has emerged as a promising approach for radio resource allocation (RRA) in cellular vehicle-to-everything (C-V2X) networks. However, the multifaceted challenges inherent to multi-agent reinforcement learning (MARL) – including non-stationarity, coordination difficulty, large action spaces, partial observability, and limited robustness and generalization – are often intertwined, making it difficult to understand their individual impact on performance in vehicular environments. Moreover, existing studies typically rely on different baseline MARL algorithms, […]


technocracy

LegoNet: Memory Footprint Reduction Through Block Weight Clustering

digitado ⋅ March 10, 2026

arXiv:2603.06606v1 Announce Type: new Abstract: As the need for neural network-based applications to become more accurate and powerful grows, so too does their size and memory footprint. With embedded devices, whose cache and RAM are limited, this growth hinders their ability to leverage state-of-the-art neural network architectures. In this work, we propose LegoNet, a compression technique that constructs blocks of weights of the entire model regardless of layer type and clusters these induced blocks. Using blocks instead of […]


technocracy

Structure-Aware Set Transformers: Temporal and Variable-Type Attention Biases for Asynchronous Clinical Time Series

digitado ⋅ March 10, 2026

arXiv:2603.06605v1 Announce Type: new Abstract: Electronic health records (EHR) are irregular, asynchronous multivariate time series. As time-series foundation models increasingly tokenize events rather than discretizing time, the input layout becomes a key design choice. Grids expose time$\times$variable structure but require imputation or missingness masks, risking error or sampling-policy shortcuts. Point-set tokenization avoids discretization but loses within-variable trajectories and time-local cross-variable context (Fig.1). We restore these priors in STructure-AwaRe (STAR) Set Transformer by adding parameter-efficient soft attention biases: a […]


technocracy

Know When You’re Wrong: Aligning Confidence with Correctness for LLM Error Detection

digitado ⋅ March 10, 2026

arXiv:2603.06604v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output anchor token probabilities: classification labels for structured tasks and self-evaluation responses (Yes/No) for open-ended generation. This enables direct detection of errors and hallucinations with minimal overhead and without external validation. We make three key contributions. First, we propose […]
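One plausible reading of a confidence score normalized over anchor tokens is sketched below. The truncated abstract does not give the formula, so this is an assumption: the helper `normalized_confidence` is hypothetical, and it simply renormalizes the model's probabilities over the anchor tokens alone (for example "Yes" and "No", or a set of class labels).

```python
import math

def normalized_confidence(anchor_logprobs):
    """Renormalize probabilities over the anchor tokens only.

    anchor_logprobs maps each anchor token (e.g. "Yes", "No", or a class
    label) to the log-probability the model assigned to it at the answer
    position. Returns a dict of scores that sum to 1.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(anchor_logprobs.values())
    weights = {tok: math.exp(lp - m) for tok, lp in anchor_logprobs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}
```

For instance, if the model puts probability 0.6 on "Yes" and 0.2 on "No" (with the rest spread over other tokens), the normalized score for "Yes" is 0.6 / (0.6 + 0.2) = 0.75.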


technocracy

Scale Dependent Data Duplication

digitado ⋅ March 10, 2026

arXiv:2603.06603v1 Announce Type: new Abstract: Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a “duplicate”: beyond surface-form matches, semantically equivalent documents (e.g. translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy […]


technocracy

Switchable Activation Networks

digitado ⋅ March 10, 2026

arXiv:2603.06601v1 Announce Type: new Abstract: Deep neural networks, and more recently large-scale generative models such as large language models (LLMs) and large vision-action models (LVAs), achieve remarkable performance across diverse domains, yet their prohibitive computational cost hinders deployment in resource-constrained environments. Existing efficiency techniques offer only partial remedies: dropout improves regularization during training but leaves inference unchanged, while pruning and low-rank factorization compress models post hoc into static forms with limited adaptability. Here we introduce SWAN (Switchable Activation […]


technocracy

FuzzingRL: Reinforcement Fuzz-Testing for Revealing VLM Failures

digitado ⋅ March 10, 2026

arXiv:2603.06600v1 Announce Type: new Abstract: Vision Language Models (VLMs) are prone to errors, and identifying where these errors occur is critical for ensuring the reliability and safety of AI systems. In this paper, we propose an approach that automatically generates questions designed to deliberately induce incorrect responses from VLMs, thereby revealing their vulnerabilities. The core of this approach lies in fuzz testing and reinforcement finetuning: we transform a single input query into a large set of diverse variants […]
