Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
arXiv:2603.01514v1 Announce Type: cross
Abstract: We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression, and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit, the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection […]
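
To make the setup concrete, here is a minimal sketch in JAX of a single softmax self-attention layer trained by preconditioned gradient descent on in-context linear regression prompts. This is an illustrative reconstruction, not the paper's construction: the token layout [x_i; y_i], the value read-out vector v, the hyperparameters, and the diagonal RMS-style preconditioner (standing in for whatever preconditioning the paper uses) are all assumptions.

```python
# Sketch: one softmax self-attention layer trained with preconditioned
# gradient descent on in-context linear regression. All design choices
# below are illustrative assumptions, not taken from the paper.
import jax
import jax.numpy as jnp

d, n_ctx = 3, 20          # input dimension, number of in-context examples
key = jax.random.PRNGKey(0)

def sample_task(key):
    """One regression prompt: context pairs (x_i, y_i = w^T x_i) and a query."""
    kw, kx, kq = jax.random.split(key, 3)
    w = jax.random.normal(kw, (d,))
    X = jax.random.normal(kx, (n_ctx, d))
    x_q = jax.random.normal(kq, (d,))
    return X, X @ w, x_q, x_q @ w

def attention_predict(params, X, y, x_q):
    """Single softmax attention layer over tokens [x_i; y_i].

    The query token is [x_q; 0]; the prediction is a softmax-weighted
    combination of the context tokens, read out through the value vector v.
    """
    W, v = params["W"], params["v"]
    E = jnp.concatenate([X, y[:, None]], axis=1)   # (n_ctx, d+1) context tokens
    e_q = jnp.concatenate([x_q, jnp.zeros(1)])     # (d+1,) query token
    attn = jax.nn.softmax(E @ (W @ e_q))           # softmax attention weights
    return attn @ (E @ v)                          # scalar prediction

def loss(params, key):
    """Mean squared error over a fresh batch of random regression tasks."""
    keys = jax.random.split(key, 256)
    X, y, x_q, y_q = jax.vmap(sample_task)(keys)
    preds = jax.vmap(attention_predict, in_axes=(None, 0, 0, 0))(params, X, y, x_q)
    return jnp.mean((preds - y_q) ** 2)

k1, k2, key = jax.random.split(key, 3)
params = {"W": 0.1 * jax.random.normal(k1, (d + 1, d + 1)),
          "v": 0.1 * jax.random.normal(k2, (d + 1,))}

# Preconditioned gradient descent: a diagonal RMS-style preconditioner is
# used here as a stand-in for the paper's preconditioning scheme.
lr, eps = 0.05, 1e-8
grad_fn = jax.jit(jax.grad(loss))
second_moment = jax.tree_util.tree_map(jnp.zeros_like, params)

for step in range(500):
    key, sub = jax.random.split(key)
    g = grad_fn(params, sub)
    second_moment = jax.tree_util.tree_map(
        lambda m, gi: 0.9 * m + 0.1 * gi ** 2, second_moment, g)
    params = jax.tree_util.tree_map(
        lambda p, gi, m: p - lr * gi / jnp.sqrt(m + eps),
        params, g, second_moment)
    if step % 100 == 0:
        key, sub = jax.random.split(key)
        print(f"step {step:4d}  loss {loss(params, sub):.4f}")
```

Under these assumptions, the sketch mirrors the abstract's setting only loosely; in particular, the geometric convergence rate the paper proves applies to its specific preconditioned algorithm, not necessarily to the RMS-style update shown here.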