Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

digitado ⋅ 15 April 2026

Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function […]


Online Neural Networks for Change-Point Detection

digitado ⋅ 10 March 2026

arXiv:2010.01388v2 Announce Type: replace-cross Abstract: Moments when a time series changes its behavior are called change points. The occurrence of a change point implies that the state of the system has changed, and detecting it promptly can help prevent unwanted consequences. In this paper, we present two change-point detection approaches based on neural networks and online learning. These algorithms have linear computational complexity and are suitable for change-point detection in large time series. We compare them with the best […]


RECOR: Reasoning-focused Multi-turn Conversational Retrieval Benchmark

digitado ⋅ 12 January 2026

arXiv:2601.05461v1 Announce Type: new Abstract: Existing benchmarks treat multi-turn conversation and reasoning-intensive retrieval separately, yet real-world information seeking requires both. To bridge this gap, we present a benchmark for reasoning-based conversational information retrieval comprising 707 conversations (2,971 turns) across eleven domains. To ensure quality, our Decomposition-and-Verification framework transforms complex queries into fact-grounded multi-turn dialogues through multi-level validation, where atomic facts are verified against sources and explicit retrieval reasoning is generated for each turn. Comprehensive evaluation reveals that combining […]


One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

digitado ⋅ 5 March 2026

arXiv:2603.03291v1 Announce Type: new Abstract: Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific styles and answer-order. We categorize […]


Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

digitado ⋅ 6 February 2026

arXiv:2602.05725v1 Announce Type: cross Abstract: Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked […]
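
The abstract says only that Muon updates matrix parameters via the matrix sign of the gradient. As a rough illustration of that one idea, the sketch below approximates the matrix sign (the orthogonal polar factor) with a plain cubic Newton-Schulz iteration and applies it as an update. The iteration, the omission of momentum, and the names `matrix_sign` and `muon_like_step` are assumptions made for this example, not details taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def matrix_sign(grad, steps=30):
    """Approximate the orthogonal polar factor U V^T of grad = U S V^T.

    Assumes grad is nonzero and full rank. Dividing by the Frobenius norm
    puts every singular value in (0, 1], where the cubic iteration
    X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
    """
    fro = sum(v * v for row in grad for v in row) ** 0.5
    x = [[v / fro for v in row] for row in grad]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def muon_like_step(weights, grad, lr=0.1):
    """One hypothetical update: W <- W - lr * sign(grad). No momentum."""
    sign = matrix_sign(grad)
    return [[w - lr * s for w, s in zip(w_row, s_row)]
            for w_row, s_row in zip(weights, sign)]


grad = [[3.0, 0.0], [0.0, -2.0]]
sign = matrix_sign(grad)
new_weights = muon_like_step([[0.0, 0.0], [0.0, 0.0]], grad)
```

In this toy case the gradient's singular values (3 and 2) differ, yet both directions receive a step of the same size. That equalization is the property the abstract contrasts with the imbalanced learning rates of gradient descent.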


I Walked Away From OpenAI’s New Codex /goal for 18 Hours — It Shipped 14 of 18 Features Solo

digitado ⋅ 4 May 2026

Last Tuesday at 4:47 PM I typed /goal ship the 18 features in BACKLOG.md before standup into OpenAI’s new Codex CLI 0.128.0, watched the… Continue reading on Towards AI »


Scalar Federated Learning for Linear Quadratic Regulator

digitado ⋅ 8 April 2026

arXiv:2604.05088v1 Announce Type: new Abstract: We propose ScalarFedLQR, a communication-efficient federated algorithm for model-free learning of a common policy in linear quadratic regulator (LQR) control of heterogeneous agents. The method builds on a decomposed projected gradient mechanism, in which each agent communicates only a scalar projection of a local zeroth-order gradient estimate. The server aggregates these scalar messages to reconstruct a global descent direction, reducing per-agent uplink communication from O(d) to O(1), independent of the policy dimension. Crucially, […]
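
The truncated abstract does not spell out how the scalar messages are formed or aggregated. One generic way to get O(1) uplink per agent is for each agent to project its gradient estimate onto a random direction that the server can regenerate from a shared seed. The sketch below shows that generic idea only. The seed-sharing scheme, the Gaussian directions, and the names `make_direction`, `agent_message`, and `server_reconstruct` are assumptions for this example and are not claimed to be ScalarFedLQR itself.

```python
import random


def make_direction(seed, dim):
    """Random Gaussian direction that agent and server both derive from a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def agent_message(local_grad, seed):
    """Uplink payload: one scalar, the projection of the local gradient estimate."""
    u = make_direction(seed, len(local_grad))
    return sum(g * ui for g, ui in zip(local_grad, u))


def server_reconstruct(scalars, seeds, dim):
    """Average scalar * direction over agents.

    Since E[u u^T] = I for standard Gaussian u, each term (g . u) u is an
    unbiased estimate of that agent's gradient g.
    """
    direction = [0.0] * dim
    for s, seed in zip(scalars, seeds):
        u = make_direction(seed, dim)
        for j in range(dim):
            direction[j] += s * u[j] / len(scalars)
    return direction


# Toy check with identical agents; real agents would hold different,
# zeroth-order gradient estimates.
grad = [1.0, -2.0, 0.5]
seeds = list(range(20000))
scalars = [agent_message(grad, s) for s in seeds]
estimate = server_reconstruct(scalars, seeds, dim=3)
```

Each agent sends a single float regardless of the policy dimension. The cost is variance in the reconstructed direction, which averaging over many agents or rounds reduces.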


Top 15 Software Development Trends to Watch in 2024

digitado ⋅ 26 December 2023

As we step into 2024, the landscape of software development continues to evolve rapidly, driven by technological innovations and changing market needs. For businesses and developers alike, staying abreast of these trends is not just beneficial; it is essential for remaining competitive and successful. In this article, we explore the key software development trends expected to make a significant impact in 2024.

General principles of software engineering in 2024

The general software development approaches set to define the IT industry […]


MePoly: Max Entropy Polynomial Policy Optimization

digitado ⋅ 23 February 2026

arXiv:2602.17832v1 Announce Type: new Abstract: Stochastic Optimal Control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum entropy reinforcement learning (RL) and imitation learning (IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Though diffusion-based policies are aimed at recovering the multi-modality, they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly […]


Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

digitado ⋅ 18 March 2026

arXiv:2603.15678v1 Announce Type: new Abstract: Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary, the spectral edge, between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits […]
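
The edge criterion in the abstract, the maximum ratio between consecutive singular values, can be illustrated on an already-computed spectrum. The sketch below assumes the singular values are given, positive, and sorted in descending order. The function name `spectral_edge` and the sample values are made up for illustration, and the rolling-window SVD itself is not shown.

```python
def spectral_edge(singular_values):
    """Return (k, ratio) maximizing sigma_k / sigma_{k+1}, with k 1-indexed.

    Assumes at least two positive singular values in descending order.
    Directions 1..k would then count as coherent, the rest as noise.
    """
    if len(singular_values) < 2:
        raise ValueError("need at least two singular values")
    best_k, best_ratio = 1, singular_values[0] / singular_values[1]
    for k in range(1, len(singular_values) - 1):
        ratio = singular_values[k] / singular_values[k + 1]
        if ratio > best_ratio:
            best_k, best_ratio = k + 1, ratio
    return best_k, best_ratio


# Hypothetical spectrum: three large values, then a sharp drop.
k, ratio = spectral_edge([10.0, 9.0, 8.0, 0.5, 0.4])
print(k, ratio)  # 3 16.0
```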
