February 2026

Learning Markov Decision Processes under Fully Bandit Feedback

digitado ⋅ February 2, 2026

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $\Theta(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single \emph{aggregate} reward. In this paper, we consider […]


technocracy

Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

digitado ⋅ February 2, 2026

The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT […]


technocracy

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

digitado ⋅ February 2, 2026

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling […]
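To make the idea of bounded policy staleness concrete, here is a minimal sketch in Python. It is not ECHO-2's implementation: the `Rollout` class, the `within_staleness` and `filter_rollouts` helpers, and the `max_staleness` parameter are all hypothetical names chosen for illustration. The assumption is simply that each rollout records the policy version that produced it, and that the learner only trains on rollouts whose version lag stays within a user-chosen bound.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """A rollout tagged with the policy version that generated it (hypothetical)."""
    policy_version: int
    tokens: list = field(default_factory=list)


def within_staleness(rollout: Rollout, learner_version: int, max_staleness: int) -> bool:
    """True if the rollout lags the learner by at most max_staleness versions."""
    lag = learner_version - rollout.policy_version
    return 0 <= lag <= max_staleness


def filter_rollouts(rollouts, learner_version, max_staleness):
    """Keep only rollouts that are fresh enough to train on."""
    return [r for r in rollouts if within_staleness(r, learner_version, max_staleness)]


# Remote workers may still be running older policy snapshots.
incoming = [Rollout(policy_version=v) for v in (7, 9, 10, 4)]
usable = filter_rollouts(incoming, learner_version=10, max_staleness=2)
usable_versions = [r.policy_version for r in usable]
```

A larger `max_staleness` tolerates slower policy dissemination at the cost of training on more off-policy data; a value of 0 would require every rollout to come from the current policy.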


technocracy

Narwhals become quieter as the Arctic Ocean grows louder

digitado ⋅ February 2, 2026

For most of their evolutionary history, narwhals have relied more on sound than sight to survive in the Arctic’s dark icy waters. The speckled toothed whales, sometimes referred to as “unicorns of the sea” for the long, spiral tusks that protrude from the heads of males, navigate, hunt, and communicate using echolocation. By emitting a series of calls, whistles, and high-frequency clicks (as many as a thousand per second) and listening for the echoes that bounce back, they are able to […]


technocracy

Learning Beyond the Gaussian Data: Learning Dynamics of Neural Networks on an Expressive and Cumulant-Controllable Data Model

digitado ⋅ February 2, 2026

We study the effect of high-order statistics of data on the learning dynamics of neural networks (NNs) by using a moment-controllable non-Gaussian data model. Considering the expressivity of two-layer neural networks, we first construct the data model as a generative two-layer NN where the activation function is expanded by using Hermite polynomials. This allows us to achieve interpretable control over high-order cumulants such as skewness and kurtosis through the Hermite coefficients while keeping the data model realistic. Using […]


technocracy

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

digitado ⋅ February 2, 2026

Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree-structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high-entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of […]
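The majority-voting step mentioned above can be sketched in a few lines of Python. This is only an illustration of pseudo-labeling by majority vote, not the ECHO method itself; the function names `majority_vote_pseudo_label` and `pseudo_rewards`, and the binary reward scheme, are assumptions made for the example.

```python
from collections import Counter


def majority_vote_pseudo_label(answers):
    """Return (pseudo_label, agreement) for a non-empty list of final answers."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    # most_common(1) returns the answer with the highest vote count.
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def pseudo_rewards(answers, label):
    """Assumed scheme: reward 1.0 for rollouts that match the pseudo-label, else 0.0."""
    return [1.0 if a == label else 0.0 for a in answers]


# Final answers extracted from five rollouts of the same prompt.
rollout_answers = ["42", "41", "42", "42", "7"]
label, agreement = majority_vote_pseudo_label(rollout_answers)
rewards = pseudo_rewards(rollout_answers, label)
```

Here three of the five rollouts agree on "42", so that answer becomes the pseudo-label and those three rollouts receive the positive reward. The agreement ratio gives a rough signal of how trustworthy the pseudo-label is.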


technocracy

Learning Generative Selection for Best-of-N

digitado ⋅ February 2, 2026

Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, […]
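The filtering step described above, keeping only instances that have both correct and incorrect candidate solutions, might look roughly like the following Python sketch. The data layout (a `candidates` list of dicts with a boolean `correct` field) and the helper names are hypothetical; the abstract does not specify how the datasets are structured.

```python
def has_mixed_candidates(flags):
    """True if at least one candidate is correct and at least one is incorrect."""
    return any(flags) and not all(flags)


def build_selection_tasks(instances):
    """Keep instances where selecting among candidates is a non-trivial task."""
    tasks = []
    for inst in instances:
        flags = [cand["correct"] for cand in inst["candidates"]]
        if has_mixed_candidates(flags):
            tasks.append(inst)
    return tasks


# Toy instances; the field names are assumptions for illustration only.
instances = [
    {"id": "p1", "candidates": [{"correct": True}, {"correct": False}]},
    {"id": "p2", "candidates": [{"correct": True}, {"correct": True}]},
    {"id": "p3", "candidates": [{"correct": False}, {"correct": False}]},
]
tasks = build_selection_tasks(instances)
task_ids = [t["id"] for t in tasks]
```

Instances where every candidate is correct, or every candidate is wrong, carry no selection signal: any choice is equally right or equally wrong, so only mixed instances are kept.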


technocracy

EvoMU: Evolutionary Machine Unlearning

digitado ⋅ February 2, 2026

Machine unlearning aims to unlearn specified training data (e.g., sensitive or copyrighted material). A prominent approach is to fine-tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well […]


technocracy

A decade of NFL Next Gen Stats innovation

digitado ⋅ February 2, 2026

Every snap in the NFL triggers a deluge of physical data. Twenty-two players accelerate, collide, and change direction in fractions of a second, while the ball traces a path through the controlled chaos. Yet for most of the sport’s history, much of that complexity went unmeasured. “Football, for 100-plus years, has been a box score game: you’ve got yards, you’ve got touchdowns, you’ve got tackles … ,” says Mike Band, senior manager of research and analytics with NFL’s […]
