March 2026

Discounted Beta–Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

digitado ⋅ March 19, 2026

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples […]
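The excerpt is cut off before the method is described, so the following is only a generic sketch of what a discounted Beta-Bernoulli estimate of a binary (pass/fail) reward can look like, not the paper's algorithm. The class name `DiscountedBeta`, the uniform prior, and the discount value `gamma=0.9` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiscountedBeta:
    """Beta posterior over a prompt's success probability (illustrative only)."""
    alpha: float = 1.0   # pseudo-count of successes; Beta(1, 1) prior is an assumption
    beta: float = 1.0    # pseudo-count of failures
    gamma: float = 0.9   # hypothetical discount applied to old evidence

    def update(self, rewards):
        """Fold in a group of 0/1 rollout rewards, down-weighting older counts."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.gamma * self.alpha + successes
        self.beta = self.gamma * self.beta + failures

    @property
    def mean(self):
        """Posterior mean of the success probability."""
        return self.alpha / (self.alpha + self.beta)


rollouts = [1, 1, 1, 1]                      # every rollout in the group passed
point_estimate = sum(rollouts) / len(rollouts)

est = DiscountedBeta()
est.update(rollouts)
print(point_estimate, est.mean)
```

With four identical rewards the plain group mean is exactly 1, so every rollout looks average and carries no learning signal, while the posterior mean stays strictly inside (0, 1) and keeps contributing information across training steps.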

technocracy

A lesser-known characterization of the gamma function

digitado ⋅ March 19, 2026

The gamma function Γ(z) extends the factorial function from integers to complex numbers. (Technically, Γ(z + 1) extends factorial.) There are other ways to extend the factorial function, so what makes the gamma function the right choice? The most common answer is the Bohr-Mollerup theorem. This theorem says that if f: (0, ∞) → (0, ∞) satisfies (1) f(x + 1) = x f(x), (2) f(1) = 1, and (3) log f is convex, then f(x) = Γ(x). The theorem applies on […]
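The three Bohr-Mollerup conditions can be spot-checked numerically. The sketch below is not from the post: the helper names, the sample grid, and the competing extension `wiggly` (Γ multiplied by a 1-periodic factor) are illustrative choices, and the convexity test only checks midpoint convexity at a handful of points rather than proving anything.

```python
import math


def satisfies_recurrence(f, xs, tol=1e-9):
    """Check f(x + 1) = x * f(x) at each sample point."""
    return all(math.isclose(f(x + 1), x * f(x), rel_tol=tol) for x in xs)


def is_midpoint_log_convex(f, xs, h=0.25):
    """Check log f(x) <= average of log f(x - h) and log f(x + h) at each point."""
    return all(
        math.log(f(x)) <= 0.5 * (math.log(f(x - h)) + math.log(f(x + h))) + 1e-12
        for x in xs
    )


def wiggly(x):
    """Another positive extension of the factorial: Gamma times a 1-periodic factor."""
    return math.gamma(x) * (1.0 + 0.5 * math.sin(2.0 * math.pi * x))


xs = [0.5 + 0.25 * k for k in range(1, 20)]   # 0.75, 1.0, ..., 5.25

for name, f in [("gamma", math.gamma), ("wiggly", wiggly)]:
    print(name,
          satisfies_recurrence(f, xs),
          math.isclose(f(1.0), 1.0),
          is_midpoint_log_convex(f, xs))
```

Both functions satisfy the recurrence and the normalization f(1) = 1, but only Γ passes the log-convexity check on this grid, which is the condition that singles it out.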

Mathematical Foundations of Deep Learning

digitado ⋅ March 19, 2026

This draft book offers a comprehensive and rigorous treatment of the mathematical principles underlying modern deep learning. The book spans core theoretical topics, from the approximation capabilities of deep neural networks, the theory and algorithms of optimal control and reinforcement learning integrated with deep learning techniques, to contemporary generative models that drive today’s advances in artificial intelligence.

Statistical Mechanics of Reinforcement Learning

digitado ⋅ March 19, 2026

Hello, fellow learners! Are there established connections between certain RL algorithms and certain physical systems? For example, the Hopfield network (a type of recurrent neural network) is related to spin glasses in condensed matter physics. Are there similar types of connections for traditional RL algorithms such as Q-learning, SARSA, TD(λ), etc.? I have heard that the Hamilton-Jacobi equation in classical mechanics is a special case of the Hamilton-Jacobi-Bellman equation, but I’m curious about other connections. I’m primarily asking […]

Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally

digitado ⋅ March 19, 2026

Here’s a fascinating piece of research by Dan Woods, who managed to get a custom version of Qwen3.5-397B-A17B running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk. Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be […]
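To make the "subset of the weights" point concrete, here is a toy sketch of top-k expert routing. It is not Qwen's router or Dan Woods's code: the function names, the use of a softmax over the selected experts, and the scalar "experts" are all assumptions chosen to keep the example small.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)            # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four stand-in experts; a real model would use feed-forward blocks here.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]     # hypothetical router scores for one token

print(route_top_k(logits, 2))
print(moe_forward(3.0, experts, logits))
```

Only the experts chosen by the router are ever called for a given token, so the weights of the remaining experts do not need to be resident in memory for that step.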

SB3 question.

digitado ⋅ March 19, 2026

I am working on a Tron program for my CS class. I am using SB3 to train an RL bot for this project. I have to port the bot to the base Python library so my teacher does not need to install any dependencies. I have worked with SB3 a bit for testing, so I want to avoid a CNN or multi-input policy, as it seems to cause some complexity when porting to pure Python. My main […]

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

digitado ⋅ March 18, 2026

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where […]

Kagi Translate’s AI answers the question “What would horny Margaret Thatcher say?”

digitado ⋅ March 18, 2026

If you’ve been using the Internet for any length of time, you’ve probably used a tool like Google Translate to convert webpages or snippets of text to and from languages ranging from Uzbek to Esperanto. But what if you want to translate into more esoteric “languages” like “LinkedIn Speak,” “Gen Z slang,” or “horny Margaret Thatcher”? This week, many people across the Internet have been bemused to find that the AI-powered Kagi Translate can perform these and countless […]

Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

digitado ⋅ March 18, 2026

Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP-hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub-optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm […]

ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

digitado ⋅ March 18, 2026

Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly […]
