technocracy

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

digitado ⋅ February 17, 2026

arXiv:2602.13214v1. Abstract: Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable […]
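The "quadratic computational costs" of LLM-vs-LLM tournaments can be made concrete with a little arithmetic. The sketch below compares the number of pairings in a full round-robin with the number needed if each model instead plays a fixed set of anchor opponents. The anchored scheme and the anchor count of 5 are assumptions for illustration; the truncated abstract does not describe the benchmark's actual protocol.

```python
def round_robin_matchups(num_models: int) -> int:
    """Pairings when every model plays every other model once: n(n-1)/2."""
    return num_models * (num_models - 1) // 2


def anchored_matchups(num_models: int, num_anchors: int) -> int:
    """Pairings when every model plays only a fixed set of anchors: n*k.

    Assumption: anchors are fixed opponents shared by all evaluated models.
    """
    return num_models * num_anchors


if __name__ == "__main__":
    # num_anchors=5 is a hypothetical value chosen for illustration.
    for n in (10, 50, 100):
        print(n, round_robin_matchups(n), anchored_matchups(n, num_anchors=5))
```

Under these assumptions the round-robin count grows with the square of the pool size, while the anchored count grows linearly, and adding a new model does not require replaying existing ones.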

Beyond Correctness: Learning Robust Reasoning via Transfer

digitado ⋅ February 9, 2026

Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it, and we treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), […]

How I Built a Fail-Safe Legal AI Engine for Singapore Laws Using Triple-Model RAG

digitado ⋅ February 16, 2026

This open-source, high-precision RAG engine answers questions about Singaporean laws and policies, with a triple-AI failover backend (Gemini/Llama/Groq). It is built with Python and FAISS for semantic search and pairs industrial-grade reliability with an Apple-inspired UI for fast legal research.
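A failover backend of this kind can be sketched as an ordered list of providers that are tried in turn until one succeeds. This is a minimal illustration and not the project's code: `answer_with_failover`, `flaky_primary`, and `working_secondary` are hypothetical names, and the stand-in functions take the place of real Gemini/Llama/Groq client calls.

```python
from typing import Callable, Sequence, Tuple

# A provider takes a prompt and returns generated text (hypothetical interface).
Provider = Callable[[str], str]


def answer_with_failover(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure triggers fallback to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real model clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def working_secondary(prompt: str) -> str:
    return f"[secondary] {prompt}"


if __name__ == "__main__":
    used, text = answer_with_failover(
        "Summarise the retrieved statute excerpt.",
        [("primary", flaky_primary), ("secondary", working_secondary)],
    )
    print(used, text)
```

Collecting the per-provider errors before raising keeps the final failure message useful when every backend is down.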

Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

digitado ⋅ January 30, 2026

We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce […]

Semantics in Actuation Systems: From Age of Actuation to Age of Actuated Information

digitado ⋅ January 23, 2026

arXiv:2601.15496v1. Abstract: In this paper, we study the timeliness of actions in communication systems where actuation is constrained by control permissions or energy availability. Building on the Age of Actuation (AoA) metric, which quantifies the timeliness of actions independently of data freshness, we introduce a new metric, the Age of Actuated Information (AoAI). AoAI captures the end-to-end timeliness of actions by explicitly accounting for the age of the data packet at the moment it is […]

Evaluating Generalization Mechanisms in Autonomous Cyber Attack Agents

digitado ⋅ March 12, 2026

arXiv:2603.10041v1. Abstract: Autonomous offensive agents often fail to transfer beyond the networks on which they are trained. We isolate a minimal but fundamental shift (unseen host/subnet IP reassignment in an otherwise fixed enterprise scenario) and evaluate attacker generalization in the NetSecGame environment. Agents are trained on five IP-range variants and tested on a sixth unseen variant; only the meta-learning agent may adapt at test time. We compare three agent families (traditional RL, adaptation […]

Ryzen 9850X3D review: AMD’s bragging-rights gaming CPU gets more to brag about

digitado ⋅ January 28, 2026

AMD has released three distinct generations of its 3D V-Cache technology, which initially appeared in the Ryzen 7 5800X3D in 2022. The kernel of the idea has remained the same throughout AMD’s efforts: take an existing desktop processor design and graft 64MB of additional L3 cache onto it. This approach disproportionately helps apps that benefit from more cache, particularly games, and the size of the boost that 3D V-Cache gives to game performance has always been enough to […]

New AWS tool recommends removal of unused permissions

digitado ⋅ December 19, 2024

IAM Access Analyzer feature uses automated reasoning to recommend policies that remove unused accesses, helping customers achieve least privilege. By Loris D’Antoni and Chungha Sung. AWS Identity and Access Management (IAM) policies provide customers with fine-grained control over who has access to what resources in the Amazon Web Services (AWS) Cloud. This control helps customers enforce the principle of least privilege by […]

When would you prefer DMPO over SAC for continuous control if real-world deployment is not the issue?

digitado ⋅ May 20, 2026

Hi everyone, I have been reading about Distributional Maximum a Posteriori Policy Optimization (DMPO), especially in the context of the DeepMind bipedal robot soccer paper, and I am trying to understand when one would practically prefer it over SAC. My current understanding is: SAC is a strong off-policy continuous-control baseline. It directly optimizes the actor using an entropy-regularized objective. It is widely implemented, easier to find baselines for, and generally very strong in simulation. On the other hand, […]
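For reference, the entropy-regularized objective mentioned for SAC is usually written in the following standard maximum-entropy form. The symbols are the conventional ones and are not taken from the post: $\alpha$ is the temperature weighting the entropy bonus, $\gamma$ the discount factor, and $\mathcal{H}$ the policy entropy.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big]
```

Setting $\alpha = 0$ recovers the ordinary discounted-return objective, which is one way to see what the entropy term adds.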

A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing

digitado ⋅ April 25, 2026

In this tutorial, we explore kvcached, a dynamic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models through an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments where we simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. Through systematic measurement and visualization, we directly […]
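The contrast between static and elastic allocation under bursty load can be illustrated with a toy calculation that does not use kvcached or vLLM at all. The sketch assumes two models sharing one GPU, with hypothetical per-step KV-cache demand traces in GB: static allocation reserves each model's own peak for the whole run, while elastic allocation only needs the largest combined demand at any single step.

```python
from typing import Dict, List


def static_reservation(traces: Dict[str, List[int]]) -> int:
    """Memory needed if each model permanently reserves its own peak demand."""
    return sum(max(trace) for trace in traces.values())


def elastic_peak(traces: Dict[str, List[int]]) -> int:
    """Memory needed if models share a pool sized for the worst combined step."""
    return max(sum(step) for step in zip(*traces.values()))


# Hypothetical demand traces (GB per time step); the bursts do not overlap.
traces = {
    "model_a": [2, 10, 2, 2, 2, 2],
    "model_b": [2, 2, 2, 2, 10, 2],
}

if __name__ == "__main__":
    print("static reservation:", static_reservation(traces))  # 10 + 10 = 20
    print("elastic peak:", elastic_peak(traces))              # max step sum = 12
```

The gap between the two numbers shrinks as bursts overlap more; if both models peak at the same step, the elastic pool needs as much memory as the static reservation.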
