How to Correctly Report LLM-as-a-Judge Evaluations
arXiv:2511.21140v2 Announce Type: replace-cross

Abstract: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of LLM judgments induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and constructs confidence intervals accounting for uncertainty from both the test dataset and a human-evaluated calibration dataset, enabling statistically sound and practical LLM-based evaluation. Building on this framework, we introduce an […]
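To make the setting concrete, below is a minimal sketch of one standard plug-in correction for a binary pass/fail metric, in the spirit the abstract describes: the judge's sensitivity and specificity are estimated on a small human-labeled calibration set and used to de-bias the naive judge pass rate (a Rogan–Gladen-style inversion), with a paired bootstrap propagating uncertainty from both the test set and the calibration set. The function name `corrected_pass_rate` and the bootstrap interval are illustrative assumptions, not the paper's exact estimator or confidence-interval construction.

```python
import numpy as np

def corrected_pass_rate(judge_test, calib_judge, calib_human,
                        n_boot=2000, alpha=0.05, seed=0):
    """Plug-in (Rogan-Gladen-style) correction of an LLM-judge pass rate.

    judge_test  : 0/1 judge verdicts on the test set
    calib_judge : 0/1 judge verdicts on a human-labeled calibration set
    calib_human : 0/1 human labels on that same calibration set
    Returns (corrected rate, bootstrap CI) where the CI reflects sampling
    noise in both the test set and the calibration set.
    NOTE: illustrative sketch only; not the estimator from the paper.
    """
    rng = np.random.default_rng(seed)
    judge_test = np.asarray(judge_test)
    calib_judge = np.asarray(calib_judge)
    calib_human = np.asarray(calib_human)

    def estimate(j, cj, ch):
        p_obs = j.mean()                     # naive judge pass rate
        sens = cj[ch == 1].mean()            # P(judge=1 | human=1)
        spec = 1.0 - cj[ch == 0].mean()      # P(judge=0 | human=0)
        denom = sens + spec - 1.0
        if abs(denom) < 1e-8:                # judge is uninformative
            return np.nan
        # Invert p_obs = sens * p + (1 - spec) * (1 - p) for the true rate p
        return np.clip((p_obs + spec - 1.0) / denom, 0.0, 1.0)

    point = estimate(judge_test, calib_judge, calib_human)

    # Resample the test set and the calibration set independently so the
    # interval accounts for uncertainty in both data sources.
    n_t, n_c = len(judge_test), len(calib_judge)
    boots = []
    for _ in range(n_boot):
        it = rng.integers(0, n_t, n_t)
        ic = rng.integers(0, n_c, n_c)
        boots.append(estimate(judge_test[it], calib_judge[ic], calib_human[ic]))
    lo, hi = np.nanpercentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)


if __name__ == "__main__":
    # Synthetic demo: a judge with ~0.9 sensitivity and ~0.8 specificity
    # inflates a 60% true pass rate into a biased naive score.
    rng = np.random.default_rng(1)
    calib_human = rng.binomial(1, 0.6, size=200)
    calib_judge = np.where(calib_human == 1,
                           rng.binomial(1, 0.9, size=200),
                           rng.binomial(1, 0.2, size=200))
    judge_test = rng.binomial(1, 0.9 * 0.6 + 0.2 * 0.4, size=500)
    rate, ci = corrected_pass_rate(judge_test, calib_judge, calib_human)
    print(f"corrected pass rate = {rate:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The key design point this sketch illustrates is that the naive pass rate mixes true positives with false positives, so inverting the judge's (estimated) confusion behavior recovers an approximately unbiased rate, and any honest interval must widen to reflect that sensitivity and specificity are themselves estimated from a finite calibration set.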