digitado – Page 448

Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment

digitado ⋅ 3 de March de 2026

arXiv:2603.00649v1 Announce Type: new Abstract: Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) often relies on opaque predictive models that treat each patch as novel, repeatedly re-assessing semantically redundant patches. Our analysis of a large corpus of tool-generated patches reveals a duality: about 39% of […]

Ver mais

Like 0

Liked Liked

technocracy

hdlib 2.0: Extending Machine Learning Capabilities of Vector-Symbolic Architectures

digitado ⋅ 7 de January de 2026

arXiv:2601.02509v1 Announce Type: new Abstract: Following the initial publication of hdlib, a Python library for designing Vector-Symbolic Architectures (VSA), we introduce a major extension that significantly enhances its machine learning capabilities. VSA, also known as Hyperdimensional Computing, is a computing paradigm that represents and processes information using high-dimensional vectors. While the first version of hdlib established a robust foundation for creating and manipulating these vectors, this update addresses the growing need for more advanced, data-driven modeling within the […]

Ver mais

Like 0

Liked Liked

technocracy

UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

digitado ⋅ 11 de February de 2026

arXiv:2602.09130v1 Announce Type: new Abstract: Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques […]

Ver mais

Like 0

Liked Liked

technocracy

Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

digitado ⋅ 26 de January de 2026

arXiv:2601.16333v1 Announce Type: new Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new […]

Ver mais

Like 0

Liked Liked

technocracy

Scenario theory for multi-criteria data-driven decision making

digitado ⋅ 2 de April de 2026

arXiv:2604.00553v1 Announce Type: new Abstract: The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications – including multi-agent decision problems – require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a […]

Ver mais

Like 0

Liked Liked

technocracy

Opportunistic Scheduling for Optimal Spot Instance Savings in the Cloud

digitado ⋅ 21 de January de 2026

arXiv:2601.12266v1 Announce Type: new Abstract: We study the problem of scheduling delay-sensitive jobs over spot and on-demand cloud instances to minimize average cost while meeting an average delay constraint. Jobs arrive as a general stochastic process, and incur different costs based on the instance type. This work provides the first analytical treatment of this problem using tools from queuing theory, stochastic processes, and optimization. We derive cost expressions for general policies, prove queue length one is optimal for […]

Ver mais

Like 0

Liked Liked

technocracy

SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?

digitado ⋅ 6 de February de 2026

arXiv:2505.20295v4 Announce Type: replace-cross Abstract: The common approach to communicate a large language model’s (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how […]

Ver mais

Like 0

Liked Liked

technocracy

What Makes Looped Transformers Perform Better Than Non-Recursive Ones

digitado ⋅ 7 de January de 2026

arXiv:2510.10089v3 Announce Type: replace-cross Abstract: While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the mechanism for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on […]

Ver mais

Like 0

Liked Liked

technocracy

Blind-Spot Mass: A Good-Turing Framework for Quantifying Deployment Coverage Risk in Machine Learning Systems

digitado ⋅ 6 de April de 2026

Blind-spot mass is a Good-Turing framework for quantifying deployment coverage risk in machine learning. In modern ML systems, operational state distributions are often heavy-tailed, implying that a long tail of valid but rare states is structurally under-supported in finite training and evaluation data. This creates a form of ‘coverage blindness’: models can appear accurate on standard test sets yet remain unreliable across large regions of the deployment state space. We propose blind-spot mass B_n(tau), a deployment metric estimating […]

Ver mais

Like 0

Liked Liked

technocracy

Ensemble of radiomics and ConvNeXt for breast cancer diagnosis

digitado ⋅ 12 de January de 2026

arXiv:2601.05373v1 Announce Type: new Abstract: Early diagnosis of breast cancer is crucial for improving survival rates. Radiomics and deep learning (DL) have shown significant potential in assisting radiologists with early cancer detection. This paper aims to critically assess the performance of radiomics, DL, and ensemble techniques in detecting cancer from screening mammograms. Two independent datasets were used: the RSNA 2023 Breast Cancer Detection Challenge (11,913 patients) and a Mexican cohort from the TecSalud dataset (19,400 patients). The ConvNeXtV1-small […]

Ver mais

Like 0

Liked Liked