digitado – Page 38

Tricky$^2$: Towards a Benchmark for Evaluating Human and LLM Error Interactions

digitado ⋅ 28 January 2026

arXiv:2601.18949v1. Abstract: Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky$^2$, a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs […]

technocracy

Evaluating AI agents: Real-world lessons from building agentic systems at Amazon

digitado ⋅ 18 February 2026

The generative AI industry has undergone a significant transformation from using large language model (LLM)-driven applications to agentic AI systems, marking a fundamental shift in how AI capabilities are architected and deployed. While early generative AI applications primarily relied on LLMs to directly generate text and respond to prompts, the industry has evolved from those static, prompt-response paradigms toward autonomous agent frameworks to build dynamic, goal-oriented systems capable of tool orchestration, iterative problem-solving, and adaptive task execution in […]

technocracy

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

digitado ⋅ 22 April 2026

arXiv:2604.18724v1. Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how […]

technocracy

VIRTUE: Versatile Video Retrieval Through Unified Embeddings

digitado ⋅ 21 January 2026

arXiv:2601.12193v1. Abstract: Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video […]

technocracy

Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes

digitado ⋅ 24 April 2026

Batch processes are inherently transient and typically nonlinear, motivating nonlinear model predictive control (NMPC). However, adopting NMPC is hindered by the cost and unavailability of dynamic models. Thus, we propose to use Gaussian Processes (GP) in a model-learning NMPC scheme (GP-MLMPC) for batch processes. We initialize the GP-MLMPC using data from a single initial trajectory, e.g., from a PI controller. We iteratively apply the NMPC embedded with GPs to run batches and update the GP with new observations […]

technocracy

When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

digitado ⋅ 24 March 2026

arXiv:2603.20324v1. Abstract: Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck, a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In […]

technocracy

Quiz: How to Read User Input From the Keyboard in Python

digitado ⋅ 4 June 2026

In this quiz, you’ll test your understanding of How to Read User Input From the Keyboard in Python. By working through this quiz, you’ll revisit the input() function, type conversion, error handling with try and except, the getpass module for hidden input, and the PyInputPlus library for automatic validation.
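The topics listed for the quiz can be illustrated with a small sketch. It is not taken from the quiz itself: the helper names `read_int` and `read_secret`, the `reader` parameter, and the retry limit are all hypothetical, chosen only to show `input()`, type conversion with `int()`, error handling with `try`/`except`, and `getpass.getpass()` working together using the standard library alone.

```python
import getpass
from typing import Callable


def read_int(
    prompt: str,
    reader: Callable[[str], str] = input,  # hypothetical hook so the function can be tested
    max_attempts: int = 3,                 # assumed retry limit, not from the source
) -> int:
    """Ask for a whole number, retrying when the text cannot be converted."""
    for _ in range(max_attempts):
        raw = reader(prompt)          # input() always returns a string
        try:
            return int(raw.strip())   # type conversion may raise ValueError
        except ValueError:
            print(f"{raw!r} is not a whole number; please try again.")
    raise ValueError(f"no valid integer after {max_attempts} attempts")


def read_secret(
    prompt: str = "Password: ",
    reader: Callable[[str], str] = getpass.getpass,
) -> str:
    """Read a value without echoing it to the terminal."""
    return reader(prompt)
```

In an interactive session, `age = read_int("Your age: ")` would prompt at the keyboard and `read_secret()` would hide the typed characters. Passing a custom `reader` keeps both helpers testable without a terminal.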

technocracy

Less is more: Not all samples are effective for evaluation

digitado ⋅ 8 January 2026

arXiv:2601.03272v1. Abstract: The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model […]

technocracy

Machine Learning for Energy-Performance-aware Scheduling

digitado ⋅ 30 January 2026

In the post-Dennard era, optimizing embedded systems requires navigating complex trade-offs between energy efficiency and latency. Traditional heuristic tuning is often inefficient in such high-dimensional, non-smooth landscapes. In this work, we propose a Bayesian Optimization framework using Gaussian Processes to automate the search for optimal scheduling configurations on heterogeneous multi-core architectures. We explicitly address the multi-objective nature of the problem by approximating the Pareto Frontier between energy and time. Furthermore, by incorporating Sensitivity Analysis (fANOVA) and comparing different […]
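To make the notion of a Pareto frontier between energy and time concrete, here is a minimal sketch of non-dominated filtering. It is not the paper's Bayesian Optimization method: the function name `pareto_front` and the sample measurements are hypothetical, and the sketch assumes both objectives are to be minimized.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (energy, time), both assumed to be minimized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point q dominates p when q is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (e_i, t_i) in enumerate(points):
        dominated = any(
            (e_j <= e_i and t_j <= t_i) and (e_j < e_i or t_j < t_i)
            for j, (e_j, t_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((e_i, t_i))
    return sorted(front)


# Hypothetical measurements for five scheduling configurations.
configs = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 2.0)]
print(pareto_front(configs))  # [(1.0, 9.0), (2.0, 5.0), (4.0, 2.0)]
```

This brute-force filter compares every pair of points, which is fine for a handful of configurations. A search-based approach such as the one described would instead aim to approximate the frontier without evaluating every configuration.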

technocracy

Detoxification of large language models via regularized fine-tuning

digitado ⋅ 21 November 2024

Attribute-controlled fine-tuning can produce LLMs that adhere to policy while achieving competitive performance on general benchmarks.

Conversational AI ⋅ Charith Peris ⋅ November 21, 03:40 PM

Large language models (LLMs) have demonstrated impressive performance across a variety of tasks, but, as has been clear in multiple instances, they carry the risk of producing inappropriate, unsafe, or biased outputs. When generating responses, a successfully trained LLM should comply with […]
