technocracy

Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe

digitado ⋅ January 15, 2026

arXiv:2601.08847v1. Abstract: Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion – generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without […]

Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science

digitado ⋅ February 3, 2026

Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences that can be efficiently learned, generated, and generalized to new tasks. These have become foundational for image and video generation and, more recently, physical simulation. As existing tokenizers are designed for the explicit requirements of realistic visual perception of images, it is necessary to ask whether these approaches are optimal for scientific images, which exhibit a large dynamic range and require […]

mlr3mbo: Bayesian Optimization in R

digitado ⋅ April 1, 2026

arXiv:2603.29730v1. Abstract: We present mlr3mbo, a comprehensive and modular toolbox for Bayesian optimization in R. mlr3mbo supports single- and multi-objective optimization, multi-point proposals, batch and asynchronous parallelization, input and output transformations, and robust error handling. While it can be used for many standard Bayesian optimization variants in applied settings, researchers can also construct custom BO algorithms from its flexible building blocks. In addition to an introduction to the software, its design principles, and its building […]

General Motors writes down $6 billion as domestic EV sales plans change

digitado ⋅ January 9, 2026

American automakers who got overenthusiastic about electric vehicles continue to pay the price—literally. Yesterday, General Motors told investors that building and selling fewer EVs will cost the company $6 billion. Still, things could be worse—last month, rival Ford said it would write down $19.5 billion as a result of its failed EV bet. GM is not actually abandoning its EV portfolio, even as it reduces shifts at some plants and repurposes others—like the one in Orion, Michigan—into assembling […]

Hypergraph Samplers: Typical and Worst Case Behavior

digitado ⋅ January 29, 2026

arXiv:2601.20039v1. Abstract: We study the utility and limitations of using $k$-uniform hypergraphs $H = ([n], E)$ ($n \ge \mathrm{poly}(k)$) in the context of error reduction for randomized algorithms for decision problems with one- or two-sided error. Our error reduction idea is sampling a uniformly random hyperedge of $H$, and repeating the algorithm $k$ times using the hyperedge vertices as seeds. This is a general paradigm, which captures every pseudorandom method generating $k$ seeds without repetition. […]

Research roundup: 6 cool science stories we almost missed

digitado ⋅ May 2, 2026

It’s a regrettable reality that there is never enough time to cover all the interesting scientific stories we come across. So every month, we highlight a handful of the best stories that nearly slipped through the cracks. April’s list includes tracking Roman ship repairs, the discovery that mushrooms can detect human urine, crushing soda cans for science, and the physics of why dolphins can swim so fast.

Physics of why dolphins swim so fast

Dolphins are very good […]

MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation

digitado ⋅ February 6, 2026

arXiv:2602.05048v1. Abstract: Joint planning through language-based interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns (e.g., objects involved, human goals/intents), thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason […]

4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

digitado ⋅ April 16, 2026

arXiv:2604.13244v1. Abstract: The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons […]

Why AI Agent Reliability Depends More on the Harness Than the Model

digitado ⋅ February 25, 2026

I keep hearing the same question at every engineering offsite, Slack thread, and investor pitch: “What’s the best model right now — GPT, Claude, or Gemini?” I spent the last several months building and debugging agent-based systems, and I think this is the wrong question entirely. The evidence is now overwhelming: what determines whether an AI agent succeeds in production is not the model underneath it, but the infrastructure wrapped around it. I am going to lay out my hypothesis, test […]

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

digitado ⋅ April 9, 2026

Despite significant advances in Large Reasoning Models (LRMs) driven by reinforcement learning with verifiable rewards (RLVR), this paradigm is fundamentally limited in specialized or novel domains where such supervision is prohibitively expensive or unavailable, posing a key challenge for test-time adaptation. While existing test-time methods offer a potential solution, they are constrained by learning from static query sets, risking overfitting to textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables […]
