HB-Eval: From Benchmark to Reliability Operating System—A Five-Metric Framework with Triple-Methodology Validation, SIL/ASIL Certification, and Production-Grade Deployment

Background: Agentic AI systems are deployed in safety-critical domains where operational reliability under fault conditions determines patient safety, system integrity, and infrastructure continuity. Current evaluation paradigms measure only nominal task-completion capability and provide no mechanism for estimating the capability–reliability gap Δ(π) = C_nom(π) − R_op(π) that separates benchmark performance from operational performance.

Methods: We present HB-Eval OS, a five-metric Reliability Operating System comprising a secured evaluation Gateway, Evaluation-Driven Memory (EDM), and a production SDK (pip install hb-eval-sdk, v2.0.0) that integrates AES-256-GCM encryption and a Safe Halt protocol. Three fully independent validation methodologies were applied across 14,000 evaluations: Methodology A (6,000 behavioural trajectory experiments across six open-weight architectures and six safety-critical domains), Methodology B (4,998 three-layer constraint verification assessments across five frontier open-weight models), and Methodology C (3,002 evaluations of GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash, judged by an independent third-party model). A fifth diagnostic metric, the Consistency Stability Index (CSI), is introduced to quantify temporal performance stability across sequential runs.

Results: Methodologies A and B converge on an aggregate reliability near 36% (z = 0.653, p = 0.514, 95% CI ±1.80 pp), indicating that the deficit is not a methodological artefact. Methodology C establishes capability–reliability gaps of +7.6 pp, +10.6 pp, and +22.5 pp for GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash, respectively; in total, the three methodologies cover 14 architectures from five organisations. The Intentional Recovery Score (IRS) reveals that only 23% of recoveries are memory-guided; the remaining 77% degrade by 55 pp under distribution shift. Cascade fault injection imposes a 21.6 pp reliability penalty (z = 10.80, p < 0.001). A live Gemini API case study demonstrates a transition from UNSAFE (PEI = 0.67) to SAFE (PEI = 1.00) through a single prompt refinement guided by HB-Eval OS attribution.

Conclusions: No evaluated model qualifies for Tier 2 or Tier 3 SIL/ASIL certification. The 55 pp IRS distribution-shift divergence and the 21.6 pp cascade penalty identify specific, actionable architectural targets. Complete protocols, all 14,000 evaluation records, and the production SDK are released as open source.
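
As a small illustration of the quantities used in the abstract, the sketch below computes the capability–reliability gap Δ(π) = C_nom(π) − R_op(π) and converts a z-statistic into a two-sided p-value under a standard normal approximation. It is not taken from the HB-Eval SDK: the function names are hypothetical, and the nominal capability value of 0.90 is an assumed input chosen only for the example (the abstract reports reliability near 36% but no matching nominal figure).

```python
import math


def capability_reliability_gap(c_nom: float, r_op: float) -> float:
    """Return Delta(pi) = C_nom(pi) - R_op(pi); both inputs are fractions in [0, 1]."""
    return c_nom - r_op


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z-statistic, assuming a standard normal null.

    Uses the identity p = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# Evaluation counts reported for Methodologies A, B, and C.
total_evaluations = 6_000 + 4_998 + 3_002

# Hypothetical nominal capability (0.90) against the ~36% reliability in the abstract.
gap_pp = 100.0 * capability_reliability_gap(c_nom=0.90, r_op=0.36)

# z-statistics quoted in the abstract.
p_convergence = two_sided_p_from_z(0.653)   # A vs. B agreement test
p_cascade = two_sided_p_from_z(10.80)       # cascade fault-injection penalty

print(f"total evaluations: {total_evaluations}")
print(f"illustrative gap: {gap_pp:.1f} pp")
print(f"p (z = 0.653): {p_convergence:.3f}")
print(f"p (z = 10.80) < 0.001: {p_cascade < 0.001}")
```

Under these assumptions, the reported p = 0.514 for z = 0.653 is consistent with a two-sided test, and the three methodology counts sum to the stated 14,000 evaluations.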
