Agent Behavior Lab: a reproducible harness for factorial experiments on tool-using LLM agent behavior
I’ve open-sourced Agent Behavior Lab, a platform for running controlled, factorial experiments on tool-using LLM agents.
Design. An experiment is a Cartesian product of factors — model × tool (with renamed alias variants) × system persona × prior conversation history — run for N trials per cell. Each trial is scored by a configurable judge (deterministic keyword/tool-call detection or an LLM judge).
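To make the cell structure concrete, here is a minimal sketch of how a factorial grid like this could be enumerated. It is not the harness's actual code: the factor levels, `build_trials`, and `N_TRIALS` are hypothetical names chosen for illustration.

```python
from itertools import product

# Hypothetical factor levels, for illustration only (not the repo's seed data).
factors = {
    "model": ["model-a", "model-b"],
    "tool": ["delete_file", "delete_file_alias"],
    "persona": ["neutral", "cautious"],
    "history": ["none", "prior_refusal"],
}
N_TRIALS = 3  # trials per cell


def build_trials(factors, n_trials):
    """Expand the Cartesian product of factor levels into one record per trial."""
    names = list(factors)
    trials = []
    for levels in product(*(factors[name] for name in names)):
        cell = dict(zip(names, levels))
        for i in range(n_trials):
            trials.append({**cell, "trial": i})
    return trials


trials = build_trials(factors, N_TRIALS)
print(len(trials))  # 2 * 2 * 2 * 2 cells * 3 trials = 48
```

Each record would then be sent to the model and passed to the judge; the total trial count grows multiplicatively with every level added to a factor.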
Analysis. The harness reports safety failure rate (SFR) and severity-weighted failure rate with Wilson intervals, per-factor breakdowns, cross-factor SFR heatmaps, and effect sizes (alias/persona/history) alongside logistic-regression coefficients and odds ratios. Raw trials and CSV export are available for external analysis.
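For readers unfamiliar with the Wilson score interval, the sketch below shows the standard formula applied to a per-cell failure count. The function name and the choice of a 95% level are assumptions for illustration, not taken from the repo.

```python
import math


def wilson_interval(failures, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# A saturated cell: 0 failures in 20 trials still gets a non-degenerate upper bound.
low, high = wilson_interval(0, 20)
print(round(high, 3))  # 0.161
```

Unlike the normal-approximation interval, this one does not collapse to zero width when a cell has 0 or N failures, which is why it suits small per-cell trial counts.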
Reproducibility. The repo ships with versioned YAML seed data (models, tools, personas, histories, judges, and pre-built sessions) so the reference experiments are reproducible out of the box. The harness sends tool definitions to model APIs and records the calls a model attempts; it does not execute tools.
Repo: https://github.com/Null-Square/agent-behavior-lab
Interested in feedback on the statistical treatment (effect-size definitions, handling of saturated cells, multiple-comparison concerns) and on additional judge designs.
submitted by /u/IcyPop8985